jhorstmann commented on a change in pull request #810:
URL: https://github.com/apache/arrow-rs/pull/810#discussion_r742905920



##########
File path: arrow/src/array/data.rs
##########
@@ -559,6 +570,452 @@ impl ArrayData {
             )
         }
     }
+
+    /// "cheap" validation of an `ArrayData`. Ensures buffers are
+    /// sufficiently sized to store `len` + `offset` total elements of
+    /// `data_type` and performs other inexpensive consistency checks.
+    ///
+    /// This check is "cheap" in the sense that it does not validate the
+    /// contents of the buffers (e.g. that all offsets for UTF8 arrays
+    /// are within the bounds of the values buffer).
+    ///
+    /// TODO: add a validate_full that validates the offsets and valid utf8 
data
+    pub fn validate(&self) -> Result<()> {
+        // Need at least this mich space in each buffer
+        let len_plus_offset = self.len + self.offset;
+
+        // Check that the data layout conforms to the spec
+        let layout = layout(&self.data_type);
+
+        // Will validate Union when conforms to new spec:
+        // https://github.com/apache/arrow-rs/issues/85
+        if matches!(&self.data_type, DataType::Union(_)) {
+            return Ok(());
+        }
+        if self.buffers.len() != layout.buffers.len() {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "Expected {} buffers in array of type {:?}, got {}",
+                layout.buffers.len(),
+                self.data_type,
+                self.buffers.len(),
+            )));
+        }
+
+        for (i, (buffer, spec)) in
+            self.buffers.iter().zip(layout.buffers.iter()).enumerate()
+        {
+            match spec {
+                BufferSpec::FixedWidth { byte_width } => {
+                    let min_buffer_size = len_plus_offset
+                        .checked_mul(*byte_width)
+                        .expect("integer overflow computing min buffer size");
+
+                    if buffer.len() < min_buffer_size {
+                        return Err(ArrowError::InvalidArgumentError(format!(
+                            "Need at least {} bytes in buffers[{}] in array of 
type {:?}, but got {}",
+                            min_buffer_size, i, self.data_type, buffer.len()
+                        )));
+                    }
+                }
+                BufferSpec::VariableWidth => {
+                    // not cheap to validate (need to look at the
+                    // data). Partially checked in validate_offsets
+                    // called below. Can check with `validate_full`
+                }
+                BufferSpec::BitMap => {
+                    let min_buffer_size = bit_util::ceil(len_plus_offset, 8);
+                    if buffer.len() < min_buffer_size {
+                        return Err(ArrowError::InvalidArgumentError(format!(
+                            "Need at least {} bytes for bitmap in buffers[{}] 
in array of type {:?}, but got {}",
+                            min_buffer_size, i, self.data_type, buffer.len()
+                        )));
+                    }
+                }
+                BufferSpec::AlwaysNull => {
+                    // Nothing to validate
+                }
+            }
+        }
+
+        if self.null_count > self.len {

Review comment:
       I'm wondering what could happen if the `null_count` actually gives a 
different number than the validity buffer and whether this could lead to 
undefined behavior in some later operation. Most callers pass `None` anyway, so 
we calculate a number which is guaranteed to be in range. I would suggest to 
remove the `null_count` parameter from `try_new` to completely avoid of 
inconsistencies.
   
   Kernels that want to avoid the overhead of counting bits could use the 
unsafe `new_unchecked` method. I see it is currently set by the 
`cast_array_data` function. To make that one safe while avoiding bit counting, 
we could just validate that the `from` and `to` layouts are equal.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to