alamb commented on a change in pull request #810:
URL: https://github.com/apache/arrow-rs/pull/810#discussion_r735112883
##########
File path: arrow/src/array/array_list.rs
##########
@@ -552,9 +552,10 @@ mod tests {
assert!(!list_array.is_null(i));
}
- // Now test with a non-zero offset
+ // Now test with a non-zero offset (skip first element)
+ // [[3, 4, 5], [6, 7]]
let list_data = ArrayData::builder(list_data_type)
- .len(3)
+ .len(2)
Review comment:
This is a similar invalid test that the new checks identified --
`value_data` has only three items in it, so doing offset = 1 and len = 3 can
potentially read off the end. This change corrects the len to be within bounds.
There is a very similar (and thus fixed) test in array_list.rs and one in
array_map.rs
##########
File path: arrow/src/array/data.rs
##########
@@ -559,6 +570,458 @@ impl ArrayData {
)
}
}
+
+ /// "cheap" validation of an `ArrayData`. Ensures buffers are
+ /// sufficiently sized to store `len` + `offset` total elements of
+ /// `data_type` and performs other inexpensive consistency checks.
+ ///
+ /// This check is "cheap" in the sense that it does not validate the
+ /// contents of the buffers (e.g. that all offsets for UTF8 arrays
+ /// are within the bounds of the values buffer).
+ ///
+ /// TODO: add a validate_full that validates the offsets
+ pub fn validate(&self) -> Result<()> {
+ // Need at least this mich space in each buffer
+ let len_plus_offset = self.len + self.offset;
+
+ // Check that the data layout conforms to the spec
+ let layout = layout(&self.data_type);
+
+ // Handling of nulls in `UnionArray` does not seem to conform
+ // to the arrow spec, so skip this check
+ //
+ // Tracking tickets:
+ //
+ // https://github.com/apache/arrow-rs/issues/814
+ // https://github.com/apache/arrow-rs/issues/85
+ if matches!(&self.data_type, DataType::Union(..)) {
+ return Ok(());
Review comment:
The "safe" way here would be to fail validation always for `Union` to
signal to the user that the validation checks are not correct.
I think a better story would be to fix #85 and #814 . I am torn on what to
do in this case
##########
File path: arrow/src/array/array_list.rs
##########
@@ -763,11 +765,12 @@ mod tests {
Box::new(Field::new("item", DataType::Int32, false)),
3,
);
- let list_data = ArrayData::builder(list_data_type)
- .len(3)
- .add_child_data(value_data)
- .build()
- .unwrap();
+ let list_data = unsafe {
Review comment:
this test is checking the ListArray specific error, but since the new
ArrayData validation checks now catch the problem we need to switch to using
`build_unchecked()` to exercise the same code path
##########
File path: arrow/src/array/array_boolean.rs
##########
@@ -332,10 +332,11 @@ mod tests {
#[should_panic(expected = "BooleanArray data should contain a single
buffer only \
(values buffer)")]
fn test_boolean_array_invalid_buffer_len() {
- let data = ArrayData::builder(DataType::Boolean)
- .len(5)
- .build()
- .unwrap();
+ let data = unsafe {
Review comment:
`build()` now fails the earlier validation check -- so to keep the test
checking the internal BooleanArray checks need to use `unsafe`
Note that it might be best to eventually remove all Array specific checks
(which I think will be redundant) in favor of consolidated checks in
ArrayData.rs but I don't want to consider doing that until I have the
validation checks completed
##########
File path: arrow/src/array/array_binary.rs
##########
@@ -891,10 +891,18 @@ mod tests {
assert!(binary_array.is_valid(i));
assert!(!binary_array.is_null(i));
}
+ }
+
+ #[test]
+ fn test_binary_array_with_offsets() {
+ let values: [u8; 12] = [
+ b'h', b'e', b'l', b'l', b'o', b'p', b'a', b'r', b'q', b'u', b'e',
b't',
+ ];
+ let offsets: [i32; 4] = [0, 5, 5, 12];
// Test binary array with offset
let array_data = ArrayData::builder(DataType::Binary)
- .len(4)
+ .len(2)
Review comment:
the array data in this case is only 4 elements long, so using an offset
of 1 with len 4 is incorrect (goes off the end of the array data). The same
with the test below
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]