jorgecarleitao commented on a change in pull request #8546:
URL: https://github.com/apache/arrow/pull/8546#discussion_r513933731
##########
File path: rust/arrow/src/array/data.rs
##########
@@ -209,6 +209,61 @@ impl ArrayData {
}
}
+impl PartialEq for ArrayData {
+ fn eq(&self, other: &Self) -> bool {
+ assert_eq!(
+ self.data_type(),
+ other.data_type(),
+ "Data types not the same"
+ );
+ assert_eq!(self.len(), other.len(), "Lengths not the same");
+ // TODO: when adding tests for this, test that we can compare with arrays that have offsets
+ assert_eq!(self.offset(), other.offset(), "Offsets not the same");
+ assert_eq!(self.null_count(), other.null_count());
+ // compare buffers excluding padding
+ let self_buffers = self.buffers();
+ let other_buffers = other.buffers();
+ assert_eq!(self_buffers.len(), other_buffers.len());
+ self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
+ compare_buffer_regions(
+ s,
+ self.offset(), // TODO mul by data length
+ o,
+ other.offset(), // TODO mul by data len
+ );
+ });
+ // assert_eq!(self.buffers(), other.buffers());
+
+ assert_eq!(self.child_data(), other.child_data());
+ // null arrays can skip the null bitmap, thus only compare if there are no nulls
+ if self.null_count() != 0 || other.null_count() != 0 {
+ compare_buffer_regions(
+ self.null_buffer().unwrap(),
+ self.offset(),
+ other.null_buffer().unwrap(),
+ other.offset(),
+ )
+ }
Review comment:
I can't understand this part of the code.
1. Why are these asserts instead of equality checks? Won't this panic on comparison?
2. Why do we only compare the buffers when there are no nulls? What about the other cases?
I am trying to understand the goal here: IMO there are two equalities that we can define on `ArrayData`: a "physical equality" and a "logical equality".
Physical equality compares the data (in bytes) on the `ArrayData`, and two arrays are equal when that data is equal. For example, two `ArrayData` with one buffer each and a null bitmap of the form:
```
# (buffer, null_bit_as_bools)
a = ([1, 2], [true, false])
b = ([1, 0], [true, false])
```
are different under this equality because their data is different.
Logical equality interprets the data on the `ArrayData` according to the spec and performs the comparison accordingly. In the example above, the two arrays are equal because the `2` and `0` are undefined due to the null bit.
Is this code trying to achieve the former equality, or the latter? If it is the former, then:
* it needs to take into account that buffer equality for the null bitmap and for boolean arrays is different, and
* the condition `if self.null_count() != 0 || other.null_count() != 0` should not be there, as a physical comparison ignores null counts.
If it is trying to achieve logical equality, then it has a lot of issues, as buffer and child equality are not that simple.
Coincidentally, I have been looking at the logical comparison of `ArrayData` as part of #8541 .
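The distinction between the two equalities can be sketched without the arrow crate. In this sketch, `MiniArray`, `physical_eq`, and `logical_eq` are hypothetical names for illustration (the null bitmap is modeled as a `Vec<bool>` rather than packed bits):

```rust
// Hypothetical model of an array: a values buffer plus a validity bitmap.
#[derive(Debug)]
struct MiniArray {
    values: Vec<i32>,
    valid: Vec<bool>, // true = valid, false = null
}

// Physical equality: the raw data must match exactly, nulls included.
fn physical_eq(a: &MiniArray, b: &MiniArray) -> bool {
    a.values == b.values && a.valid == b.valid
}

// Logical equality: values under a null slot are undefined, so they are
// ignored; only the validity bitmaps and the valid values must match.
fn logical_eq(a: &MiniArray, b: &MiniArray) -> bool {
    a.values.len() == b.values.len()
        && a.valid == b.valid
        && a
            .values
            .iter()
            .zip(&b.values)
            .zip(&a.valid)
            .all(|((x, y), v)| !v || x == y)
}

fn main() {
    // the example from the comment above:
    // a = ([1, 2], [true, false]); b = ([1, 0], [true, false])
    let a = MiniArray { values: vec![1, 2], valid: vec![true, false] };
    let b = MiniArray { values: vec![1, 0], valid: vec![true, false] };
    assert!(!physical_eq(&a, &b)); // bytes differ: 2 vs 0
    assert!(logical_eq(&a, &b)); // slot 1 is null in both, so they are equal
    println!("physical: false, logical: true");
}
```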
##########
File path: rust/parquet/Cargo.toml
##########
@@ -40,6 +40,7 @@ zstd = { version = "0.5", optional = true }
chrono = "0.4"
num-bigint = "0.3"
arrow = { path = "../arrow", version = "3.0.0-SNAPSHOT", optional = true }
+base64 = { version = "*", optional = true }
Review comment:
Shouldn't we bracket this, to avoid backward incompatibilities if `base64` releases a new major version?
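For reference, a caret requirement restricts updates to semver-compatible releases (and for `0.x` crates, a minor bump such as `0.12` to `0.13` already counts as breaking). A sketch of what the bracketed dependency could look like; the version number here is illustrative, not the crate's actual current release:

```toml
# pin to a compatible range instead of "*"
base64 = { version = "0.12", optional = true }
```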
##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -1194,6 +1372,17 @@ mod tests {
});
}
+ #[test]
+ #[should_panic(expected = "Parquet does not support writing empty structs")]
+ fn test_empty_struct_field() {
+ let arrow_fields = vec![Field::new("struct", DataType::Struct(vec![]), false)];
+ let arrow_schema = Schema::new(arrow_fields);
+ let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema);
+
+ assert!(converted_arrow_schema.is_err());
+ converted_arrow_schema.unwrap();
+ }
Review comment:
Instead of `should_panic`, could we test the error type and message? `should_panic` is generally reserved for catching panics in the code under test, not for raising the panic in the test itself.
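A sketch of the suggested pattern, asserting on the `Err` value instead of unwrapping and panicking. `SchemaError` and `arrow_to_parquet` are simplified stand-ins for the crate's actual error type and `arrow_to_parquet_schema`:

```rust
// Stand-in for the crate's error type, for illustration only.
#[derive(Debug, PartialEq)]
enum SchemaError {
    Arrow(String),
}

// Stand-in for arrow_to_parquet_schema: rejects an empty field list.
fn arrow_to_parquet(fields: &[&str]) -> Result<usize, SchemaError> {
    if fields.is_empty() {
        return Err(SchemaError::Arrow(
            "Parquet does not support writing empty structs".to_string(),
        ));
    }
    Ok(fields.len())
}

fn main() {
    // assert on the error type and message rather than using #[should_panic]
    match arrow_to_parquet(&[]) {
        Err(SchemaError::Arrow(msg)) => {
            assert_eq!(msg, "Parquet does not support writing empty structs")
        }
        other => panic!("expected an error, got {:?}", other),
    }
    println!("error asserted without should_panic");
}
```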
##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -267,9 +444,6 @@ fn arrow_to_parquet_type(field: &Field) -> Result<Type> {
let dict_field = Field::new(name, *value.clone(), field.is_nullable());
arrow_to_parquet_type(&dict_field)
}
- DataType::LargeUtf8 | DataType::LargeBinary | DataType::LargeList(_) => {
- Err(ArrowError("Large arrays not supported".to_string()))
- }
Review comment:
❤️
##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -200,8 +200,7 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
(Timestamp(_, _), Date32(_)) => true,
(Timestamp(_, _), Date64(_)) => true,
// date64 to timestamp might not make sense,
-
- // end temporal casts
+ (Null, Int32) => true,
Review comment:
Shouldn't we support (literally) all types?
##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -2646,11 +2651,25 @@ mod tests {
// Cast to a dictionary (same value type, Utf8)
let cast_type = Dictionary(Box::new(UInt8), Box::new(Utf8));
- let cast_array = cast(&array, &cast_type).expect("cast succeeded");
+ let cast_array = cast(&array, &cast_type).expect("cast failed");
assert_eq!(cast_array.data_type(), &cast_type);
assert_eq!(array_to_strings(&cast_array), expected);
}
+ #[test]
+ fn test_cast_null_array_to_int32() {
+ let array = Arc::new(NullArray::new(6)) as ArrayRef;
+
+ let expected = Int32Array::from(vec![None; 6]);
+
+ // Cast to a dictionary (same value type, Utf8)
Review comment:
```suggestion
// Cast to an Int32
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]