scovich commented on code in PR #8093:
URL: https://github.com/apache/arrow-rs/pull/8093#discussion_r2271384578
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -434,15 +439,17 @@ mod test {
    }
    #[test]
-    fn invalid_missing_value() {
+    fn all_null_missing_value_and_typed_value() {
        let fields = Fields::from(vec![Field::new("metadata", DataType::BinaryView, false)]);
        let array = StructArray::new(fields, vec![make_binary_view_array()], None);
-        // Should fail because the StructArray does not contain a 'value' field
-        let err = VariantArray::try_new(Arc::new(array));
-        assert_eq!(
-            err.unwrap_err().to_string(),
-            "Invalid argument error: VariantArray has neither value nor typed_value field"
-        );
+        // Should succeed and create an AllNull variant when neither value nor typed_value are present
+        let variant_array = VariantArray::try_new(Arc::new(array)).unwrap();
Review Comment:
By a strict reading of the spec, this should actually fail, because this
`ShreddingState` does not represent a shredded object field, but rather
represents a top-level variant value:
> value | typed_value | Meaning
> -|-|-
> null | null | The value is missing; only valid for shredded object fields
But maybe that's a validation `VariantArray::try_new` should perform, not
`ShreddingState::try_new`?
Also, it would quickly become annoying if `variant_get` had to replace a missing or all-null `value` column with an all-`Variant::Null` column just to comply with the spec. Maybe that's why the spec has this additional tidbit:
> If a Variant is missing in a context where a value is required, readers must return a Variant null (`00`): basic type 0 (primitive) and physical type 0 (null). For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned value must be `00` (Variant null).
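For concreteness, here's a minimal standalone sketch (hypothetical helper and names, not this crate's API) of what that reader-side rule boils down to for one row -- substitute the single-byte Variant null encoding when both columns are null:
```rust
/// Hypothetical sketch of the quoted spec rule; `value`/`typed_value` stand in
/// for one row's column values. The only part taken from the spec is the byte
/// encoding: basic type 0 (primitive) and physical type 0 (null), i.e. `0x00`.
fn read_required_variant(value: Option<&[u8]>, typed_value: Option<&[u8]>) -> Vec<u8> {
    match (value, typed_value) {
        // both columns null in a required context: return Variant null (`00`)
        (None, None) => vec![0x00],
        // unshredded value present: pass the encoded bytes through
        (Some(v), None) => v.to_vec(),
        // a shredded typed_value would need re-encoding; out of scope for this sketch
        (_, Some(_)) => unimplemented!("re-encode typed_value as variant bytes"),
    }
}

fn main() {
    assert_eq!(read_required_variant(None, None), vec![0x00]);
}
```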
##########
parquet-variant-compute/src/variant_get/output/primitive.rs:
##########
@@ -155,6 +156,19 @@ impl<'a, T: ArrowPrimitiveVariant> OutputBuilder for PrimitiveOutputBuilder<'a,
            "variant_get unshredded to primitive types is not implemented yet",
        )))
    }
+
+    fn all_null(
+        &self,
+        variant_array: &VariantArray,
+        _metadata: &BinaryViewArray,
+    ) -> Result<ArrayRef> {
+        // For all-null case, create a primitive array with all null values
+        let nulls = NullBuffer::from(vec![false; variant_array.len()]);
+        let values = vec![T::default_value(); variant_array.len()];
+        let array = PrimitiveArray::<T>::new(values.into(), Some(nulls))
+            .with_data_type(self.as_type.data_type().clone());
+        Ok(Arc::new(array))
Review Comment:
```suggestion
        Ok(Arc::new(new_null_array(self.as_type.data_type(), variant_array.len())))
```
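For reference, a tiny standalone example of what `new_null_array` produces (plain `arrow_array`/`arrow_schema` usage, not part of this diff):
```rust
use arrow_array::{new_null_array, Array};
use arrow_schema::DataType;

fn main() {
    // Builds an ArrayRef of the requested type whose null buffer marks
    // every slot as null.
    let arr = new_null_array(&DataType::Int32, 3);
    assert_eq!(arr.len(), 3);
    assert_eq!(arr.null_count(), 3);
}
```
Compared with hand-building a `PrimitiveArray` of default values, this also works generically across data types.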
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -187,6 +187,7 @@ impl VariantArray {
                    typed_value_to_variant(typed_value, index)
                }
            }
+            ShreddingState::AllNull { .. } => Variant::Null,
Review Comment:
This is tricky... see
https://github.com/apache/arrow-rs/pull/8122#discussion_r2271263385
* For a top-level variant, null/null is illegal (though returning `Variant::Null` is arguably a correct way to compensate)
* For a shredded variant field, null/null means SQL NULL, and returning `Variant::Null` is arguably incorrect (it causes the SQL `IS NULL` operator to return `FALSE`). But we don't even _have_ a way to return SQL NULL here (it would probably correspond to `Option::None`? see the sketch below).
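To make the `Option::None` idea concrete, here's a self-contained sketch with hypothetical types and names (the real `Variant` enum and accessor signatures in this crate are richer than this):
```rust
/// Hypothetical stand-in for the crate's `Variant`; only the null case matters here.
#[derive(Debug, PartialEq)]
enum Variant {
    Null,
}

/// Hypothetical decode step; the real code parses the `value`/`typed_value` bytes.
fn decode(_value: Option<&[u8]>, _typed_value: Option<&[u8]>) -> Variant {
    Variant::Null
}

/// `None` would model SQL NULL (both columns null on a shredded object field),
/// while `Some(Variant::Null)` models a variant/JSON null that is actually present.
fn shredded_field_value(value: Option<&[u8]>, typed_value: Option<&[u8]>) -> Option<Variant> {
    match (value, typed_value) {
        (None, None) => None, // missing field => SQL NULL, so `IS NULL` would be true
        _ => Some(decode(value, typed_value)),
    }
}

fn main() {
    assert_eq!(shredded_field_value(None, None), None);
    assert_eq!(
        shredded_field_value(Some(&[0x00u8][..]), None),
        Some(Variant::Null)
    );
}
```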
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -488,4 +495,50 @@ mod test {
    fn make_binary_array() -> ArrayRef {
        Arc::new(BinaryArray::from(vec![b"test" as &[u8]]))
    }
+
+    #[test]
+    fn all_null_shredding_state() {
+        let metadata = BinaryViewArray::from(vec![b"test" as &[u8]]);
+        let shredding_state = ShreddingState::try_new(metadata.clone(), None, None).unwrap();
+
+        assert!(matches!(shredding_state, ShreddingState::AllNull { .. }));
+
+        // Verify metadata is preserved correctly
+        if let ShreddingState::AllNull { metadata: m } = shredding_state {
+            assert_eq!(m.len(), metadata.len());
+            assert_eq!(m.value(0), metadata.value(0));
+        }
+    }
+
+    #[test]
+    fn all_null_variant_array_construction() {
+        let metadata = BinaryViewArray::from(vec![b"test" as &[u8]; 3]);
+        let nulls = NullBuffer::from(vec![false, false, false]); // all null
Review Comment:
Ah -- the
[spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects)
requires that the struct ("group") containing the `value` and `typed_value` columns be non-nullable:
> The group for each named field must use repetition level `required`. A field's `value` and `typed_value` are set to null (missing) to indicate that the field does not exist in the variant. To encode a field that is present with a [variant/JSON, not SQL] `null` value, the `value` must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).
So effectively, the NULL/NULL combo becomes the null mask for that nested
field. Which is why a top-level NULL/NULL combo is incorrect -- the top-level
field already has its own nullability.
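As an illustration of that layout, a small sketch using plain arrow-rs APIs (the field name, the `Int64` shredded type, and the row values are made up for the example, not taken from this PR):
```rust
use std::sync::Arc;
use arrow_array::{Array, ArrayRef, BinaryViewArray, Int64Array, StructArray};
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // One shredded object field, with the group itself non-nullable
    // (spec: repetition level `required`), so presence/absence of the field
    // is carried entirely by `value`/`typed_value` being null:
    //   row 0: shredded value present (typed_value = 42, value = null)
    //   row 1: field missing          (value = null, typed_value = null)
    //   row 2: field present with a variant null (value = 0x00)
    let value: BinaryViewArray =
        vec![None, None, Some(&[0x00u8][..])].into_iter().collect();
    let typed_value = Int64Array::from(vec![Some(42), None, None]);

    let fields = Fields::from(vec![
        Field::new("value", DataType::BinaryView, true),
        Field::new("typed_value", DataType::Int64, true),
    ]);
    let children = vec![Arc::new(value) as ArrayRef, Arc::new(typed_value) as ArrayRef];

    // nulls = None: the group carries no null mask of its own, per the spec quote above.
    let group = StructArray::new(fields, children, None);
    assert_eq!(group.len(), 3);
}
```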
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -488,4 +495,50 @@ mod test {
    fn make_binary_array() -> ArrayRef {
        Arc::new(BinaryArray::from(vec![b"test" as &[u8]]))
    }
+
+    #[test]
+    fn all_null_shredding_state() {
+        let metadata = BinaryViewArray::from(vec![b"test" as &[u8]]);
+        let shredding_state = ShreddingState::try_new(metadata.clone(), None, None).unwrap();
+
+        assert!(matches!(shredding_state, ShreddingState::AllNull { .. }));
+
+        // Verify metadata is preserved correctly
+        if let ShreddingState::AllNull { metadata: m } = shredding_state {
+            assert_eq!(m.len(), metadata.len());
+            assert_eq!(m.value(0), metadata.value(0));
+        }
+    }
+
+    #[test]
+    fn all_null_variant_array_construction() {
+        let metadata = BinaryViewArray::from(vec![b"test" as &[u8]; 3]);
+        let nulls = NullBuffer::from(vec![false, false, false]); // all null
Review Comment:
From what I understand, the treatment of NULL/NULL depends on context:
* For a top-level variant value, it's interpreted as `Variant::Null`
* For a shredded variant object field, it's interpreted as missing (SQL NULL)
So I guess there are two ways to get SQL NULL -- a null mask on the struct(value, typed_value), or both `value` and `typed_value` being NULL themselves?
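A tiny sketch of the per-row predicate that reading would imply under that interpretation (hypothetical helper, nothing from the crate):
```rust
/// Hypothetical: for one row of a shredded object field, the value is SQL NULL
/// if either the surrounding struct's null mask marks the row as null, or both
/// `value` and `typed_value` are null for that row.
fn is_sql_null(struct_is_valid: bool, value_is_null: bool, typed_value_is_null: bool) -> bool {
    !struct_is_valid || (value_is_null && typed_value_is_null)
}

fn main() {
    assert!(is_sql_null(false, false, false)); // null mask on struct(value, typed_value)
    assert!(is_sql_null(true, true, true));    // value and typed_value both NULL
    assert!(!is_sql_null(true, true, false));  // a typed_value is present
}
```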
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]