scovich commented on code in PR #8093:
URL: https://github.com/apache/arrow-rs/pull/8093#discussion_r2271384578
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -434,15 +439,17 @@ mod test {
     }
 
     #[test]
-    fn invalid_missing_value() {
+    fn all_null_missing_value_and_typed_value() {
         let fields = Fields::from(vec![Field::new("metadata", DataType::BinaryView, false)]);
         let array = StructArray::new(fields, vec![make_binary_view_array()], None);
-        // Should fail because the StructArray does not contain a 'value' field
-        let err = VariantArray::try_new(Arc::new(array));
-        assert_eq!(
-            err.unwrap_err().to_string(),
-            "Invalid argument error: VariantArray has neither value nor typed_value field"
-        );
+        // Should succeed and create an AllNull variant when neither value nor typed_value are present
+        let variant_array = VariantArray::try_new(Arc::new(array)).unwrap();

Review Comment:
By a strict reading of the spec, this should actually fail, because this `ShreddingState` does not represent a shredded object field, but rather represents a top-level variant value:

> value | typed_value | Meaning
> -|-|-
> null | null | The value is missing; only valid for shredded object fields

But maybe that's a validation `VariantArray::try_new` should perform, not `ShreddingState::try_new`?

Also, it would quickly become annoying if `variant_get` has to replace a missing or all-null `value` column with an all-`Variant::Null` column just to comply with the spec? Maybe that's why there's this additional tidbit?

> If a Variant is missing in a context where a value is required, readers must return a Variant null (`00`): basic type 0 (primitive) and physical type 0 (null). For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned value must be `00` (Variant null).

##########
parquet-variant-compute/src/variant_get/output/primitive.rs:
##########
@@ -155,6 +156,19 @@ impl<'a, T: ArrowPrimitiveVariant> OutputBuilder for PrimitiveOutputBuilder<'a,
                 "variant_get unshredded to primitive types is not implemented yet",
             )))
         }
+
+    fn all_null(
+        &self,
+        variant_array: &VariantArray,
+        _metadata: &BinaryViewArray,
+    ) -> Result<ArrayRef> {
+        // For all-null case, create a primitive array with all null values
+        let nulls = NullBuffer::from(vec![false; variant_array.len()]);
+        let values = vec![T::default_value(); variant_array.len()];
+        let array = PrimitiveArray::<T>::new(values.into(), Some(nulls))
+            .with_data_type(self.as_type.data_type().clone());
+        Ok(Arc::new(array))

Review Comment:
```suggestion
        Ok(Arc::new(new_null_array(self.as_type.data_type(), variant_array.len())))
```

##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -187,6 +187,7 @@ impl VariantArray {
                 typed_value_to_variant(typed_value, index)
             }
         }
+        ShreddingState::AllNull { .. } => Variant::Null,

Review Comment:
This is tricky... see https://github.com/apache/arrow-rs/pull/8122#discussion_r2271263385
* For a top-level variant, null/null is illegal (tho returning `Variant::Null` is arguably a correct way to compensate)
* For a shredded variant field, null/null means SQL NULL, and returning `Variant::Null` is arguably incorrect (causes SQL `IS NULL` operator to return `FALSE`). But we don't even _have_ a way to return SQL NULL here (it would probably correspond to `Option::None`?)
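To make the `Option::None` idea above concrete, here is a minimal, self-contained sketch. It is not code from this PR: `FieldState` and `field_value` are hypothetical stand-ins rather than the crate's actual `ShreddingState`/`VariantArray` API. It only illustrates how an `Option<Variant>` return could distinguish SQL NULL from `Variant::Null`, with a top-level reader falling back to Variant null (`00`) as the spec quote above requires.

```rust
// Hypothetical, simplified stand-ins for the real ShreddingState / Variant types,
// used only to illustrate the Option-based return speculated about above.
#[derive(Debug, PartialEq)]
enum Variant {
    /// Variant null: basic type 0 (primitive), physical type 0 (null), i.e. `00`
    Null,
    /// One example of a shredded typed value
    Int64(i64),
}

enum FieldState {
    /// Both `value` and `typed_value` are null for this row
    AllNull,
    /// The row was shredded into `typed_value`
    Typed(i64),
}

/// `Option<Variant>` lets a caller distinguish SQL NULL (`None`) from
/// `Variant::Null` (`Some(Variant::Null)`) for a shredded object field.
fn field_value(state: &FieldState) -> Option<Variant> {
    match state {
        FieldState::AllNull => None,
        FieldState::Typed(v) => Some(Variant::Int64(*v)),
    }
}

fn main() {
    // Shredded object field: null/null means the field is missing (SQL NULL).
    assert_eq!(field_value(&FieldState::AllNull), None);
    assert_eq!(field_value(&FieldState::Typed(42)), Some(Variant::Int64(42)));

    // Top-level context where a value is required: the spec says readers must
    // return Variant null (`00`), so a reader could map `None` back to it.
    let top_level = field_value(&FieldState::AllNull).unwrap_or(Variant::Null);
    assert_eq!(top_level, Variant::Null);
}
```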
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -488,4 +495,50 @@ mod test {
     fn make_binary_array() -> ArrayRef {
         Arc::new(BinaryArray::from(vec![b"test" as &[u8]]))
     }
+
+    #[test]
+    fn all_null_shredding_state() {
+        let metadata = BinaryViewArray::from(vec![b"test" as &[u8]]);
+        let shredding_state = ShreddingState::try_new(metadata.clone(), None, None).unwrap();
+
+        assert!(matches!(shredding_state, ShreddingState::AllNull { .. }));
+
+        // Verify metadata is preserved correctly
+        if let ShreddingState::AllNull { metadata: m } = shredding_state {
+            assert_eq!(m.len(), metadata.len());
+            assert_eq!(m.value(0), metadata.value(0));
+        }
+    }
+
+    #[test]
+    fn all_null_variant_array_construction() {
+        let metadata = BinaryViewArray::from(vec![b"test" as &[u8]; 3]);
+        let nulls = NullBuffer::from(vec![false, false, false]); // all null

Review Comment:
Ah -- the [spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects) requires that the struct ("group") containing `value` and `typed_value` columns must be non-nullable:

> The group for each named field must use repetition level `required`. A field's `value` and `typed_value` are set to null (missing) to indicate that the field does not exist in the variant. To encode a field that is present with a [variant/JSON, not SQL] `null` value, the `value` must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).

So effectively, the NULL/NULL combo becomes the null mask for that nested field. Which is why a top-level NULL/NULL combo is incorrect -- the top-level field already has its own nullability.

##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -488,4 +495,50 @@ mod test {
     fn make_binary_array() -> ArrayRef {
         Arc::new(BinaryArray::from(vec![b"test" as &[u8]]))
     }
+
+    #[test]
+    fn all_null_shredding_state() {
+        let metadata = BinaryViewArray::from(vec![b"test" as &[u8]]);
+        let shredding_state = ShreddingState::try_new(metadata.clone(), None, None).unwrap();
+
+        assert!(matches!(shredding_state, ShreddingState::AllNull { .. }));
+
+        // Verify metadata is preserved correctly
+        if let ShreddingState::AllNull { metadata: m } = shredding_state {
+            assert_eq!(m.len(), metadata.len());
+            assert_eq!(m.value(0), metadata.value(0));
+        }
+    }
+
+    #[test]
+    fn all_null_variant_array_construction() {
+        let metadata = BinaryViewArray::from(vec![b"test" as &[u8]; 3]);
+        let nulls = NullBuffer::from(vec![false, false, false]); // all null

Review Comment:
From what I understand, the treatment of NULL/NULL depends on context:
* For a top-level variant value, it's interpreted as `Variant::Null`
* For a shredded variant object field, it's interpreted as missing (SQL NULL)

So I guess there are two ways to get SQL NULL -- null mask on the struct(value, typed_value), or if both value and typed_value are themselves NULL?
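As an illustration of the last point, here is a standalone arrow-rs snippet (not taken from this PR; the row data and field layout are invented for the example, and only the `value`/`typed_value` names come from the shredding spec) showing the two ways a row of a shredded object field can read as SQL NULL:

```rust
// Standalone illustration of the two NULL/NULL encodings discussed above:
// a struct-level null mask entry, versus a row where both the `value` and
// `typed_value` children are themselves null.
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, BinaryViewArray, Int64Array, StructArray};
use arrow::buffer::NullBuffer;
use arrow::datatypes::{DataType, Field, Fields};

fn main() {
    let fields = Fields::from(vec![
        Field::new("value", DataType::BinaryView, true),
        Field::new("typed_value", DataType::Int64, true),
    ]);

    // Row 0: shredded, typed_value = 42
    // Row 1: SQL NULL via the struct's null mask
    // Row 2: struct is "present", but value and typed_value are both null
    let value: ArrayRef = Arc::new(BinaryViewArray::from_iter([None::<&[u8]>; 3]));
    let typed_value: ArrayRef = Arc::new(Int64Array::from(vec![Some(42), None, None]));
    let nulls = NullBuffer::from(vec![true, false, true]);

    let field = StructArray::new(fields, vec![value, typed_value], Some(nulls));

    assert!(!field.is_null(0)); // shredded value present
    assert!(field.is_null(1)); // NULL from the struct-level mask
    assert!(!field.is_null(2)); // not null at the struct level...
    // ...but NULL/NULL on the children, the other encoding of a missing field
    assert!(field.column(0).is_null(2) && field.column(1).is_null(2));
}
```

In this sketch, row 1 is SQL NULL via the struct's own null mask, while row 2 is "valid" at the struct level but has both children null -- the NULL/NULL combination the spec reserves for a missing shredded object field.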