scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550215532


##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -208,6 +233,21 @@ fn shredded_get_path(
         return Ok(ArrayRef::from(target));
     };
 
+    // Try to return the typed value directly when we have a perfect shredding 
match.
+    if !matches!(as_field.data_type(), DataType::Struct(_)) {
+        if let Some(typed_value) = target.typed_value_field() {
+            let types_match = typed_value.data_type() == as_field.data_type();
+            let value_not_present = target.value_field().is_none();
+            // this is a perfect shredding, where the value is entirely 
shredded out, so we can just return the typed value
+            // note that we MUST check value_not_present, because some of the 
`typed_value` might be null but data is present in the `value` column.
+            // an alternative is to count whether typed_value has any 
non-nulls, or check every row in `value` is null,
+            // but this is too complicated and might be slow.

Review Comment:
   I'm pretty sure 
[NullBuffer::null_count](https://docs.rs/arrow-buffer/57.0.0/src/arrow_buffer/buffer/null.rs.html#144)
 is a stored (not computed) value, so:
   ```rust
   let value_not_present = target.value_field().is_none_or(|v| v.null_count() 
== v.len());
   ```
   
   Also: Double check line wrapping here? (seems like a lot more than 100 chars)



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
     }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+    array: &ArrayRef,
+    parent_nulls: Option<&NullBuffer>,
+) -> Result<ArrayRef> {
+    let Some(parent_nulls) = parent_nulls else {
+        return Ok(array.clone());
+    };
+    if parent_nulls.null_count() == 0 {
+        return Ok(array.clone());
+    }
+
+    let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   I don't think we need to actually combine nulls? 
   
   For all types other than variant object (which we explicitly skip), the 
shredding spec requires exactly one of `value` or `typed_value` to be non-NULL 
in each row, with SQL NULL values encoded as `Variant::Null` in the `value` 
column. 
   
   => The presence of any NULL values in a `typed_value` column breaks perfect 
shredding (even if it's a true SQL NULL)
   => The child array must have null_count 0 if we reached this point
   => We can just apply the parent nulls to the child array and be done with it
   
   Also: https://github.com/apache/arrow-rs/issues/6528 seems relevant here?



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
     }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+    array: &ArrayRef,
+    parent_nulls: Option<&NullBuffer>,
+) -> Result<ArrayRef> {
+    let Some(parent_nulls) = parent_nulls else {
+        return Ok(array.clone());
+    };
+    if parent_nulls.null_count() == 0 {
+        return Ok(array.clone());
+    }
+
+    let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   Meanwhile: This is a single-caller function with _very_ specific semantics. 
Perhaps it's better to fold it in at the call site, where the necessary 
semantics are more obvious? Especially if we're able to simplify it following 
the above observation?
   
   Something like:
   ```rust
   if let Some(typed_value) = target.typed_value_field() {
       let perfectly_shredded = typed_value.null_count() == 0;
       let as_type = as_field.data_type();
       if perfectly_shredded && can_cast_types(typed_value.data_type(), 
as_type) {
           let mut array = match target.nulls() {
               None => typed_value.clone(),
               Some(nulls) => {
                   let builder = array.to_data().into_builder();
                   make_array(builder.nulls(Some(nulls)).build()?)
               }
           };
           if typed_value.data_type() != as_type {
               array = cast(array, as_type)?;
           }
           Ok(array)
       }
   }
   ```



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -208,6 +233,21 @@ fn shredded_get_path(
         return Ok(ArrayRef::from(target));
     };
 
+    // Try to return the typed value directly when we have a perfect shredding 
match.
+    if !matches!(as_field.data_type(), DataType::Struct(_)) {
+        if let Some(typed_value) = target.typed_value_field() {
+            let types_match = typed_value.data_type() == as_field.data_type();
+            let value_not_present = target.value_field().is_none();
+            // this is a perfect shredding, where the value is entirely 
shredded out, so we can just return the typed value
+            // note that we MUST check value_not_present, because some of the 
`typed_value` might be null but data is present in the `value` column.
+            // an alternative is to count whether typed_value has any 
non-nulls, or check every row in `value` is null,
+            // but this is too complicated and might be slow.

Review Comment:
   The expensive check to avoid is counting up `Variant::Null` values.
   
   Unfortunately, NULL in a `typed_value` column signals a non-NULL `value` 
(even if that value is `Variant::Null`), so a nullable column will usually not 
shred perfectly even if all non-NULL values shred perfectly. We must fall back 
to a variant builder that manually converts a `value` column full of 
`Variant::Null` back to NULL.



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -208,6 +233,21 @@ fn shredded_get_path(
         return Ok(ArrayRef::from(target));
     };
 
+    // Try to return the typed value directly when we have a perfect shredding 
match.
+    if !matches!(as_field.data_type(), DataType::Struct(_)) {
+        if let Some(typed_value) = target.typed_value_field() {
+            let types_match = typed_value.data_type() == as_field.data_type();

Review Comment:
   AFAIK, the current code is correct unless we're willing to inject a 
typecast. 
   Otherwise, the caller could end up with a different type than they 
requested. 
   
   That said, casting does seem reasonable -- it would certainly be faster than 
a variant builder, and the caller _did_ request a specific type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to