scovich commented on code in PR #9598:
URL: https://github.com/apache/arrow-rs/pull/9598#discussion_r3058667538


##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
     // For shredded/partially-shredded targets (`typed_value` present), 
recurse into each field
     // separately to take advantage of deeper shredding in child fields.
     if let DataType::Struct(fields) = as_field.data_type() {
-        if target.typed_value_field().is_none() {
+        let has_variant_fields = fields
+            .iter()
+            .any(|f| f.try_extension_type::<VariantType>().is_ok());

Review Comment:
   This races "silently" with https://github.com/apache/arrow-rs/pull/9677 -- 
no obvious (physical) merge conflict. 
   Should we wait for the other to merge and take advantage of the new API? 
   Or coordinate with @sdf-jkl to merge this PR first and update the other PR 
to optimize this new call site?



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -86,12 +88,13 @@ pub(crate) fn follow_shredded_path_element<'a>(
                 return Ok(missing_path_step());
             };
 
+            // In a shredded variant object, every field must be a struct 
containing
+            // `value` and/or `typed_value` sub-fields. A non-struct field 
here indicates
+            // malformed shredded data, so we error out regardless of cast 
safety mode.
             let struct_array = field.as_struct_opt().ok_or_else(|| {
-                // TODO: Should we blow up? Or just end the traversal and let 
the normal
-                // variant pathing code sort out the mess that it must anyway 
be
-                // prepared to handle?
                 ArrowError::InvalidArgumentError(format!(
-                    "Expected Struct array while following path, got {}",
+                    "Expected shredded variant struct for field '{}', got {}",

Review Comment:
   Is this perhaps clearer?
   ```suggestion
                       "Physical layout of shredded variant field '{}' should 
be a struct, got {}",
   ```
   I'm struggling with the right wording to convey that this particular struct 
is part of shredding's internal layout, unrelated to the logical data type the 
user supplied. Similar to how maps are physically represented as structs.
   
   (the code comment is very nice and clear btw -- I'm only worried about this 
error message)



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
     // For shredded/partially-shredded targets (`typed_value` present), 
recurse into each field
     // separately to take advantage of deeper shredding in child fields.
     if let DataType::Struct(fields) = as_field.data_type() {
-        if target.typed_value_field().is_none() {
+        let has_variant_fields = fields
+            .iter()
+            .any(|f| f.try_extension_type::<VariantType>().is_ok());
+        if target.typed_value_field().is_none() && !has_variant_fields {
             return shred_basic_variant(target, VariantPath::default(), 
Some(as_field));
         }
 
-        let children = fields
-            .iter()
-            .map(|field| {
-                shredded_get_path(
-                    &target,
-                    &[VariantPathElement::from(field.name().as_str())],
-                    Some(field),
-                    cast_options,
-                )
-            })
-            .collect::<Result<Vec<_>>>()?;
+        let mut updated_fields = Vec::with_capacity(fields.len());
+        let mut children = Vec::with_capacity(fields.len());
+        for field in fields.iter() {
+            // If the field has VariantType extension metadata, extract it as a
+            // VariantArray instead of casting to the declared data type. This 
allows
+            // callers to request structs where some fields remain as variants.
+            // See test_struct_extraction_with_variant_fields for usage 
example.
+            let is_variant_field = 
field.try_extension_type::<VariantType>().is_ok();
+            let field_as_type = (!is_variant_field).then(|| field.as_ref());
+            let child = shredded_get_path(
+                &target,
+                &[VariantPathElement::from(field.name().as_str())],
+                field_as_type,
+                cast_options,
+            )?;
+
+            if is_variant_field {
+                // Variant field: the child's actual data type (VariantArray's 
internal
+                // StructArray) differs from the user's declared type (empty 
Struct),
+                // so we must update the field's data type and re-attach 
VariantType.
+                //
+                // When the field is entirely absent, shredded_get_path 
returns a
+                // NullArray. Construct an all-null VariantArray so the 
extension
+                // metadata is preserved.
+                let child = if child.data_type() == &DataType::Null {
+                    let mut builder = VariantArrayBuilder::new(child.len());
+                    for _ in 0..child.len() {
+                        builder.append_null();
+                    }

Review Comment:
   There's been a recent bunch of PR adding (or expanding) support for 
bulk-appending NULL values to array builders, e.g. 
[PrimitiveArrayBuilder::append_nulls](https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveBuilder.html#method.append_nulls),
 
[StructBuilder::append_nulls](https://docs.rs/arrow/latest/arrow/array/struct.StructBuilder.html#method.append_nulls),
 etc. Should we take a follow-on item to add that support to variant array 
builder as well? (Some builders have `append_values` methods as well).



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
     // For shredded/partially-shredded targets (`typed_value` present), 
recurse into each field
     // separately to take advantage of deeper shredding in child fields.
     if let DataType::Struct(fields) = as_field.data_type() {
-        if target.typed_value_field().is_none() {
+        let has_variant_fields = fields
+            .iter()
+            .any(|f| f.try_extension_type::<VariantType>().is_ok());
+        if target.typed_value_field().is_none() && !has_variant_fields {
             return shred_basic_variant(target, VariantPath::default(), 
Some(as_field));
         }
 
-        let children = fields
-            .iter()
-            .map(|field| {
-                shredded_get_path(
-                    &target,
-                    &[VariantPathElement::from(field.name().as_str())],
-                    Some(field),
-                    cast_options,
-                )
-            })
-            .collect::<Result<Vec<_>>>()?;
+        let mut updated_fields = Vec::with_capacity(fields.len());
+        let mut children = Vec::with_capacity(fields.len());
+        for field in fields.iter() {
+            // If the field has VariantType extension metadata, extract it as a
+            // VariantArray instead of casting to the declared data type. This 
allows
+            // callers to request structs where some fields remain as variants.
+            // See test_struct_extraction_with_variant_fields for usage 
example.
+            let is_variant_field = 
field.try_extension_type::<VariantType>().is_ok();
+            let field_as_type = (!is_variant_field).then(|| field.as_ref());
+            let child = shredded_get_path(
+                &target,
+                &[VariantPathElement::from(field.name().as_str())],
+                field_as_type,
+                cast_options,
+            )?;
+
+            if is_variant_field {
+                // Variant field: the child's actual data type (VariantArray's 
internal
+                // StructArray) differs from the user's declared type (empty 
Struct),
+                // so we must update the field's data type and re-attach 
VariantType.
+                //
+                // When the field is entirely absent, shredded_get_path 
returns a

Review Comment:
   ```suggestion
                   // Additionally, when the field is entirely absent, 
shredded_get_path returns a
   ```
   (it took me a minute to grok that these are two mostly-separate issues)



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
     // For shredded/partially-shredded targets (`typed_value` present), 
recurse into each field
     // separately to take advantage of deeper shredding in child fields.
     if let DataType::Struct(fields) = as_field.data_type() {
-        if target.typed_value_field().is_none() {
+        let has_variant_fields = fields
+            .iter()
+            .any(|f| f.try_extension_type::<VariantType>().is_ok());
+        if target.typed_value_field().is_none() && !has_variant_fields {
             return shred_basic_variant(target, VariantPath::default(), 
Some(as_field));
         }
 
-        let children = fields
-            .iter()
-            .map(|field| {
-                shredded_get_path(
-                    &target,
-                    &[VariantPathElement::from(field.name().as_str())],
-                    Some(field),
-                    cast_options,
-                )
-            })
-            .collect::<Result<Vec<_>>>()?;
+        let mut updated_fields = Vec::with_capacity(fields.len());
+        let mut children = Vec::with_capacity(fields.len());
+        for field in fields.iter() {
+            // If the field has VariantType extension metadata, extract it as a
+            // VariantArray instead of casting to the declared data type. This 
allows
+            // callers to request structs where some fields remain as variants.
+            // See test_struct_extraction_with_variant_fields for usage 
example.
+            let is_variant_field = 
field.try_extension_type::<VariantType>().is_ok();
+            let field_as_type = (!is_variant_field).then(|| field.as_ref());
+            let child = shredded_get_path(
+                &target,
+                &[VariantPathElement::from(field.name().as_str())],
+                field_as_type,
+                cast_options,
+            )?;
+
+            if is_variant_field {
+                // Variant field: the child's actual data type (VariantArray's 
internal
+                // StructArray) differs from the user's declared type (empty 
Struct),

Review Comment:
   Wait... is it ever (supposed to be) an empty struct? 
   
   The extension type blindly accepts any struct, so I guess that does work. 
And I don't see any code that would help a user create a correct physical 
variant struct (binary variant, shredded, etc).
   
   Which opens up an interesting design question, because a request for 
variant_get to return variant could be any of three "flavors":
   1. Generic
      * "give me back variant, in whatever form it takes"
      * If the underlying path is already variant, return it unchanged
      * If strongly typed, there's a very fast path to shred it -- especially 
if the column contains no nulls
      * Probably the most common, and also what this PR does (I think)
   2. Unshredded
      * "give me back binary variant no matter what form the data takes"
      * If the underlying path is variant, unshred it as needed
      * If strongly typed, convert to binary variant
      * Useful for a caller who doesn't want to deal with (or cannot handle) 
the complexity and physical schema heterogeneity of shredding
   3. Shredded
      * "give me back shredded variant following the specific shredding schema 
I provided"
      * If the underlying path is variant, shred or reshred as needed
      * If the underlying path is strongly typed, shred it using the provided 
shredding schema
      * Useful for a caller who expects a specific format to the data but needs 
to handle outliers
   
   I guess the best way to handle this is for users to send an empty struct for 
1/, with a non-empty struct triggering 2/ and 3/ as appropriate? With the usual 
recursive pathing to handle each leaf field as efficiently as possible? But we 
don't currently have a way to pass a shredding schema, so everyone is forced to 
1/.



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
     // For shredded/partially-shredded targets (`typed_value` present), 
recurse into each field
     // separately to take advantage of deeper shredding in child fields.
     if let DataType::Struct(fields) = as_field.data_type() {
-        if target.typed_value_field().is_none() {
+        let has_variant_fields = fields
+            .iter()
+            .any(|f| f.try_extension_type::<VariantType>().is_ok());
+        if target.typed_value_field().is_none() && !has_variant_fields {
             return shred_basic_variant(target, VariantPath::default(), 
Some(as_field));
         }
 
-        let children = fields
-            .iter()
-            .map(|field| {
-                shredded_get_path(
-                    &target,
-                    &[VariantPathElement::from(field.name().as_str())],
-                    Some(field),
-                    cast_options,
-                )
-            })
-            .collect::<Result<Vec<_>>>()?;
+        let mut updated_fields = Vec::with_capacity(fields.len());
+        let mut children = Vec::with_capacity(fields.len());
+        for field in fields.iter() {
+            // If the field has VariantType extension metadata, extract it as a
+            // VariantArray instead of casting to the declared data type. This 
allows
+            // callers to request structs where some fields remain as variants.
+            // See test_struct_extraction_with_variant_fields for usage 
example.
+            let is_variant_field = 
field.try_extension_type::<VariantType>().is_ok();
+            let field_as_type = (!is_variant_field).then(|| field.as_ref());
+            let child = shredded_get_path(
+                &target,
+                &[VariantPathElement::from(field.name().as_str())],
+                field_as_type,
+                cast_options,
+            )?;
+
+            if is_variant_field {
+                // Variant field: the child's actual data type (VariantArray's 
internal
+                // StructArray) differs from the user's declared type (empty 
Struct),
+                // so we must update the field's data type and re-attach 
VariantType.
+                //
+                // When the field is entirely absent, shredded_get_path 
returns a
+                // NullArray. Construct an all-null VariantArray so the 
extension
+                // metadata is preserved.
+                let child = if child.data_type() == &DataType::Null {
+                    let mut builder = VariantArrayBuilder::new(child.len());
+                    for _ in 0..child.len() {
+                        builder.append_null();
+                    }
+                    Arc::new(builder.build().into_inner()) as ArrayRef
+                } else {
+                    child
+                };
+                let new_field = field
+                    .as_ref()
+                    .clone()
+                    .with_data_type(child.data_type().clone())
+                    .with_extension_type(VariantType);
+                updated_fields.push(new_field);

Review Comment:
   nit: We're cloning and then discarding the field's existing data type.
   ```suggestion
                   let new_field = Field::new(
                       field.name().clone(), 
                       child.data_type().clone(), 
                       field.is_nullable(),
                   );
                   
updated_fields.push(Arc::new(new_field.with_extension_type(VariantType)));
   ```



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
     // For shredded/partially-shredded targets (`typed_value` present), 
recurse into each field
     // separately to take advantage of deeper shredding in child fields.
     if let DataType::Struct(fields) = as_field.data_type() {
-        if target.typed_value_field().is_none() {
+        let has_variant_fields = fields
+            .iter()
+            .any(|f| f.try_extension_type::<VariantType>().is_ok());
+        if target.typed_value_field().is_none() && !has_variant_fields {
             return shred_basic_variant(target, VariantPath::default(), 
Some(as_field));
         }
 
-        let children = fields
-            .iter()
-            .map(|field| {
-                shredded_get_path(
-                    &target,
-                    &[VariantPathElement::from(field.name().as_str())],
-                    Some(field),
-                    cast_options,
-                )
-            })
-            .collect::<Result<Vec<_>>>()?;
+        let mut updated_fields = Vec::with_capacity(fields.len());
+        let mut children = Vec::with_capacity(fields.len());
+        for field in fields.iter() {
+            // If the field has VariantType extension metadata, extract it as a
+            // VariantArray instead of casting to the declared data type. This 
allows
+            // callers to request structs where some fields remain as variants.
+            // See test_struct_extraction_with_variant_fields for usage 
example.
+            let is_variant_field = 
field.try_extension_type::<VariantType>().is_ok();
+            let field_as_type = (!is_variant_field).then(|| field.as_ref());
+            let child = shredded_get_path(
+                &target,
+                &[VariantPathElement::from(field.name().as_str())],
+                field_as_type,
+                cast_options,
+            )?;
+
+            if is_variant_field {
+                // Variant field: the child's actual data type (VariantArray's 
internal
+                // StructArray) differs from the user's declared type (empty 
Struct),

Review Comment:
   This relates to 
https://github.com/apache/arrow-rs/pull/9598#discussion_r3022228974, where I 
was worried that variant extension type would "leak" -- because I had been 
assuming we covered all three cases above, and that variant extension type was 
used recursively as the marker to distinguish between strongly typed vs. 
shredding schemas (which becomes a problem once we recurse into a shredded 
schema).
   
   In effect, we have several different "flavors" of calls to 
`shredded_path_get`:
   1. No specific type requested (return whatever we find)
   2. Strong type requested (parse out the variant values as needed)
   3. Specific variant type requested (provided field is a struct with variant 
extension type metadata)
   4. Recursing into the shredded field of a specific requested variant type 
   
   The problem is, a strongly typed struct (2/) is indistinguishable from a 
shredded variant field (4/) -- both fields are structs lacking variant 
extension type metadata.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to