scovich commented on code in PR #9598:
URL: https://github.com/apache/arrow-rs/pull/9598#discussion_r3058667538
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
// For shredded/partially-shredded targets (`typed_value` present),
recurse into each field
// separately to take advantage of deeper shredding in child fields.
if let DataType::Struct(fields) = as_field.data_type() {
- if target.typed_value_field().is_none() {
+ let has_variant_fields = fields
+ .iter()
+ .any(|f| f.try_extension_type::<VariantType>().is_ok());
Review Comment:
This races "silently" with https://github.com/apache/arrow-rs/pull/9677 --
no obvious (physical) merge conflict.
Should we wait for the other to merge and take advantage of the new API?
Or coordinate with @sdf-jkl to merge this PR first and update the other PR
to optimize this new call site?
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -86,12 +88,13 @@ pub(crate) fn follow_shredded_path_element<'a>(
return Ok(missing_path_step());
};
+ // In a shredded variant object, every field must be a struct
containing
+ // `value` and/or `typed_value` sub-fields. A non-struct field
here indicates
+ // malformed shredded data, so we error out regardless of cast
safety mode.
let struct_array = field.as_struct_opt().ok_or_else(|| {
- // TODO: Should we blow up? Or just end the traversal and let
the normal
- // variant pathing code sort out the mess that it must anyway
be
- // prepared to handle?
ArrowError::InvalidArgumentError(format!(
- "Expected Struct array while following path, got {}",
+ "Expected shredded variant struct for field '{}', got {}",
Review Comment:
Is this perhaps clearer?
```suggestion
"Physical layout of shredded variant field '{}' should
be a struct, got {}",
```
I'm struggling with the right wording to convey that this particular struct
is part of shredding's internal layout, unrelated to the logical data type the
user supplied. Similar to how maps are physically represented as structs.
(the code comment is very nice and clear btw -- I'm only worried about this
error message)
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
// For shredded/partially-shredded targets (`typed_value` present),
recurse into each field
// separately to take advantage of deeper shredding in child fields.
if let DataType::Struct(fields) = as_field.data_type() {
- if target.typed_value_field().is_none() {
+ let has_variant_fields = fields
+ .iter()
+ .any(|f| f.try_extension_type::<VariantType>().is_ok());
+ if target.typed_value_field().is_none() && !has_variant_fields {
return shred_basic_variant(target, VariantPath::default(),
Some(as_field));
}
- let children = fields
- .iter()
- .map(|field| {
- shredded_get_path(
- &target,
- &[VariantPathElement::from(field.name().as_str())],
- Some(field),
- cast_options,
- )
- })
- .collect::<Result<Vec<_>>>()?;
+ let mut updated_fields = Vec::with_capacity(fields.len());
+ let mut children = Vec::with_capacity(fields.len());
+ for field in fields.iter() {
+ // If the field has VariantType extension metadata, extract it as a
+ // VariantArray instead of casting to the declared data type. This
allows
+ // callers to request structs where some fields remain as variants.
+ // See test_struct_extraction_with_variant_fields for usage
example.
+ let is_variant_field =
field.try_extension_type::<VariantType>().is_ok();
+ let field_as_type = (!is_variant_field).then(|| field.as_ref());
+ let child = shredded_get_path(
+ &target,
+ &[VariantPathElement::from(field.name().as_str())],
+ field_as_type,
+ cast_options,
+ )?;
+
+ if is_variant_field {
+ // Variant field: the child's actual data type (VariantArray's
internal
+ // StructArray) differs from the user's declared type (empty
Struct),
+ // so we must update the field's data type and re-attach
VariantType.
+ //
+ // When the field is entirely absent, shredded_get_path
returns a
+ // NullArray. Construct an all-null VariantArray so the
extension
+ // metadata is preserved.
+ let child = if child.data_type() == &DataType::Null {
+ let mut builder = VariantArrayBuilder::new(child.len());
+ for _ in 0..child.len() {
+ builder.append_null();
+ }
Review Comment:
There's been a recent bunch of PR adding (or expanding) support for
bulk-appending NULL values to array builders, e.g.
[PrimitiveArrayBuilder::append_nulls](https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveBuilder.html#method.append_nulls),
[StructBuilder::append_nulls](https://docs.rs/arrow/latest/arrow/array/struct.StructBuilder.html#method.append_nulls),
etc. Should we take a follow-on item to add that support to variant array
builder as well? (Some builders have `append_values` methods as well).
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
// For shredded/partially-shredded targets (`typed_value` present),
recurse into each field
// separately to take advantage of deeper shredding in child fields.
if let DataType::Struct(fields) = as_field.data_type() {
- if target.typed_value_field().is_none() {
+ let has_variant_fields = fields
+ .iter()
+ .any(|f| f.try_extension_type::<VariantType>().is_ok());
+ if target.typed_value_field().is_none() && !has_variant_fields {
return shred_basic_variant(target, VariantPath::default(),
Some(as_field));
}
- let children = fields
- .iter()
- .map(|field| {
- shredded_get_path(
- &target,
- &[VariantPathElement::from(field.name().as_str())],
- Some(field),
- cast_options,
- )
- })
- .collect::<Result<Vec<_>>>()?;
+ let mut updated_fields = Vec::with_capacity(fields.len());
+ let mut children = Vec::with_capacity(fields.len());
+ for field in fields.iter() {
+ // If the field has VariantType extension metadata, extract it as a
+ // VariantArray instead of casting to the declared data type. This
allows
+ // callers to request structs where some fields remain as variants.
+ // See test_struct_extraction_with_variant_fields for usage
example.
+ let is_variant_field =
field.try_extension_type::<VariantType>().is_ok();
+ let field_as_type = (!is_variant_field).then(|| field.as_ref());
+ let child = shredded_get_path(
+ &target,
+ &[VariantPathElement::from(field.name().as_str())],
+ field_as_type,
+ cast_options,
+ )?;
+
+ if is_variant_field {
+ // Variant field: the child's actual data type (VariantArray's
internal
+ // StructArray) differs from the user's declared type (empty
Struct),
+ // so we must update the field's data type and re-attach
VariantType.
+ //
+ // When the field is entirely absent, shredded_get_path
returns a
Review Comment:
```suggestion
// Additionally, when the field is entirely absent,
shredded_get_path returns a
```
(it took me a minute to grok that these are two mostly-separate issues)
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
// For shredded/partially-shredded targets (`typed_value` present),
recurse into each field
// separately to take advantage of deeper shredding in child fields.
if let DataType::Struct(fields) = as_field.data_type() {
- if target.typed_value_field().is_none() {
+ let has_variant_fields = fields
+ .iter()
+ .any(|f| f.try_extension_type::<VariantType>().is_ok());
+ if target.typed_value_field().is_none() && !has_variant_fields {
return shred_basic_variant(target, VariantPath::default(),
Some(as_field));
}
- let children = fields
- .iter()
- .map(|field| {
- shredded_get_path(
- &target,
- &[VariantPathElement::from(field.name().as_str())],
- Some(field),
- cast_options,
- )
- })
- .collect::<Result<Vec<_>>>()?;
+ let mut updated_fields = Vec::with_capacity(fields.len());
+ let mut children = Vec::with_capacity(fields.len());
+ for field in fields.iter() {
+ // If the field has VariantType extension metadata, extract it as a
+ // VariantArray instead of casting to the declared data type. This
allows
+ // callers to request structs where some fields remain as variants.
+ // See test_struct_extraction_with_variant_fields for usage
example.
+ let is_variant_field =
field.try_extension_type::<VariantType>().is_ok();
+ let field_as_type = (!is_variant_field).then(|| field.as_ref());
+ let child = shredded_get_path(
+ &target,
+ &[VariantPathElement::from(field.name().as_str())],
+ field_as_type,
+ cast_options,
+ )?;
+
+ if is_variant_field {
+ // Variant field: the child's actual data type (VariantArray's
internal
+ // StructArray) differs from the user's declared type (empty
Struct),
Review Comment:
Wait... is it ever (supposed to be) an empty struct?
The extension type blindly accepts any struct, so I guess that does work.
And I don't see any code that would help a user create a correct physical
variant struct (binary variant, shredded, etc).
Which opens up an interesting design question, because a request for
variant_get to return variant could be any of three "flavors":
1. Generic
* "give me back variant, in whatever form it takes"
* If the underlying path is already variant, return it unchanged
* If strongly typed, there's a very fast path to shred it -- especially
if the column contains no nulls
* Probably the most common, and also what this PR does (I think)
2. Unshredded
* "give me back binary variant no matter what form the data takes"
* If the underlying path is variant, unshred it as needed
* If strongly typed, convert to binary variant
* Useful for a caller who doesn't want to deal with (or cannot handle)
the complexity and physical schema heterogeneity of shredding
3. Shredded
* "give me back shredded variant following the specific shredding schema
I provided"
* If the underlying path is variant, shred or reshred as needed
* If the underlying path is strongly typed, shred it using the provided
shredding schema
* Useful for a caller who expects a specific format to the data but needs
to handle outliers
I guess the best way to handle this is for users to send an empty struct for
1/, with a non-empty struct triggering 2/ and 3/ as appropriate? With the usual
recursive pathing to handle each leaf field as efficiently as possible? But we
don't currently have a way to pass a shredding schema, so everyone is forced to
1/.
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
// For shredded/partially-shredded targets (`typed_value` present),
recurse into each field
// separately to take advantage of deeper shredding in child fields.
if let DataType::Struct(fields) = as_field.data_type() {
- if target.typed_value_field().is_none() {
+ let has_variant_fields = fields
+ .iter()
+ .any(|f| f.try_extension_type::<VariantType>().is_ok());
+ if target.typed_value_field().is_none() && !has_variant_fields {
return shred_basic_variant(target, VariantPath::default(),
Some(as_field));
}
- let children = fields
- .iter()
- .map(|field| {
- shredded_get_path(
- &target,
- &[VariantPathElement::from(field.name().as_str())],
- Some(field),
- cast_options,
- )
- })
- .collect::<Result<Vec<_>>>()?;
+ let mut updated_fields = Vec::with_capacity(fields.len());
+ let mut children = Vec::with_capacity(fields.len());
+ for field in fields.iter() {
+ // If the field has VariantType extension metadata, extract it as a
+ // VariantArray instead of casting to the declared data type. This
allows
+ // callers to request structs where some fields remain as variants.
+ // See test_struct_extraction_with_variant_fields for usage
example.
+ let is_variant_field =
field.try_extension_type::<VariantType>().is_ok();
+ let field_as_type = (!is_variant_field).then(|| field.as_ref());
+ let child = shredded_get_path(
+ &target,
+ &[VariantPathElement::from(field.name().as_str())],
+ field_as_type,
+ cast_options,
+ )?;
+
+ if is_variant_field {
+ // Variant field: the child's actual data type (VariantArray's
internal
+ // StructArray) differs from the user's declared type (empty
Struct),
+ // so we must update the field's data type and re-attach
VariantType.
+ //
+ // When the field is entirely absent, shredded_get_path
returns a
+ // NullArray. Construct an all-null VariantArray so the
extension
+ // metadata is preserved.
+ let child = if child.data_type() == &DataType::Null {
+ let mut builder = VariantArrayBuilder::new(child.len());
+ for _ in 0..child.len() {
+ builder.append_null();
+ }
+ Arc::new(builder.build().into_inner()) as ArrayRef
+ } else {
+ child
+ };
+ let new_field = field
+ .as_ref()
+ .clone()
+ .with_data_type(child.data_type().clone())
+ .with_extension_type(VariantType);
+ updated_fields.push(new_field);
Review Comment:
nit: We're cloning and then discarding the field's existing data type.
```suggestion
let new_field = Field::new(
field.name().clone(),
child.data_type().clone(),
field.is_nullable(),
);
updated_fields.push(Arc::new(new_field.with_extension_type(VariantType)));
```
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -223,26 +226,65 @@ fn shredded_get_path(
// For shredded/partially-shredded targets (`typed_value` present),
recurse into each field
// separately to take advantage of deeper shredding in child fields.
if let DataType::Struct(fields) = as_field.data_type() {
- if target.typed_value_field().is_none() {
+ let has_variant_fields = fields
+ .iter()
+ .any(|f| f.try_extension_type::<VariantType>().is_ok());
+ if target.typed_value_field().is_none() && !has_variant_fields {
return shred_basic_variant(target, VariantPath::default(),
Some(as_field));
}
- let children = fields
- .iter()
- .map(|field| {
- shredded_get_path(
- &target,
- &[VariantPathElement::from(field.name().as_str())],
- Some(field),
- cast_options,
- )
- })
- .collect::<Result<Vec<_>>>()?;
+ let mut updated_fields = Vec::with_capacity(fields.len());
+ let mut children = Vec::with_capacity(fields.len());
+ for field in fields.iter() {
+ // If the field has VariantType extension metadata, extract it as a
+ // VariantArray instead of casting to the declared data type. This
allows
+ // callers to request structs where some fields remain as variants.
+ // See test_struct_extraction_with_variant_fields for usage
example.
+ let is_variant_field =
field.try_extension_type::<VariantType>().is_ok();
+ let field_as_type = (!is_variant_field).then(|| field.as_ref());
+ let child = shredded_get_path(
+ &target,
+ &[VariantPathElement::from(field.name().as_str())],
+ field_as_type,
+ cast_options,
+ )?;
+
+ if is_variant_field {
+ // Variant field: the child's actual data type (VariantArray's
internal
+ // StructArray) differs from the user's declared type (empty
Struct),
Review Comment:
This relates to
https://github.com/apache/arrow-rs/pull/9598#discussion_r3022228974, where I
was worried that variant extension type would "leak" -- because I had been
assuming we covered all three cases above, and that variant extension type was
used recursively as the marker to distinguish between strongly typed vs.
shredding schemas (which becomes a problem once we recurse into a shredded
schema).
In effect, we have several different "flavors" of calls to
`shredded_path_get`:
1. No specific type requested (return whatever we find)
2. Strong type requested (parse out the variant values as needed)
3. Specific variant type requested (provided field is a struct with variant
extension type metadata)
4. Recursing into the shredded field of a specific requested variant type
The problem is, a strongly typed struct (2/) is indistinguishable from a
shredded variant field (4/) -- both fields are structs lacking variant
extension type metadata.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]