scovich commented on code in PR #9663:
URL: https://github.com/apache/arrow-rs/pull/9663#discussion_r3101216129
##########
parquet-variant-compute/src/shred_variant.rs:
##########
@@ -321,12 +328,19 @@ impl<'a> VariantToShreddedArrayVariantRowBuilder<'a> {
// If the variant is not an array, typed_value must be null.
// If the variant is an array, value must be null.
match variant {
- Variant::List(list) => {
+ Variant::List(ref list) => {
self.nulls.append_non_null();
- self.value_builder.append_null();
- self.typed_value_builder
- .append_value(&Variant::List(list))?;
- Ok(true)
+
+ // With `safe` cast option set to false, appending list of
wrong size to
+ // `typed_value_builder` of type `FixedSizeList` will result
in an error. In such a
+ // case, the provided list should be appended to the
`value_builder.
Review Comment:
I don't know...
My gut reaction is to forbid it completely because it's preferable to fail
early and consistently rather than blow up unpredictably partway through
shredding. The spec forbids us to fallback to `value` when the length is wrong,
and we can't emit NULL either (regardless of safe casting options) because
shredding is supposed to be lossless (it isn't a converting cast).
But I also don't have any intuition of actual use cases for this feature, to
weigh its benefits against the downsides.
NOTE: The same dilemma applies to fixed length binary and fixed length
string, _except_ the spec technically doesn't forbid us to fallback to the
`value` column. But how would we actually _use_ it?
# Thought experiment
Suppose we _could_ use the `value` column as fallback in all three cases?
Then the writer produces a shredded variant column where all wrong-length
values are in `value` column (along with the wrong-type values) while all
right-type-right-length values are in `typed_value`. But what good does it do?
A reader who comes along later has no way to know about any of this, because
there's nothing in the parquet that reliably indicates this was a fixed length
array/binary/string column. Sure, we might embed arrow metadata, but non-arrow
readers would ignore it and even arrow readers are on shaky ground trusting it
because this is something entirely outside the variant spec.
Overall, enforcing list/binary/string length feels like a logical validation
issue while shredding is a physical operation. And I suspect it's best to leave
client code to enforce logical validity of all forms, not try to encode it in
the type system. Additionally, It feels like fixed-len-xxx types fall in the
same category as unsigned or non-zero types. True, arrow has unsigned int types
but it lacks non-zero types. And parquet has neither.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]