scovich opened a new issue, #7715:
URL: https://github.com/apache/arrow-rs/issues/7715
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
The variant shredding specification allows for variant values to be
"shredded" where part of the overall variant is strongly-typed and part is
normal binary variant. Working with shredded variant values requires writers to
pull out specific subsets of a variant object that match a target schema
("shredding"). It also requires readers to potentially "unshred" by injecting
strongly typed values back into the binary variant they came from.
Partial shredding of object values increases the complexity significantly --
some fields of an object could be shredded out into a strongly typed struct
while others remain binary variant, and the same applies recursively to nested
objects.
NOTE: The specification mandates that the variant metadata dictionary must
contain all path parts, regardless of whether a given path is shredded or not.
So the unshredding operation does not modify the variant metadata dictionary.
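To make the shape of these two operations concrete, here is a toy sketch that models a variant object as a plain map from field name to an opaque value, rather than its real binary encoding. None of this is the actual `Variant` API; `shred` and `unshred` are illustrative names:

```rust
use std::collections::BTreeMap;

// Toy model of partial object shredding: given a target shredding schema
// (the set of field names to shred out), split the object into the strongly
// typed part and the residual untyped part. The real operation works on
// Variant binary buffers, but the shape of the result is the same.
fn shred(
    object: &BTreeMap<String, String>,
    schema: &[&str],
) -> (BTreeMap<String, String>, BTreeMap<String, String>) {
    let mut typed = BTreeMap::new();
    let mut residual = BTreeMap::new();
    for (k, v) in object {
        if schema.contains(&k.as_str()) {
            typed.insert(k.clone(), v.clone());
        } else {
            residual.insert(k.clone(), v.clone());
        }
    }
    (typed, residual)
}

// Unshredding is the inverse: inject the typed fields back into the
// residual object. Per the spec note above, the metadata dictionary already
// contains every path, so only the value side changes.
fn unshred(
    typed: &BTreeMap<String, String>,
    residual: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut out = residual.clone();
    out.extend(typed.iter().map(|(k, v)| (k.clone(), v.clone())));
    out
}
```

The round trip `unshred(shred(obj, schema))` reproduces the original object, which is the invariant the real byte-level operations would need to preserve.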
**Describe the solution you'd like**
Ultimately, shredding and unshredding will be a problem for arrow-array
and/or arrow-compute to solve (see below). But those higher level operations
will need low-level support from `Variant` and its decoders/builders in order
to do their work.
We should start figuring out what that low-level support looks like. A
likely starting point would be the ability to insert and remove specific
variant values from an existing variant object. These should be cheap
byte-shuffling operations that don't waste time introspecting unrelated parts
of the variant value buffer, and they need to stay efficient even when doing
recursive inserts and removes as part of a partial (un)shredding operation.
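The "cheap byte shuffling" idea can be demonstrated on a made-up fixed-width encoding (one-byte field ids and offsets, loosely inspired by but NOT the real Variant object layout): removing a field only reads the header and offsets, and copies every other field's value bytes verbatim without decoding them.

```rust
// Toy object encoding: [count][field ids...][count+1 offsets...][value bytes].
// Fixed one-byte widths keep the sketch short; the real format is variable.
fn encode(fields: &[(u8, &[u8])]) -> Vec<u8> {
    let n = fields.len();
    let mut out = vec![n as u8];
    out.extend(fields.iter().map(|(id, _)| *id));
    let mut off = 0u8;
    out.push(off);
    for (_, v) in fields {
        off += v.len() as u8;
        out.push(off);
    }
    for (_, v) in fields {
        out.extend_from_slice(v);
    }
    out
}

// Remove one field by pure byte shuffling: inspect only the ids and offsets,
// never the bytes of unrelated field values.
fn remove_field(buf: &[u8], field_id: u8) -> Vec<u8> {
    let n = buf[0] as usize;
    let ids = &buf[1..1 + n];
    let offs = &buf[1 + n..2 + 2 * n];
    let vals = &buf[2 + 2 * n..];
    let i = ids.iter().position(|&id| id == field_id).expect("field present");
    let (start, end) = (offs[i] as usize, offs[i + 1] as usize);
    let removed = (end - start) as u8;

    let mut out = vec![(n - 1) as u8];
    // Field ids, minus the removed one.
    out.extend(ids[..i].iter().chain(&ids[i + 1..]));
    // Offsets before the removed field are unchanged; later ones shift left.
    out.extend_from_slice(&offs[..=i]);
    out.extend(offs[i + 2..].iter().map(|&o| o - removed));
    // Value bytes of the surviving fields are copied verbatim, never decoded.
    out.extend_from_slice(&vals[..start]);
    out.extend_from_slice(&vals[end..]);
    out
}
```

Insertion is the mirror image (splice in an id, an offset, and the value bytes, shifting later offsets right), and both generalize to recursive application on nested objects.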
At the higher level:
The parquet reader and writer will just use whatever shredding schema they
receive from the parquet footer or user, respectively. No special low-level
variant support needed there. But a user wishing to write shredded parquet will
need a way to convert an Array of binary variant values into an Array of
shredded variant values, or a strongly typed Array (e.g. StructArray) into an
Array of shredded variant values. And a user wishing to read shredded parquet
will need a way to convert an Array of shredded variant values (with a
specific shredding schema) into an Array of binary variant values, an Array of
shredded variant values with a different shredding schema, or a strongly
typed Array (e.g. StructArray).
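An array-level sketch of the reader-side conversion, with plain `Vec`s standing in for Arrow arrays and maps standing in for binary variant objects (the function and column names here are illustrative, not an existing arrow-rs API):

```rust
use std::collections::BTreeMap;

type Obj = BTreeMap<String, i64>;

// Array-level unshredding sketch: one field (`field`) was shredded out into
// its own strongly typed column. Per row it is either present there or was
// left inside the residual variant; unshredding injects the typed value back
// so every row is a plain binary variant again.
fn unshred_column(field: &str, typed: &[Option<i64>], residual: &[Obj]) -> Vec<Obj> {
    typed
        .iter()
        .zip(residual)
        .map(|(t, rest)| {
            let mut row = rest.clone();
            if let Some(v) = t {
                // Inject the typed value back under its original field name.
                row.insert(field.to_string(), *v);
            }
            row
        })
        .collect()
}
```

The real kernel would perform this per-row merge with the byte-level insert described above, and recursively for nested shredded fields.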
**Describe alternatives you've considered**
Just starting to think about this, and realizing we should probably start
figuring out the low-level building blocks that arrow-array will eventually
rely on. Now that we actually have variant builders and decoders, we can
probably make progress here.
**Additional context**
https://github.com/apache/parquet-format/blob/master/VariantShredding.md