scovich opened a new issue, #7715:
URL: https://github.com/apache/arrow-rs/issues/7715

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   The variant shredding specification allows variant values to be 
"shredded": part of the overall variant is strongly typed, and the rest remains 
normal binary variant. Working with shredded variant values requires writers to 
pull out the specific subsets of a variant object that match a target schema 
("shredding"). It also requires readers to potentially "unshred" by injecting 
the strongly typed values back into the binary variant they came from. 
   
   Partial shredding of object values increases the complexity significantly -- 
some fields of an object could be shredded out as a struct while others remain 
binary variant, and so on recursively.
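   To make the partial-shredding shape concrete, here is a minimal sketch. The `Value` and `Schema` types are hypothetical stand-ins for illustration only -- the real crate works with binary-encoded `Variant` values, not this toy enum. The sketch recursively splits an object into a strongly typed part and a residual part according to a target schema:

```rust
use std::collections::BTreeMap;

// Toy stand-ins, NOT the real binary Variant representation.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

// A target shredding schema: Leaf = shred this field out as a typed
// value; Object = recurse into a partially shredded sub-object.
enum Schema {
    Leaf,
    Object(BTreeMap<String, Schema>),
}

// Split `value` into (shredded part, residual binary-variant part).
// Fields not named in the schema stay residual.
fn shred(value: Value, schema: &Schema) -> (Option<Value>, Option<Value>) {
    match (value, schema) {
        (v, Schema::Leaf) => (Some(v), None),
        (Value::Object(fields), Schema::Object(shape)) => {
            let mut typed = BTreeMap::new();
            let mut residual = BTreeMap::new();
            for (name, v) in fields {
                match shape.get(&name) {
                    Some(sub) => {
                        let (t, r) = shred(v, sub);
                        if let Some(t) = t {
                            typed.insert(name.clone(), t);
                        }
                        if let Some(r) = r {
                            residual.insert(name, r);
                        }
                    }
                    None => {
                        residual.insert(name, v);
                    }
                }
            }
            let typed = (!typed.is_empty()).then(|| Value::Object(typed));
            let residual = (!residual.is_empty()).then(|| Value::Object(residual));
            (typed, residual)
        }
        // A non-object value cannot match an object schema: all residual.
        (v, Schema::Object(_)) => (None, Some(v)),
    }
}

fn main() {
    // Shred {a: 1, b: "x", c: {d: 2, e: 3}} against schema {a, c: {d}}.
    let schema = Schema::Object(BTreeMap::from([
        ("a".to_string(), Schema::Leaf),
        (
            "c".to_string(),
            Schema::Object(BTreeMap::from([("d".to_string(), Schema::Leaf)])),
        ),
    ]));
    let value = Value::Object(BTreeMap::from([
        ("a".to_string(), Value::Int(1)),
        ("b".to_string(), Value::Str("x".to_string())),
        (
            "c".to_string(),
            Value::Object(BTreeMap::from([
                ("d".to_string(), Value::Int(2)),
                ("e".to_string(), Value::Int(3)),
            ])),
        ),
    ]));
    let (typed, residual) = shred(value, &schema);
    // "a" and "c.d" shred out; "b" and "c.e" stay residual.
    assert_eq!(
        typed,
        Some(Value::Object(BTreeMap::from([
            ("a".to_string(), Value::Int(1)),
            (
                "c".to_string(),
                Value::Object(BTreeMap::from([("d".to_string(), Value::Int(2))]))
            ),
        ])))
    );
    assert_eq!(
        residual,
        Some(Value::Object(BTreeMap::from([
            ("b".to_string(), Value::Str("x".to_string())),
            (
                "c".to_string(),
                Value::Object(BTreeMap::from([("e".to_string(), Value::Int(3))]))
            ),
        ])))
    );
    println!("partial shred ok");
}
```

   Fields named in the schema shred out recursively; everything else lands in the residual object, which is exactly the "some fields shredded, others not" case above.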
   
   NOTE: The specification mandates that the variant metadata dictionary must 
contain all path parts, regardless of whether a given path is shredded or not. 
So the unshredding operation does not modify the variant metadata dictionary.
   
   **Describe the solution you'd like**
   
   
   Ultimately, shredding and unshredding will be a problem for arrow-array 
and/or arrow-compute to solve (see below). But those higher level operations 
will need low-level support from `Variant` and its decoders/builders in order 
to do their work.
   
   We should start figuring out what that low-level support looks like. A 
likely starting point would be the ability to insert and remove specific 
variant values from an existing variant object. These should be cheap 
byte-shuffling operations that don't waste time introspecting unrelated parts 
of the variant value buffer, and they must remain efficient even when 
performing recursive inserts and removes as part of a partial (un)shredding 
operation.
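   As a sketch of what "cheap byte shuffling" could mean, consider a toy layout (the hypothetical `ToyObject` below, NOT the actual Variant encoding): a field table plus one contiguous value buffer. Removing a field splices its value bytes out and patches only the offsets that follow it; inserting appends bytes and records the field in id order. Neither operation decodes any other field's value bytes:

```rust
// Toy layout for illustration only -- not the real Variant binary
// encoding. An object is a field table (sorted by field id) plus one
// contiguous value buffer.
#[derive(Debug, PartialEq)]
struct ToyObject {
    fields: Vec<(u32, usize, usize)>, // (field_id, offset, len)
    values: Vec<u8>,
}

impl ToyObject {
    /// Splice out one field's value bytes; only offsets that sit after
    /// the removed range need adjusting, so cost is O(bytes moved).
    fn remove(&mut self, field_id: u32) -> Option<Vec<u8>> {
        let idx = self.fields.iter().position(|&(id, _, _)| id == field_id)?;
        let (_, off, len) = self.fields.remove(idx);
        let removed: Vec<u8> = self.values.drain(off..off + len).collect();
        for f in self.fields.iter_mut() {
            if f.1 > off {
                f.1 -= len; // shift later offsets; value bytes untouched
            }
        }
        Some(removed)
    }

    /// Append the new value bytes and record the field in id order.
    fn insert(&mut self, field_id: u32, bytes: &[u8]) {
        let off = self.values.len();
        self.values.extend_from_slice(bytes);
        let pos = self.fields.partition_point(|&(id, _, _)| id < field_id);
        self.fields.insert(pos, (field_id, off, bytes.len()));
    }
}

fn main() {
    let mut obj = ToyObject {
        fields: vec![(1, 0, 2), (2, 2, 1)],
        values: vec![0xAA, 0xBB, 0xCC],
    };
    assert_eq!(obj.remove(1), Some(vec![0xAA, 0xBB]));
    assert_eq!(obj.fields, vec![(2, 0, 1)]); // offset shifted down by 2
    assert_eq!(obj.values, vec![0xCC]);
    obj.insert(3, &[0x01]);
    assert_eq!(obj.values, vec![0xCC, 0x01]);
    println!("byte shuffle ok");
}
```

   The real encoding differs (offsets are variable-width, field ids index the metadata dictionary), but the shape of the operation -- splice plus offset fixup, no decoding of unrelated fields -- is the property the low-level API would want.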
   
   At the higher level: 
   
   The parquet reader and writer will just use whatever shredding schema they 
receive from the parquet footer or user, respectively. No special low-level 
variant support is needed there. But a user wishing to write shredded parquet 
will need a way to convert an Array of binary variant values into an Array of 
shredded variant values, or a strongly typed Array (e.g. StructArray) into an 
Array of shredded variant values. And a user wishing to read shredded parquet 
will need a way to convert an Array of shredded variant values (with a 
specific shredding schema) into an Array of binary variant values, an Array of 
shredded variant values having a different shredding schema, or a strongly 
typed Array (e.g. StructArray). 
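   A toy version of the first conversion (an array of binary variants into a shredded array), using plain `Vec`s as hypothetical stand-ins for the real Arrow array types. Each row's typed field is pulled into its own column, with a null where the row lacks the field, roughly mirroring the spec's `typed_value`/`value` pair:

```rust
// Rows of (name, i64) pairs play the role of binary variant objects;
// this is illustration only, not a real VariantArray.
fn shred_column(
    rows: Vec<Vec<(String, i64)>>,
    target: &str,
) -> (Vec<Option<i64>>, Vec<Vec<(String, i64)>>) {
    let mut typed = Vec::with_capacity(rows.len());
    let mut residual = Vec::with_capacity(rows.len());
    for mut row in rows {
        match row.iter().position(|(k, _)| k == target) {
            Some(i) => typed.push(Some(row.remove(i).1)),
            None => typed.push(None), // field absent: typed_value is null
        }
        residual.push(row); // everything else stays binary variant
    }
    (typed, residual)
}

fn main() {
    let rows = vec![
        vec![("a".to_string(), 1), ("b".to_string(), 2)],
        vec![("b".to_string(), 3)],
    ];
    let (typed, residual) = shred_column(rows, "a");
    assert_eq!(typed, vec![Some(1), None]);
    assert_eq!(
        residual,
        vec![vec![("b".to_string(), 2)], vec![("b".to_string(), 3)]]
    );
    println!("column shred ok");
}
```

   Reading shredded parquet would run the inverse merge per row, which is where the cheap insert/remove primitives described above come in.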
   
   **Describe alternatives you've considered**
   
   Just starting to think about this, and realizing we should probably start 
figuring out the low-level building blocks that arrow-array will eventually 
rely on. Now that we actually have variant builders and decoders, we can 
probably make progress here.
   
   **Additional context**
   
   https://github.com/apache/parquet-format/blob/master/VariantShredding.md

