alamb commented on code in PR #8021:
URL: https://github.com/apache/arrow-rs/pull/8021#discussion_r2256887779
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -135,36 +140,189 @@ impl VariantArray {
self.inner
}
+ /// Return the shredding state of this `VariantArray`
+ pub fn shredding_state(&self) -> &ShreddingState {
+ &self.shredding_state
+ }
+
/// Return the [`Variant`] instance stored at the given row
///
- /// Panics if the index is out of bounds.
+ /// Consistently with other Arrow arrays types, this API requires you to
+ /// check for nulls first using [`Self::is_valid`].
+ ///
+ /// # Panics
+ /// * if the index is out of bounds
+ /// * if the array value is null
+ ///
+ /// If this is a shredded variant but has no value at the shredded
location, it
+ /// will return [`Variant::Null`].
+ ///
+ ///
+ /// # Performance Note
+ ///
+ /// This is certainly not the most efficient way to access values in a
+ /// `VariantArray`, but it is useful for testing and debugging.
///
/// Note: Does not do deep validation of the [`Variant`], so it is up to
the
/// caller to ensure that the metadata and value were constructed
correctly.
pub fn value(&self, index: usize) -> Variant {
- let metadata = self.metadata_field().as_binary_view().value(index);
- let value = self.value_field().as_binary_view().value(index);
- Variant::new(metadata, value)
+ match &self.shredding_state {
+ ShreddingState::Unshredded { metadata, value } => {
+ Variant::new(metadata.value(index), value.value(index))
+ }
+ ShreddingState::FullyShredded {
+ metadata: _,
+ typed_value,
+ } => {
+ if typed_value.is_null(index) {
+ Variant::Null
+ } else {
+ typed_value_to_variant(typed_value, index)
+ }
+ }
+ ShreddingState::PartiallyShredded {
+ metadata,
+ value,
+ typed_value,
+ } => {
+ if typed_value.is_null(index) {
+ Variant::new(metadata.value(index), value.value(index))
+ } else {
+ typed_value_to_variant(typed_value, index)
+ }
+ }
+ }
Review Comment:
> Part of me wonders whether this "shredding state" enum is actually
helping, vs. just storing `value` and `typed_value` as options?
The reason I first introduced the enum was that in `variant_get` I needed to
dispatch to different methods based on the type of shredding, and I found it
myself really wanting to have names for them rather than just the combinations
of value and typed value.
Another reason reason was that I ended up with redundant error checking that
was in the `VariantArray::try_new` (though if perfectly null arrays are
allowed, maybe this becomes irrelevant)
> Especially since the [shredding
spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding)
seems to suggest that none-none case is a valid combination?
I personally think having names for the different shredding types makes
working with the code easier, even if all 4 combinations are valid.
I think you are right, however, that the shredding spec allows both to be
None. I double checked the [Arrow
Proposal](https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?tab=t.0#heading=h.ru2zdn2szuc4)
as it also implies both being null is permitted:
> An optional field named value that is binary, large_binary, or binary_view
> An optional field named typed_value which can be any primitive type or be
a
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]