scovich commented on code in PR #7915: URL: https://github.com/apache/arrow-rs/pull/7915#discussion_r2204765601
########## parquet-variant/src/builder.rs: ########## @@ -350,14 +378,59 @@ impl<S: AsRef<str>> FromIterator<S> for MetadataBuilder { } } -impl<S: AsRef<str>> Extend<S> for MetadataBuilder { +impl<S: AsRef<str>> Extend<S> for DefaultMetadataBuilder { fn extend<T: IntoIterator<Item = S>>(&mut self, iter: T) { for field_name in iter { self.upsert_field_name(field_name.as_ref()); } } } +/// Read-only metadata builder that validates field names against an existing metadata dictionary Review Comment: > how do we plan on storing `typed_value`s? Do we plan on encoding it as a `Variant` and storing the binary representation of it? As I understand it, we don't need to store `typed_value` in the first place. This is low-level variant code here, and a given variant builder only knows about the _unshredded_ bits (it's trying to build up a `value` column). Whoever manages the builder will be in charge of the _shredded_ bits, and is also in charge of providing an appropriate metadata "builder" for the `metadata` column. I think @alamb is proposing in this specific case to convert the `typed_value` into a proper variant, but then "merge" it into the existing variant using a view, instead of copying (some of my thoughts on _that_ below). Aside: I think there are actually _three_ kinds of metadata builders * Today's traditional `MetadataBuilder` (renamed here as `DefaultMetadataBuilder`) * Used to build up a top-level variant. Every newly encountered path is an insertion into that metadata dictionary * A metadata builder used only for unshredding a `typed_value` column (`ReadOnlyMetadataBuilder` here) * A "borrowed" parent metadata builder (used during shredding operations, not present in this PR) * See https://github.com/apache/arrow-rs/pull/7921#discussion_r2204684924 * Used for building up variant columns that happen to live inside the `typed_value` column of some shredded parent variant ancestor column, since the spec requires that outer metadata dictionary to contain _all_ paths reachable through that ancestor variant column. > What if we made a _view_ into the variant so we didn't have to copy anything to read the Variant? Interesting. Trying to wrap my head around this idea, so what follows is likely a bit rambly: * What do we gain by using views at this low level? * At the high level, the `VariantArray` just stores an actual (strongly-typed) `typed_value` columns. It doesn't need to unshred at all, unless/until somebody executes a `variant_get` kernel with a path that ends inside the `typed_value` column. * In that case, we have no choice but to unshred that path, which is a plain old ordinary variant building exercise because the spec forbids both `value` and `typed_value` to contain the same path -- except the `metadata` column is already fully built and we can just reuse it as `ReadOnlyMetadata`. * If somebody does a `variant_get` on any path that ends inside the `value` column, we just extract that variant as normal (the shredded bits provably don't matter). * If somebody actually _asks_ to unshred the entire top-level column (e.g. invokes `unshred_variant` in preparation for a write), the result is a simple struct-of-binary columns. No (low level) `Variant` anywhere in sight -- they're an internal detail of that kernel. * If we're anyway forced to write out normal variant bytes, where/how does the view help along the way? * Nesting will be "fun" * In the example from my comment on the other PR (linked above), one could have a (horrible) nested scenario like: `v.typed_value.a.b.c.w.typed_value.x.y.z.u: VARIANT`. If we tried to unshred that using views, I'm pretty sure it would quickly fall apart, because `Variant` is ultimately already a view on top of the (one) underlying `&[u8]` it was built from. So there's no way a field of a `Variant::ShreddedObject` could itself contain other `Variant::ShreddedObject`. * Intuitive claim (unsubstantiated): * If we could get away with "views" here on the unshredding path, why can't the native builder get away with "views" internally to avoid all the repeated copying of bytes as we unwind nested builders? * But every time I've tried to get fancy with copy-free nested builders, it gets ugly fast. * So (again, intuitively), I would conclude that any attempt at views here would also get ugly fast. * Or, if it turns out we can elegantly solve it here, let's be sure to apply the solution there! * Isolated reaction: I would favor `Variant::ShreddedObject` as its own enum variant, rather than making `Variant::Object` more complex. They're pretty fundamentally different cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org