scovich commented on issue #8319: URL: https://github.com/apache/arrow-rs/issues/8319#issuecomment-3285956937
It seems like, no matter what, a variant column is just a plain old ordinary `StructType` and `StructArray`. We could take several approaches to the variant extension type (not necessarily mutually exclusive):

1. Whatever schema/struct/map/list contains the variant column annotates the column's field as `parquet.variant`.
2. Inside a shredded variant, the `typed_value` field of a (partially) shredded variant object field is annotated as `parquet.variant.shredded-object-field`, and the `typed_value` field of a shredded variant array element is annotated as `parquet.variant.shredded-array-element`.
3. The `metadata` and `value` fields of the variant struct are themselves extension types (e.g. `parquet.variant.metadata` and `parquet.variant.binary`, respectively).

1/ is probably the most intuitive way for users to track variant columns in their schemas and data.
* So we probably want to do that (following the example of other canonical extension types).
* But if we _only_ implement 1/, I'm not sure it gives us as much "protection" as we'd like?
  * For example, `cast_to_variant` takes `&dyn Array` as input, so it never sees the annotation on the containing field, and would be incurably vulnerable to a user who passes a variant column (either accidentally, or intentionally hoping to unshred it).
  * Fortunately, `variant_get` takes a `FieldRef` and so can conceivably be taught to handle a request to extract variant data (binary or shredded) from the input (including a request to extract a struct with variant fields).

If we do 1/, we probably want to do 2/ as well.
* Shredded object fields and array elements have different structure and null-handling semantics than top-level variant columns do, even tho they are otherwise quite similar.
* On the other hand, we can at least know from context when to expect (or not expect) a shredded object field or array element, as long as we reliably know when we start traversing a (possibly shredded) variant column.
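
To make the gap in approach 1/ concrete, here is a minimal, std-only sketch. `Field` and `StructArray` below are simplified stand-ins, not the arrow-rs types, and `is_variant_field` and `looks_like_variant` are hypothetical helpers invented for illustration. The only real convention assumed is that Arrow records an extension type's name under the `ARROW:extension:name` field metadata key. The sketch shows that a function receiving the field can check the annotation, while a function receiving only the array has to guess.

```rust
use std::collections::HashMap;

/// Metadata key Arrow uses to record an extension type's name on a field.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Simplified stand-in for an Arrow field: a name plus string metadata.
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

/// Simplified stand-in for a struct array: it knows its child column names
/// but not the metadata of the field that contains it.
struct StructArray {
    child_names: Vec<String>,
}

impl Field {
    fn new(name: &str) -> Self {
        Field { name: name.to_string(), metadata: HashMap::new() }
    }

    /// Approach 1/: the containing schema annotates the column's field.
    fn with_extension(mut self, ext_name: &str) -> Self {
        self.metadata
            .insert(EXTENSION_NAME_KEY.to_string(), ext_name.to_string());
        self
    }
}

/// Hypothetical check for a function that receives the field.
fn is_variant_field(field: &Field) -> bool {
    field.metadata.get(EXTENSION_NAME_KEY).map(String::as_str) == Some("parquet.variant")
}

/// Hypothetical best-effort check for a function that only receives the
/// array: all it can do is guess from the child column names.
fn looks_like_variant(array: &StructArray) -> bool {
    let has = |n: &str| array.child_names.iter().any(|c| c.as_str() == n);
    has("metadata") && (has("value") || has("typed_value"))
}

fn main() {
    let field = Field::new("v").with_extension("parquet.variant");
    let array = StructArray {
        child_names: vec!["metadata".to_string(), "value".to_string()],
    };

    // With the field in hand, the annotation gives a definite answer.
    println!("{} is variant: {}", field.name, is_variant_field(&field));

    // With only the array, any struct that happens to have these child
    // names is indistinguishable from a real variant column.
    println!("array looks like variant: {}", looks_like_variant(&array));
}
```

The name-based guess accepts any struct with `metadata` and `value` children, which is the kind of false positive (or missed variant) that a field-level annotation alone cannot prevent.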

If we do 3/, then we can identify variant data even if the caller forgot to check the field that contained the struct array or struct type they passed us.
* One could imagine that it's enough to _only_ do 3/, tho we might have to do more reverse engineering work when working with variant data.
* However, there's additional complexity because now extension types are buried deep within a nested top-level extension type. I could imagine that causing some issues if we're not careful? A similar risk applies to 2/ as well, now that I think about it.

This probably needs some pathfinding...
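
As a rough illustration of approach 3/, the following std-only sketch detects variant data from the struct's own child fields, without looking at the containing field. `Field` is again a simplified stand-in rather than the arrow-rs type, `children_mark_variant` is a hypothetical helper, and the extension names are the ones floated above, not established ones. Only the `ARROW:extension:name` metadata key is an existing Arrow convention.

```rust
use std::collections::HashMap;

/// Metadata key Arrow uses to record an extension type's name on a field.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Simplified stand-in for an Arrow field: a name plus string metadata.
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

impl Field {
    /// Builds a field, optionally annotated with an extension type name.
    fn new(name: &str, extension: Option<&str>) -> Self {
        let mut metadata = HashMap::new();
        if let Some(ext) = extension {
            metadata.insert(EXTENSION_NAME_KEY.to_string(), ext.to_string());
        }
        Field { name: name.to_string(), metadata }
    }

    fn extension_name(&self) -> Option<&str> {
        self.metadata.get(EXTENSION_NAME_KEY).map(String::as_str)
    }
}

/// Hypothetical check: a struct is treated as variant data if its
/// `metadata` child carries the (proposed) metadata extension name.
fn children_mark_variant(children: &[Field]) -> bool {
    children.iter().any(|c| {
        c.name == "metadata" && c.extension_name() == Some("parquet.variant.metadata")
    })
}

fn main() {
    let annotated = vec![
        Field::new("metadata", Some("parquet.variant.metadata")),
        Field::new("value", Some("parquet.variant.binary")),
    ];
    let plain = vec![Field::new("metadata", None), Field::new("value", None)];

    println!("annotated children: {}", children_mark_variant(&annotated));
    println!("plain children: {}", children_mark_variant(&plain));
}
```

Because the check uses only the children, it would still work when the caller hands over a bare struct type, at the cost of nesting extension types inside one another.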