scovich commented on issue #8319:
URL: https://github.com/apache/arrow-rs/issues/8319#issuecomment-3285956937

   It seems like no matter what, a variant column is just a plain old 
ordinary `StructType` and `StructArray`. 
   
   We could take several approaches to the variant extension type (not 
necessarily mutually exclusive):
   1. Whatever schema/struct/map/list contains the variant column annotates the 
column's field as `parquet.variant`.
   2. Inside a shredded variant, the `typed_value` field of a (partially) 
shredded variant object field is annotated as 
`parquet.variant.shredded-object-field` and the `typed_value` field of a 
shredded variant array element is annotated as 
`parquet.variant.shredded-array-element`. 
   3. The `metadata` and `value` fields of the variant struct are themselves 
extension types (e.g. `parquet.variant.metadata` and `parquet.variant.binary`, 
respectively).
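
   To make the three annotation levels concrete, here is a minimal sketch 
using only the standard library. `ToyField` and `annotated_variant_column` are 
hypothetical stand-ins, not arrow-rs APIs; the only non-invented pieces are the 
`ARROW:extension:name` metadata key from the Arrow format and the extension 
names proposed above. It shows an unshredded column annotated per 1/ and 3/.

```rust
use std::collections::HashMap;

/// Metadata key the Arrow format uses to name a field's extension type.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical stand-in for an Arrow field: a name, string metadata,
/// and (for struct-like fields) child fields.
#[derive(Debug, Clone, Default)]
struct ToyField {
    name: String,
    metadata: HashMap<String, String>,
    children: Vec<ToyField>,
}

impl ToyField {
    fn new(name: &str) -> Self {
        ToyField { name: name.to_string(), ..Default::default() }
    }

    /// Annotate this field with an extension type name.
    fn with_extension(mut self, ext: &str) -> Self {
        self.metadata.insert(EXTENSION_NAME_KEY.to_string(), ext.to_string());
        self
    }

    fn with_children(mut self, children: Vec<ToyField>) -> Self {
        self.children = children;
        self
    }

    /// The extension type name, if the field carries one.
    fn extension_name(&self) -> Option<&str> {
        self.metadata.get(EXTENSION_NAME_KEY).map(String::as_str)
    }
}

/// Approach 1/ annotates the column itself; approach 3/ annotates its
/// `metadata` and `value` children as well.
fn annotated_variant_column(name: &str) -> ToyField {
    ToyField::new(name)
        .with_extension("parquet.variant")
        .with_children(vec![
            ToyField::new("metadata").with_extension("parquet.variant.metadata"),
            ToyField::new("value").with_extension("parquet.variant.binary"),
        ])
}
```

   Approach 2/ would add the same kind of annotation to the `typed_value` 
fields inside a shredded column; that is left out here to keep the sketch small.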
   
   1/ is probably the most intuitive way for users to track variant columns in 
their schemas and data. 
   * So we probably want to do that (following the example of other canonical 
extension types). 
   * But if we _only_ implement 1/, I'm not sure it gives us as much 
"protection" as we'd like? 
   * For example, `cast_to_variant` takes `&dyn Array` as input, and would be 
incurably vulnerable to a user who passes a variant column (either 
accidentally, or intentionally hoping to unshred it). 
   * Fortunately, `variant_get` takes a `FieldRef` and so can conceivably be 
taught to handle a request to extract variant data (binary or shredded) from 
the input (including a request to extract a struct with variant fields).
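
   A rough illustration of why the parent annotation alone does not protect 
a data-only entry point, again with hypothetical stand-in types rather than 
real arrow-rs ones: `ToyArray` plays the role of the `&dyn Array` argument and 
`ToyField` the role of the `FieldRef`.

```rust
use std::collections::HashMap;

/// Metadata key the Arrow format uses to name a field's extension type.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical stand-in for array data: it knows its child names but,
/// under approach 1/ alone, carries no annotation of its own.
struct ToyArray {
    child_names: Vec<String>,
}

/// Hypothetical stand-in for the field describing a column.
struct ToyField {
    metadata: HashMap<String, String>,
}

/// A data-only entry point can only guess from the shape of the struct.
fn guess_variant_from_shape(array: &ToyArray) -> bool {
    ["metadata", "value"]
        .iter()
        .all(|wanted| array.child_names.iter().any(|c| c.as_str() == *wanted))
}

/// An entry point that receives the field can check the annotation directly.
fn is_annotated_variant(field: &ToyField) -> bool {
    field.metadata.get(EXTENSION_NAME_KEY).map(String::as_str) == Some("parquet.variant")
}
```

   A user struct that happens to have `metadata` and `value` children passes 
the shape check just like a real variant column does, so only the field-based 
check can tell the two apart.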
   
   If we do 1/, we probably want to do 2/ as well. 
   * Shredded object fields and array elements have different structure and 
null-handling semantics than top-level variant columns do, even tho they are 
otherwise quite similar.
   * On the other hand, we can at least know from context when to expect (or 
not expect) a shredded object field or array element, as long as we reliably 
know when we start traversing a (possibly shredded) variant column.
   
   If we do 3/, then we can identify variant data even if the caller forgot to 
check the field that contained the struct array or struct type they passed us. 
   * One could imagine that it's enough to _only_ do 3/, tho we might have to 
do more reverse engineering work when working with variant data.
   * However, there's additional complexity because now extension types are 
buried deep within a nested top-level extension type. I could imagine that 
causing some issues if we're not careful? A similar risk applies to 2/ as well, 
now that I think about it.
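
   As a sketch of what 3/ buys us: a struct type owns its child fields, so 
child annotations travel with the type even when the caller never looked at 
the parent field. The types and helpers below (`ToyType`, `ToyField`, `field`, 
`is_variant_type`) are hypothetical stand-ins, and the check only covers the 
unshredded layout; shredded columns would need the extra reverse engineering 
mentioned above.

```rust
use std::collections::HashMap;

/// Metadata key the Arrow format uses to name a field's extension type.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical stand-in for a data type; struct types own their child fields.
enum ToyType {
    Binary,
    Struct(Vec<ToyField>),
}

/// Hypothetical stand-in for a field with optional extension annotation.
struct ToyField {
    name: String,
    data_type: ToyType,
    metadata: HashMap<String, String>,
}

/// Build a field, optionally annotated with an extension type name.
fn field(name: &str, data_type: ToyType, ext: Option<&str>) -> ToyField {
    let mut metadata = HashMap::new();
    if let Some(ext) = ext {
        metadata.insert(EXTENSION_NAME_KEY.to_string(), ext.to_string());
    }
    ToyField { name: name.to_string(), data_type, metadata }
}

/// True if `children` has a field with this name and extension annotation.
fn has_child(children: &[ToyField], name: &str, ext: &str) -> bool {
    children.iter().any(|f| {
        f.name == name
            && f.metadata.get(EXTENSION_NAME_KEY).map(String::as_str) == Some(ext)
    })
}

/// Recognize (unshredded) variant data from the struct type alone.
fn is_variant_type(ty: &ToyType) -> bool {
    match ty {
        ToyType::Struct(children) => {
            has_child(children, "metadata", "parquet.variant.metadata")
                && has_child(children, "value", "parquet.variant.binary")
        }
        ToyType::Binary => false,
    }
}
```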
   
   
   This probably needs some pathfinding...
   
   

