shaeqahmed commented on PR #46831:
URL: https://github.com/apache/spark/pull/46831#issuecomment-2148582542

   I read through the proposal and have some thoughts:
   
   It would be really useful to add to this PR a list of the ways nested structs 
(struct-of-structs) and arrays of structs can be represented, because this is 
not clarified. From my understanding, the value at a nested object path can be 
represented in one of three ways: as a fully typed "typed_value"; as a variant 
within a variant (containing a required value/metadata and optional paths); or 
by nesting directly into the paths (a.b) without introducing an intermediate 
definition level.
   
   It looks like the current proposal for columnarization works well if the 
data in a file mostly has one global structure. However, for heterogeneous data 
sources, or data sources with fields that occasionally alternate between more 
than one type of value, there seem to be limitations: for example, a single 
type conflict at some deeply nested path may force the value to be stored in 
the top-level value field bag.
   
   ---
   
   I would like to propose an alternative way to encode the data that allows 
for more flexibility in representing nested structures and also allows for more 
efficient encoding of the data.
    
   - I propose that, to simplify the design, we require every path part to be 
immediately followed by a definition level 
($typed_value_*/$untyped_value_variant) that indicates the type of the value at 
that path, allowing for a fully recursive definition of the variant type as a 
union of the types observed at each path. 
   - I also propose that we allow storing the key paths in an untyped value 
variant separately as a native parquet list, to enable field membership checks 
without having to scan the metadata. In my proposal, the metadata fields are 
also made optional; if the metadata is not present, it is encoded in the value.
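
   As a rough illustration of the first bullet, the sketch below builds the 
recursive "union of observed types per path" from a few sample records. It is 
plain Python, not Spark or Parquet code: the helper names (`type_tag`, 
`merge_into`, `infer_schema`) are hypothetical, only the definition-level names 
that appear in this comment are used, and treating everything that is not a 
string, integer, or object as untyped is an assumption made for brevity.

```python
from typing import Any


def type_tag(value: Any) -> str:
    """Map a Python value to one of the definition-level names used above."""
    if isinstance(value, str):
        return "$typed_value_string"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return "$typed_value_int64"
    if isinstance(value, dict):
        return "$typed_value_object"
    # Assumption: everything else (lists, floats, bools, None) stays untyped.
    return "$untyped_value_variant"


def merge_into(schema: dict, value: Any) -> dict:
    """Record the type observed for `value`, recursing into object fields."""
    tag = type_tag(value)
    if tag == "$typed_value_object":
        fields = schema.setdefault(tag, {})
        for key, child in value.items():
            merge_into(fields.setdefault(key, {}), child)
    else:
        schema.setdefault(tag, None)
    return schema


def infer_schema(records: list) -> dict:
    """Union the types observed at every path across all records."""
    schema: dict = {}
    for record in records:
        merge_into(schema, record)
    return schema


records = [
    {"a": {"f": 1, "e": {"x": "s", "y": 2}}},
    {"a": {"f": 2, "e": [1, 2]}},  # e conflicts: object in one row, list in another
]
schema = infer_schema(records)
```

   In this sketch the conflict on `a.e` only adds a second definition level 
under `e`; the sibling path `a.f` keeps its single typed level.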
   
   The simplest example variant representations, according to my proposal:
   ```
   optional group message { // message: variant (untyped)
       optional group $untyped_value_variant {
           optional binary value;
       }
   }
   ```
   ```
   optional group message { // message: string (typed)
       optional binary $typed_value_string;
   }
   ```
   
   Representation of a nested struct example (with subcolumnarized paths and 
nested type conflicts), according to my proposal:
   ```
   optional group a {
       optional group $typed_value_object {
           optional group b {
               optional group $typed_value_object {
                   optional group c {
                       optional group $typed_value_object {
                           optional group d { // d: string | untyped (value+metadata)
                               optional binary $typed_value_string;
                               optional group $untyped_value_variant {
                                   optional binary value;
                                   // make metadata optional, if not present, it is included in the value
                                   optional binary metadata;
                                   // also allow to optionally store the list of flattened paths in the
                                   // value as parquet array to enable dictionary encoding / bloom
                                   // filters for fast lookup without having to scan the metadata.
                                   optional group metadata_key_paths (LIST) {
                                       repeated group list {
                                           optional binary element;
                                       }
                                   }
                               }
                           }
   
                           optional group e { // e: untyped (value) | object (subcolumnarized paths e.x, e.y)
                               optional group $untyped_value_variant {
                                   optional binary value;
                               }
                               optional group $typed_value_object {
                                   optional group x {
                                       optional binary $typed_value_string;
                                   }
                                   optional group y {
                                       optional int64 $typed_value_int64;
                                   }
                               }
                           }
   
                           optional group f { // f: int64
                               optional int64 $typed_value_int64;
                           }
                       }
                   }
               }        
           }
       }
   }
   ```
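
   To make the `metadata_key_paths` idea concrete, here is a minimal Python 
sketch of a membership check that consults only the stored path list and never 
decodes the value. The helpers (`flatten_key_paths`, `build_row`, `has_path`) 
are hypothetical, JSON bytes stand in for the real variant binary encoding, and 
the dotted path format is an assumption (it would be ambiguous for keys that 
themselves contain dots).

```python
import json
from typing import Any, Iterator


def flatten_key_paths(value: Any, prefix: str = "") -> Iterator[str]:
    """Yield a dotted path for every key found in a nested dict."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from flatten_key_paths(child, path)


def build_row(value: dict) -> dict:
    """Build one untyped row: an opaque value plus its flattened key paths."""
    return {
        # Stand-in only: the real column would hold variant-encoded bytes.
        "value": json.dumps(value).encode("utf-8"),
        "metadata_key_paths": sorted(flatten_key_paths(value)),
    }


def has_path(row: dict, path: str) -> bool:
    """Check field membership using only the path list, not the value."""
    return path in row["metadata_key_paths"]


row = build_row({"user": {"id": 7, "tags": {"vip": True}}})
print(row["metadata_key_paths"])
print(has_path(row, "user.id"), has_path(row, "user.name"))
```

   Because the paths would live in an ordinary list column, a reader could in 
principle rely on dictionary pages or bloom filters to skip row groups in which 
a path never occurs.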
   
   NOTE: To reduce the nesting in cases where a field is only present as a 
single type, a short form could be introduced that allows concatenating the 
definition level into the path name, making the simplest example representation 
even more compact:
   ```
   optional binary message.$typed_value_string;
   ```
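
   A minimal Python sketch of how such a short form could be derived from the 
nested form follows. The dict layout (field name, then definition level, then 
either a physical type string or the sub-fields of an object) and the function 
name `to_short_form` are assumptions made for illustration only.

```python
def to_short_form(fields: dict) -> dict:
    """Fold the definition level into the name for single-type fields."""
    out = {}
    for name, levels in fields.items():
        shortened = {}
        for tag, body in levels.items():
            # Only object levels contain further fields to shorten.
            if tag == "$typed_value_object":
                body = to_short_form(body)
            shortened[tag] = body
        if len(shortened) == 1:
            # Single type observed: concatenate the level into the path name.
            tag, body = next(iter(shortened.items()))
            out[f"{name}.{tag}"] = body
        else:
            # Several types observed: keep the explicit nesting.
            out[name] = shortened
    return out


fields = {
    "message": {"$typed_value_string": "binary"},
    "e": {
        "$untyped_value_variant": "group",
        "$typed_value_object": {
            "x": {"$typed_value_string": "binary"},
            "y": {"$typed_value_int64": "int64"},
        },
    },
}
short = to_short_form(fields)
```

   Fields that carry more than one definition level, such as `e` here, keep 
their nested form, so the short form stays unambiguous.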


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

