shaeqahmed commented on PR #46831:
URL: https://github.com/apache/spark/pull/46831#issuecomment-2148582542
I read through the proposal and have some thoughts:
It would be really useful to add to this PR a list of the ways nested structs
(struct-of-structs) and array-of-structs can be represented, because this is not
clarified. From my understanding, a nested object path value can be represented
in one of three ways: as a full "typed_value", as a variant within a variant
(containing a required value/metadata, and optional paths), or by directly
nesting into the paths (a.b) without introducing an intermediate definition level.
It looks like the current proposal for columnarization works well if the data
in a file mostly has one global structure. However, for heterogeneous data
sources, or data sources with fields that occasionally alternate between more
than one type of value, there seem to be limitations, such as potentially
needing to store the value in the top-level value field bag if there is a
single type conflict at some deeply nested path.
---
I would like to propose an alternate way to encode the data that allows for
more flexibility in representing nested structures and also allows for more
efficient encoding of the data.
- I propose that, to simplify the design, we require every path part to be
immediately followed by a definition level
($typed_value_*/$untyped_value_variant) that indicates the type of the value at
that path, allowing for a fully recursive definition of the variant type as a
union of the types observed at each path.
- I also propose that we allow storing the key paths in an untyped value
variant separately as a native parquet list to enable field membership checks
without having to scan the metadata. In my proposal, the metadata fields are
also made optional; if they are not present, the metadata is encoded in
the value.
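
To make the "union of the types observed at each path" idea concrete, here is a rough Python sketch of how such a recursive schema could be inferred from sample records. It is only an illustration: the helper names (`type_tag`, `merge_value`, `infer_schema`), the `$typed_value_boolean` tag, and the choice to treat arrays, floats, and nulls as untyped are my assumptions, not part of the PR or of any Spark/Parquet API.

```python
from typing import Any

# Definition-level names follow the naming used in this comment (hypothetical).
UNTYPED = "$untyped_value_variant"
OBJECT = "$typed_value_object"


def type_tag(value: Any) -> str:
    """Map a Python value to a definition-level name (assumed mapping)."""
    if isinstance(value, dict):
        return OBJECT
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "$typed_value_boolean"
    if isinstance(value, int):
        return "$typed_value_int64"
    if isinstance(value, str):
        return "$typed_value_string"
    return UNTYPED  # assumption: everything else stays an opaque variant


def merge_value(schema: dict, value: Any) -> dict:
    """Fold one observed value into the union schema for a single path."""
    tag = type_tag(value)
    if tag == OBJECT:
        fields = schema.setdefault(OBJECT, {})
        for key, child in value.items():
            # Every path part is followed by its own set of definition levels.
            fields[key] = merge_value(fields.get(key, {}), child)
    else:
        schema.setdefault(tag, True)
    return schema


def infer_schema(records: list) -> dict:
    """Union the definition levels observed at every path across records."""
    schema: dict = {}
    for record in records:
        merge_value(schema, record)
    return schema


if __name__ == "__main__":
    sample = [
        {"d": "x", "e": [1, 2], "f": 1},
        {"d": [1], "e": {"x": "s", "y": 2}, "f": 2},
    ]
    print(infer_schema(sample))
```

With these two sample records, `d` ends up with both a string and an untyped definition level, while `f` only has `$typed_value_int64`, so a type conflict at one path does not affect its siblings.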
Simplest variant example representations, according to my proposal:
```
optional group message { // message: variant (untyped)
  optional group $untyped_value_variant {
    optional binary value;
  }
}
```
```
optional group message { // message: string (typed)
  optional binary $typed_value_string;
}
```
Nested struct example (w/ subcolumnarized paths, nested type conflicts)
representation, according to my proposal:
```
optional group a {
  optional group $typed_value_object {
    optional group b {
      optional group $typed_value_object {
        optional group c {
          optional group $typed_value_object {
            optional group d { // d: string | untyped (value+metadata)
              optional binary $typed_value_string;
              optional group $untyped_value_variant {
                optional binary value;
                // make metadata optional; if not present, it is included in the value
                optional binary metadata;
                // also allow optionally storing the list of flattened paths in the
                // value as a parquet array, to enable dictionary encoding / bloom
                // filters for fast lookup without having to scan the metadata
                optional group metadata_key_paths (LIST) {
                  repeated group list {
                    optional binary element;
                  }
                }
              }
            }
            optional group e { // e: untyped (value) | object (subcolumnarized paths e.x, e.y)
              optional group $untyped_value_variant {
                optional binary value;
              }
              optional group $typed_value_object {
                optional group x {
                  optional binary $typed_value_string;
                }
                optional group y {
                  optional int64 $typed_value_int64;
                }
              }
            }
            optional group f { // f: int64
              optional int64 $typed_value_int64;
            }
          }
        }
      }
    }
  }
}
```
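
For the `metadata_key_paths` list, a writer would need to flatten the keys of an untyped value into path strings. A minimal Python sketch of that step follows, assuming dotted paths, keys that contain no dots themselves, and no descent into arrays; the `key_paths` helper is hypothetical and only meant to show what the list could contain.

```python
def key_paths(value, prefix=""):
    """Return the sorted, de-duplicated dotted key paths of a nested dict.

    Assumptions: keys contain no '.', and arrays are not descended into.
    """
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths.update(key_paths(child, path))
    return sorted(paths)


if __name__ == "__main__":
    untyped_value = {"x": {"y": 1, "z": {"w": 2}}, "k": 3}
    paths = key_paths(untyped_value)
    print(paths)
    # A field membership check then only needs the path list, not the metadata.
    print("x.z" in paths)
```

Storing these strings as a native list column is what would let a reader lean on dictionary encoding or bloom filters for the membership check instead of decoding the binary value.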
NOTE: To reduce the nesting in cases where a field is only present as a
single type, a short form could be introduced that allows concatenating the
definition level into the path name, making the simplest example representation
even more compact:
```
optional binary message.$typed_value_string;
```
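
As a sketch of how the short form might be derived mechanically, the Python snippet below folds a lone primitive definition level into its field name and leaves multi-type fields nested. The dictionary layout (`field -> {definition_level: physical_type_or_subfields}`) and the `short_form` helper are assumptions made for illustration only.

```python
def short_form(fields):
    """Fold a lone primitive definition level into its field name.

    `fields` maps a field name to {definition_level: body}, where body is a
    physical type string for primitives, or a nested dict otherwise.
    """
    out = {}
    for name, levels in fields.items():
        if len(levels) == 1:
            (level, body), = levels.items()
            if isinstance(body, str):  # single primitive type: use the short form
                out[f"{name}.{level}"] = body
                continue
        # Otherwise keep the nesting, recursing into object definition levels.
        out[name] = {
            level: short_form(body) if level == "$typed_value_object" else body
            for level, body in levels.items()
        }
    return out


if __name__ == "__main__":
    print(short_form({"message": {"$typed_value_string": "binary"}}))
```

Fields such as `e` in the nested example, which carry more than one definition level, would keep their full nested form under this rule.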