shaeqahmed commented on PR #46831:
URL: https://github.com/apache/spark/pull/46831#issuecomment-2156146277

   Thanks for the response and feedback @cashmand!
   
   Can you please include an example in the document that clarifies how the
current proposal handles nested structs, and more generally how deeply nested
data is expected to be converted from JSON to the shredded variant form? It is
not clear how a nested key path with an object value should be encoded. As far
as I understand, there are two or three implicit ways this could be done, each
with its own tradeoffs: adding it as a nested variant (assuming this is
supported); adding the struct's fields directly as nested key paths within the
existing path/paths.*; or adding it as a struct typed_value (the document does
not actually state whether a typed_value is allowed to be a struct, or whether
typed values can be nested within typed values). Elaborating on this would help
readers understand how the proposal deals with the different cases of structs
of structs that mix structured and unstructured parts.
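   To make the ambiguity concrete, here is a minimal sketch of the two or three
candidate encodings for a nested object value. The field names (`variant_value`,
`typed_value`) echo the discussion above, but the exact layouts shown are my
assumptions for illustration, not taken from the spec:

```python
# One input row whose key "a" holds a nested object.
row = {"a": {"b": 1, "c": {"d": "x"}}}

# Option 1 (assumed layout): keep the whole object as an opaque,
# un-shredded variant blob.
opt1 = {"a": {"variant_value": row["a"]}}

# Option 2 (assumed layout): flatten nested keys into dotted key paths
# within the existing path/paths.* structure.
def flatten(obj, prefix=""):
    out = {}
    for k, v in obj.items():
        path = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, path))
        else:
            out[path] = v
    return out

opt2 = flatten(row)

# Option 3 (assumed layout): a struct typed_value whose fields may themselves
# be shredded recursively; whether this nesting is allowed is exactly the
# open question raised above.
opt3 = {"a": {"typed_value": {"b": {"typed_value": 1},
                              "c": {"typed_value": {"d": {"typed_value": "x"}}}}}}

print(opt2)  # {'a.b': 1, 'a.c.d': 'x'}
```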
   
   ---
   
   The real-world use case I have in mind is semi-structured log analytics,
particularly on data from upstream sources containing heterogeneous, loosely
typed records. A good example is AWS CloudTrail logs
(https://www.databricks.com/blog/2022/06/03/building-etl-pipelines-for-the-cybersecurity-lakehouse-with-delta-live-tables.html),
which have variant fields like requestParameters and responseElements whose
schema shape is largely relational but determined directly by the AWS service
a given log row belongs to (of which there are a few hundred, i.e. the
cardinality of the eventProvider field). Fields like requestParameters and
responseElements also contain arbitrary user input that is completely
unstructured; those key paths should ideally end up stored in an untyped blob
field, while all other key paths should be subcolumnarized for performance in
analytical queries. The current proposal makes it difficult to encode this
data in a subcolumnarized way because there is no single global schema that
can be inferred by reading the first N rows of a large batch (e.g. a file).
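   A toy sketch of why sampling the first N rows cannot yield one global
schema for such data. The service names and field shapes below are invented
for illustration; real CloudTrail payloads are far more varied:

```python
# Three rows, each with a requestParameters shape dictated by its service.
rows = [
    {"eventProvider": "s3.amazonaws.com",
     "requestParameters": {"bucketName": "b", "key": "k"}},
    {"eventProvider": "ec2.amazonaws.com",
     "requestParameters": {"instancesSet": {"items": [{"instanceId": "i-1"}]}}},
    {"eventProvider": "iam.amazonaws.com",
     "requestParameters": {"userName": "u", "tags": [{"k": "team", "v": "sec"}]}},
]

def top_level_shape(obj):
    # Crude shape fingerprint: sorted (key, type) pairs at the top level.
    return tuple(sorted((k, type(v).__name__) for k, v in obj.items()))

shapes = {r["eventProvider"]: top_level_shape(r["requestParameters"])
          for r in rows}

# Every provider implies a different local schema; the union across hundreds
# of services is too wide and sparse to fix up front from a row sample.
print(len(set(shapes.values())))  # 3
```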
   
   I agree that having more than one typed field for a given key path within a
smaller batch of rows (e.g. ~10,000) is not necessary, but the reason for
adding this flexibility to the variant representation is that the current
proposal does not allow a series of row batches with different locally
discovered schemas to be unioned together into a file containing a large batch
of rows (256MB-10GB) efficiently and without type conflicts. The idea is that
the writer should group the rows into smaller batches and sort them so that
similarly shaped data is placed close together in the file.
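   The writer strategy described above can be sketched as follows: derive a
schema "signature" per row, then sort by signature so that similarly shaped
rows land in the same small batch, each batch shreddable with a single
conflict-free local schema. The signature function here is my own simplified
fingerprint, not something the proposal defines:

```python
from itertools import groupby

def signature(obj, prefix=""):
    # Set of (key_path, scalar_type) pairs; a crude local-schema fingerprint.
    sig = set()
    for k, v in sorted(obj.items()):
        path = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            sig |= signature(v, path)
        else:
            sig.add((path, type(v).__name__))
    return frozenset(sig)

rows = [
    {"a": 1, "b": "x"},
    {"a": 2, "c": {"d": True}},
    {"a": 3, "b": "y"},
    {"a": 4, "c": {"d": False}},
]

# Sort so rows with identical shapes become adjacent, then batch by shape.
rows.sort(key=lambda r: sorted(signature(r)))
batches = [list(g) for _, g in groupby(rows, key=signature)]

print([len(b) for b in batches])  # [2, 2]
```

A production writer would additionally cap batch size (e.g. ~10,000 rows) and
fall back to the untyped variant encoding for paths that still conflict within
a batch.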
   
   My proposal is inspired by some of the state-of-the-art research done for
the Umbra database in the JSON Tiles paper
(https://db.in.tum.de/~durner/papers/json-tiles-sigmod21.pdf), which describes
a columnar format and approach designed to exploit implicit schemas in
semi-structured data, as well as by popular real-world implementations of
Variant such as the one in Apache Doris Lakehouse
(https://github.com/apache/doris/issues/26225).
   
   In our case for open tables, Parquet v1/v2 has some limitations that must
be kept in mind, such as the extra overhead associated with very wide tables
and too many definition levels (more than tens of thousands of columns), and
the inability to have a separate subschema per row group, which can result in
sparse null columns. However, it is still possible to take advantage of
subcolumnarization on heterogeneous data if the data is laid out so as to
maximize RLE on null arrays, and by using a compact representation that
doesn't require an extra definition level (e.g. x.typed_value in the current
proposal) for value paths that have no conflicts in a file.
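   A small illustration of the RLE point: counting runs in a column's
null/non-null bitmap (fewer runs compress better under Parquet's RLE-encoded
definition levels). The row layouts below are illustrative, not Parquet
internals:

```python
def rle_runs(values):
    # Number of runs in the null/non-null sequence; fewer runs => better RLE.
    runs, prev = 0, object()
    for v in values:
        cur = v is None
        if cur != prev:
            runs += 1
            prev = cur
    return runs

# A column present in only half the rows, under two file layouts:
interleaved = [1, None, 2, None, 3, None, 4, None]  # shapes mixed together
clustered   = [1, 2, 3, 4, None, None, None, None]  # similar shapes adjacent

print(rle_runs(interleaved), rle_runs(clustered))  # 8 2
```

Sorting rows so that similar shapes are adjacent turns a pathological
alternating bitmap into two long runs, which is the layout property the
paragraph above relies on.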
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

