shaeqahmed commented on PR #46831: URL: https://github.com/apache/spark/pull/46831#issuecomment-2156146277
Thanks for the response and feedback @cashmand! Could you please add an example to the document clarifying how the current proposal handles nested structs, and more generally how deeply nested data is expected to be converted from JSON to the shredded variant form? It is not clear how a nested key path with an object value should be encoded, since IIUC there are two or three implicit ways this can be done, each with its own tradeoffs: adding it as a nested variant (assuming this is supported); adding the struct directly as a nested key path within the existing path/paths.*; or adding it as a struct typed_value (the document does not actually state whether a typed_value is allowed to be a struct, or whether typed_values can be nested within typed_values). Elaborating on this would help readers understand how the proposal deals with the different cases of structs of structs that mix structured and unstructured parts.

The real-world use case I have in mind is semi-structured log analytics, particularly on data from upstream sources that contain heterogeneous, loosely typed data. A good example is AWS CloudTrail logs (https://www.databricks.com/blog/2022/06/03/building-etl-pipelines-for-the-cybersecurity-lakehouse-with-delta-live-tables.html), which have variant fields like requestParameters and responseElements whose schema shape is largely relational but determined directly by the AWS service a given log row belongs to (of which there are a few hundred, the cardinality of the eventProvider field). Fields like requestParameters and responseElements also contain arbitrary user input that is completely unstructured; such key paths' data should ideally end up stored in an untyped blob field, while all other key paths should be subcolumnarized for performance in analytical queries.
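To make the ambiguity concrete, here is a rough Python sketch of the split I have in mind. The function and its policy are hypothetical illustration only, not anything from the proposal: leaf paths become typed sub-column candidates, while explicitly listed unstructured paths are kept as opaque JSON blobs.

```python
import json

# Hypothetical illustration (not the spec): shred a nested JSON object into
# dotted key paths. Paths listed in `untyped_paths` stay as opaque JSON
# blobs; every other leaf path becomes a typed sub-column candidate.
def shred(obj, untyped_paths, prefix=""):
    typed, blobs = {}, {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if path in untyped_paths:
            blobs[path] = json.dumps(value)   # untyped residual, stored as-is
        elif isinstance(value, dict):
            t, b = shred(value, untyped_paths, prefix=path + ".")
            typed.update(t)
            blobs.update(b)
        else:
            typed[path] = value               # typed sub-column candidate
    return typed, blobs
```

For example, shredding `{"requestParameters": {"bucket": "logs", "tags": {"user": "free-form"}}}` with `untyped_paths={"requestParameters.tags"}` yields one typed path (`requestParameters.bucket`) and one blob (`requestParameters.tags`). The open question is which of the two or three encodings above the spec intends for the nested object case.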
The current proposal makes it difficult to encode this data in a subcolumnarized way, because there is no single global schema that can be inferred by reading the first N rows of a large batch (e.g. a file). I agree that having more than one typed field for a given key path per smaller batch of rows (e.g. ~10,000) is not necessary, but the reason for adding this flexibility to the variant representation is that the current proposal does not allow taking a series of row batches representing different locally discovered schemas and unioning them into a file containing a large batch of rows (256 MB-10 GB) efficiently and without type conflicts. The idea is that the writer should group the rows into smaller batches and sort them so that similarly shaped data is placed close together in the file.

My proposal is inspired by state-of-the-art research done for the Umbra database in the JSON Tiles paper (https://db.in.tum.de/~durner/papers/json-tiles-sigmod21.pdf), which describes a columnar format and approach designed to exploit implicit schemas in semi-structured data, and by popular real-world implementations of Variant such as the one in Apache Doris (https://github.com/apache/doris/issues/26225). In our case for open tables, Parquet v1/v2 has some limitations that must be kept in mind, such as the extra overhead associated with wide tables / too many definition levels (> tens of thousands of columns) and the inability to have a separate subschema per row group, which can result in sparse null columns. However, it is still possible to take advantage of subcolumnarization on heterogeneous data if the data is laid out correctly so as to maximize RLE on null arrays, using a compact representation that doesn't require an extra definition level (e.g. x.typed_value in the current proposal) for value paths that have no conflicts in a file.
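As a hedged sketch of the writer-side batching idea (the function names and the fingerprinting policy are my own, loosely following the JSON Tiles approach, not anything in the proposal): fingerprint each row by its set of (key path, type) pairs, then sort rows so that identically shaped rows land in adjacent batches, which maximizes null-run RLE per sub-column.

```python
# Hypothetical sketch: cluster rows by their locally discovered schema so
# similarly shaped data ends up adjacent in the file (JSON Tiles-style).
def schema_fingerprint(obj, prefix=""):
    """Return a sortable tuple of (dotted path, type name) leaf pairs."""
    parts = []
    for key, value in sorted(obj.items()):
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            parts.extend(schema_fingerprint(value, prefix=path + "."))
        else:
            parts.append((path, type(value).__name__))
    return tuple(parts)

def cluster_rows(rows, batch_size=10_000):
    """Sort rows by schema shape, then slice into fixed-size batches."""
    ordered = sorted(rows, key=schema_fingerprint)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

With this kind of layout, each batch tends to have long runs of non-null values for the paths it actually contains and long null runs elsewhere, which is exactly the case where Parquet's null encoding stays cheap.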
