wjones127 commented on issue #42069:
URL: https://github.com/apache/arrow/issues/42069#issuecomment-2166108766
I talked to developers at Databricks who worked on adding this feature to
Spark and Delta Lake. Here are a few notes from that conversation.
- This is being added as a data type in Spark and Delta Lake. They intend to
add this data type to Iceberg as well.
- They called it the “Open Variant Data Type” with the intention that this
data type would proliferate to other systems.
- They have a standalone Java library that implements the data type.
That’s the Java code at
https://github.com/apache/spark/tree/master/common/variant
- The data is stored as two binary fields: one to hold a string dictionary,
the other to hold the binary representation of the values. It is generally kept
as binary data in memory, but engines are free to manipulate it as they wish.
- The eventual plan is to support record shredding, where fields that have
dense values will be split out into their own columns. This allows row group /
page pruning to happen with normal Parquet statistics / indices.
- Record shredding will have to be the same per Parquet file, but could
be different between files.
- Once in memory, variants will be either recombined into the two binary
columns or else have been selected back into their fully shredded forms. This
is because most engines will require a common schema across files. The good
news is that, by the time the data might be exported into Arrow, we
wouldn’t have to worry about the shredding.
- Performance justification: JSON and BSON are not designed for OLAP queries.
- The canonical pathological case is where you are extracting the last
field in a large object. JSON has to do `O(n)` string comparisons; the variant
form replaces them with integer comparisons.
- The main performance optimization is that object keys (and other
common strings) are pulled out into a common string dictionary. This reduces
the size, but also replaces all the string comparisons needed in field lookups
with integer comparisons.
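
To make the string dictionary point concrete, here is a minimal Python sketch of
the idea. It is not the actual Variant binary layout (the real format packs the
dictionary and the values into two binary buffers); it uses plain Python lists,
and the names `encode_objects`, `field_id`, and `get_field` are made up for
illustration.

```python
from bisect import bisect_left


def encode_objects(records):
    """Pull all object keys into one sorted dictionary shared by every record.

    Returns (keys, encoded), where each encoded record is a list of
    (key_id, value) pairs sorted by key_id.
    """
    keys = sorted({k for rec in records for k in rec})
    key_to_id = {k: i for i, k in enumerate(keys)}
    encoded = [
        sorted(((key_to_id[k], v) for k, v in rec.items()), key=lambda p: p[0])
        for rec in records
    ]
    return keys, encoded


def field_id(keys, name):
    """Resolve a field name to its integer id: the only step that compares strings."""
    i = bisect_left(keys, name)
    return i if i < len(keys) and keys[i] == name else None


def get_field(encoded_record, fid):
    """Find a field inside one record using integer comparisons only."""
    for kid, value in encoded_record:
        if kid == fid:
            return value
    return None


records = [{"a": 1, "m": [1, 2], "z": "last"}, {"b": 2, "z": "other"}]
keys, encoded = encode_objects(records)

# Resolve the name once, then reuse the integer id for every row.
z_id = field_id(keys, "z")
z_values = [get_field(rec, z_id) for rec in encoded]
```

The name is resolved against the dictionary once per query rather than once per
row, so extracting the last field of a large object costs integer comparisons
per row instead of string comparisons. Keys also appear once in the dictionary
instead of once per record, which is where the size reduction comes from.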