wjones127 commented on issue #42069:
URL: https://github.com/apache/arrow/issues/42069#issuecomment-2166108766

   I talked to developers at Databricks who worked on adding this feature to 
Spark and Delta Lake. Here are a few notes from that conversation.
   
   - This is being added as a data type in Spark and Delta Lake. They intend to 
add this data type to Iceberg as well.
   - They called it the “Open Variant Data Type” with the intention that this 
data type would proliferate to other systems.
       - They have a standalone Java library that implements the data type. 
That’s the Java code at 
https://github.com/apache/spark/tree/master/common/variant
   - The data is stored as two binary fields: one to hold a string dictionary, 
the other to hold the binary representation of the values. It is generally kept 
as binary data in memory, but engines are free to manipulate it as they wish.
   - The eventual plan is to support record shredding, where fields that have 
dense values will be split out into their own columns. This allows row group / 
page pruning to happen with normal Parquet statistics / indices.
       - The shredding layout will have to be the same within a Parquet file, 
but could differ between files.
       - Once in memory, variants will either be recombined into the two binary 
columns or be selected back out in their fully shredded forms. This is because 
most engines will require a common schema across files. The good news is that 
by the time the data might be exported to Arrow, we wouldn’t have to worry 
about the shredding.
   - Performance justification: JSON and BSON are not designed for OLAP queries.
       - The canonical pathological case is extracting the last field in a 
large object. JSON has to do `O(n)` string comparisons; the variant form 
replaces them with integer comparisons.
       - The main performance optimization is that object keys (and other 
common strings) are pulled out into a common string dictionary. This reduces 
the size, but also replaces all the string comparisons needed in field lookups 
with integer comparisons.
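
   To make the dictionary idea concrete, here is a minimal Python sketch. It is 
a toy model, not the actual Variant binary encoding: the function names 
(`build_dictionary`, `encode`, `resolve`, `get_field`) and the in-memory layout 
are assumptions made up for illustration. It only shows how a field lookup can 
turn into one key resolution against the dictionary followed by integer 
comparisons.

```python
from bisect import bisect_left


def build_dictionary(records):
    """Collect every distinct key into one sorted string dictionary."""
    return sorted({key for record in records for key in record})


def encode(record, dictionary):
    """Replace string keys with integer ids; fields are kept sorted by id."""
    key_to_id = {key: i for i, key in enumerate(dictionary)}
    pairs = sorted((key_to_id[key], key) for key in record)
    ids = [field_id for field_id, _ in pairs]
    values = [record[key] for _, key in pairs]
    return ids, values


def resolve(dictionary, key):
    """Map a key string to its id (binary search), or None if it is absent."""
    i = bisect_left(dictionary, key)
    return i if i < len(dictionary) and dictionary[i] == key else None


def get_field(encoded, field_id):
    """Find a field by id using integer comparisons only."""
    ids, values = encoded
    pos = bisect_left(ids, field_id)
    if pos < len(ids) and ids[pos] == field_id:
        return values[pos]
    return None


records = [
    {"user": "a", "ts": 1, "zip": "94107"},
    {"ts": 2, "zip": "10001"},
]
dictionary = build_dictionary(records)
encoded = [encode(record, dictionary) for record in records]

zip_id = resolve(dictionary, "zip")  # string comparisons happen only here
zips = [get_field(e, zip_id) for e in encoded]
```

   In this sketch the string comparisons are confined to `resolve`; each 
per-object lookup in `get_field` compares integers only.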

