[ https://issues.apache.org/jira/browse/HIVE-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HIVE-29183:
----------------------------------
    Labels: pull-request-available  (was: )

> Integrating Variant Type into Hive
> ----------------------------------
>
>                 Key: HIVE-29183
>                 URL: https://issues.apache.org/jira/browse/HIVE-29183
>             Project: Hive
>          Issue Type: New Feature
>          Components: Hive, Iceberg integration, SQL
>            Reporter: Denys Kuzmenko
>            Priority: Major
>              Labels: pull-request-available
>
> A variant is a value that stores semi-structured data. The structure and data 
> types in a variant are not necessarily consistent across rows in a table or 
> data file. The variant type and binary encoding are defined in the Parquet 
> project, with support currently available for V1 of the encoding. Iceberg 
> adds support for Variant in v3.
> Variants are similar to JSON with a wider set of primitive values including 
> date, timestamp, timestamptz, binary, and decimals.
> Variant values may contain nested types:
> * An array is an ordered collection of variant values.
> * An object is a collection of fields, each consisting of a string key and a 
> variant value.
> As a semi-structured type, there are important differences between variant 
> and Iceberg's other types:
> * Variant arrays are similar to lists, but may contain any variant value 
> rather than a fixed element type.
> * Variant objects are similar to structs, but may contain a variable set of 
> fields identified by name, and each field value may be any variant value 
> rather than a fixed field type.
> Variant data types allow for the efficient binary encoding of dynamic 
> semi-structured data such as JSON, Avro, Parquet, etc. By encoding 
> semi-structured data as a variant column, we retain the flexibility of the 
> source data, while allowing query engines to more efficiently operate on the 
> data.
> With Variant support, such data can be stored internally in an efficient 
> binary representation for better performance. Without it, we have to parse 
> the data in its original format, which is inefficient.
> This will allow the following use cases:
> * Create an Iceberg table with a Variant column:
>     CREATE TABLE IF NOT EXISTS car_sales(record Variant);
> * Insert semi-structured data into the Variant column:
>     INSERT INTO car_sales SELECT PARSE_JSON(<json_string>);
> * Query against the semi-structured data:
>     SELECT VARIANT_GET(record, '$.dealer.ship', 'string') FROM car_sales;
> Variant Binary Encoding:
> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> Iceberg's Variant type proposal:
> https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8
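
For readers unfamiliar with the query pattern in the description, the following is a rough conceptual model of what a path lookup over variant values does. It is only a sketch in plain Python, not Hive's or Iceberg's implementation: the helper names parse_json and variant_get merely mirror the SQL functions quoted above, the sample rows are invented, and the choices to return None for a missing path and to support only object field steps (no array indexes) are assumptions of this sketch.

```python
import json
from typing import Any, Optional


def parse_json(text: str) -> Any:
    """Hypothetical stand-in for PARSE_JSON: decode JSON text into nested values."""
    return json.loads(text)


def variant_get(value: Any, path: str, as_type: str) -> Optional[Any]:
    """Hypothetical stand-in for VARIANT_GET.

    Follows a '$.a.b' style path through nested objects, then casts the result.
    Assumption of this sketch: a missing field or a null value yields None.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = value
    # Drop the leading '$' and walk one object field per path step.
    for key in filter(None, path[1:].split(".")):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    if current is None:
        return None
    casts = {"string": str, "int": int, "double": float}
    return casts[as_type](current)


# Rows do not share a structure: 'ship' is a string in one row, a number in
# another, and absent in the third. The array in the second row mixes types.
rows = [
    parse_json('{"dealer": {"ship": "Valley View Auto Sales"}, "price": 21500}'),
    parse_json('{"dealer": {"ship": 42}, "tags": ["used", 2019, null]}'),
    parse_json('{"customer": "anonymous"}'),
]

results = [variant_get(row, "$.dealer.ship", "string") for row in rows]
print(results)
```

The point of the sketch is that one column can hold differently shaped values per row, and the requested type is applied at read time rather than fixed in the table schema. A real engine would perform the lookup against the binary variant encoding instead of re-parsing JSON text.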



--
This message was sent by Atlassian Jira
(v8.20.10#820010)