Hi Everyone,

We'd like to discuss extending the set of supported blob types in the puffin spec.

Hive-4 uses statistics auto-generation to optimize Iceberg query
performance. Column statistics are written to puffin files per
snapshot.

The statistics calculated by Hive include histograms, NDV (Number of
Distinct Values), Min and Max values, the number of nulls, the number
of true values, column name, and column type [1].
The full list of supported statistics is here: [2]

These statistics are stored as a Hive ColumnStatistics object, which
is serialized and saved as a blob in puffin. You can refer to the code
here for more information: [3]
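
To make the idea concrete, here is a minimal, self-contained sketch of
the "serialize a stats object and attach it as a blob" flow. It is not
the Hive code linked in [3]. ColumnStats and StatsBlob are hypothetical
stand-ins for Hive's ColumnStatistics and for a puffin blob entry, the
blob type name "hive-column-statistics-v1" is a placeholder rather than
a proposed name, and plain Java serialization is assumed only for
illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

public class ColumnStatsBlobSketch {

  // Hypothetical stand-in for Hive's ColumnStatistics; fields are illustrative only.
  static final class ColumnStats implements Serializable {
    private static final long serialVersionUID = 1L;
    final String columnName;
    final String columnType;
    final long numNulls;
    final long ndv;

    ColumnStats(String columnName, String columnType, long numNulls, long ndv) {
      this.columnName = columnName;
      this.columnType = columnType;
      this.numNulls = numNulls;
      this.ndv = ndv;
    }
  }

  // Hypothetical blob descriptor carrying a subset of the per-blob metadata
  // that puffin records (type, field IDs, snapshot ID) plus the payload bytes.
  static final class StatsBlob {
    final String type;
    final List<Integer> fieldIds;
    final long snapshotId;
    final ByteBuffer payload;

    StatsBlob(String type, List<Integer> fieldIds, long snapshotId, ByteBuffer payload) {
      this.type = type;
      this.fieldIds = fieldIds;
      this.snapshotId = snapshotId;
      this.payload = payload;
    }
  }

  // Placeholder type name, assumed for this sketch only.
  static final String BLOB_TYPE = "hive-column-statistics-v1";

  // Turn the stats object into opaque bytes using standard Java serialization.
  static byte[] serialize(ColumnStats stats) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(stats);
    }
    return bytes.toByteArray();
  }

  // Read the stats object back from the blob payload bytes.
  static ColumnStats deserialize(byte[] data) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return (ColumnStats) in.readObject();
    }
  }

  // Wrap the serialized stats for one column (by field ID) of one snapshot.
  static StatsBlob toBlob(ColumnStats stats, int fieldId, long snapshotId) throws IOException {
    return new StatsBlob(
        BLOB_TYPE,
        Collections.singletonList(fieldId),
        snapshotId,
        ByteBuffer.wrap(serialize(stats)));
  }

  public static void main(String[] args) throws Exception {
    ColumnStats stats = new ColumnStats("id", "bigint", 3L, 42L);
    StatsBlob blob = toBlob(stats, 1, 1234L);

    // A reader that recognizes the blob type can decode the payload again.
    ColumnStats decoded = deserialize(blob.payload.array());
    System.out.println(blob.type + " -> " + decoded.columnName + " ndv=" + decoded.ndv);
  }
}
```

A reader that does not recognize the blob type can only treat the
payload as opaque bytes, which is why a standardized type name matters
for engines other than the writer.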

Currently, this object is supported by Hive and partially by Impala as
well: [4]. We also plan to incorporate the KLL datasketch for
histograms.

As a result, we would like to add the ColumnStatistics object and the
KLL datasketch as standard blob types for puffin files [5].

Main PR:
https://github.com/apache/iceberg/pull/8202

Doc related PR:
https://github.com/apache/iceberg/pull/8202

Any feedback would be greatly appreciated.

Regards,
Denys

[1] 
https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics
[2] 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L854C17-L854C30
[3] 
https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java#L649
[4] 
https://github.com/apache/impala/blob/ee069687fcaa06c29404e2220ff577767d905a98/fe/src/main/java/org/apache/impala/catalog/Table.java#L461
[5] 
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java
