Hi everyone,

We'd like to discuss an extension to the supported blob types in the Puffin spec.

Hive-4 uses statistics auto-generation to optimize Iceberg query performance. Column statistics are written to Puffin files per snapshot. The statistics calculated by Hive include histograms, NDV (number of distinct values), min and max values, the number of nulls, the number of true values, column name, and column type [1]. The full list of supported statistics is in [2].

These statistics are stored as a Hive ColumnStatistics object, which is serialized and saved as a blob in Puffin; see [3] for the code. Currently, this object is supported by Hive and partially by Impala [4]. We also plan to incorporate the KLL datasketch for histograms.

As a result, we are looking to add the ColumnStatistics object and the KLL datasketch as standard blob types for Puffin files [5].

Main PR: https://github.com/apache/iceberg/pull/8202
Doc related PR: https://github.com/apache/iceberg/pull/8202

Any feedback would be greatly appreciated.

Regards,
Denys

[1] https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics
[2] https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L854C17-L854C30
[3] https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java#L649
[4] https://github.com/apache/impala/blob/ee069687fcaa06c29404e2220ff577767d905a98/fe/src/main/java/org/apache/impala/catalog/Table.java#L461
[5] https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java