rdblue commented on code in PR #8202: URL: https://github.com/apache/iceberg/pull/8202#discussion_r1941610970
########## format/puffin-spec.md: ########## @@ -181,6 +181,23 @@ for Puffin v1. [roaring-bitmap-portable-serialization]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations [roaring-bitmap-general-layout]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout +#### `hive-column-statistics-obj` blob type + +A serialized form of Hive ColumnStatsObject. + +The ColumnStatsObject supports Histograms, NDV, Min and Max values, Number of nulls, Number of trues, column name, type. +A full list of supported statistics is listed in the table here: +[ColumnStatistics](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics) Review Comment: I don't see much value in adding this. NDV sketches are already supported using Theta sketches so this would duplicate the purpose of an existing sketch. Is there sufficient value to justify the difference? This doesn't provide enough context to tell. In addition, the Iceberg manifest format already covers value count, lower bounds, upper bounds, number of nulls, and number of NaNs. The partition statistics files provide a way to aggregate those beyond the file level, and we use snapshot summaries for table-level stats. I'm not sure what value this would provide. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
