rdblue commented on code in PR #8202:
URL: https://github.com/apache/iceberg/pull/8202#discussion_r1941610970


##########
format/puffin-spec.md:
##########
@@ -181,6 +181,23 @@ for Puffin v1.
 [roaring-bitmap-portable-serialization]: 
https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations
 [roaring-bitmap-general-layout]: 
https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout
 
+#### `hive-column-statistics-obj` blob type
+
+A serialized form of Hive ColumnStatsObject.
+
+The ColumnStatsObject supports Histograms, NDV, Min and Max values, Number of 
nulls, Number of trues, column name, type.
+A full list of supported statistics is listed in the table here:
+[ColumnStatistics](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics)

Review Comment:
   I don't see much value in adding this.
   
   NDV sketches are already supported using Theta sketches so this would 
duplicate the purpose of an existing sketch. Is there sufficient value to 
justify the difference? This doesn't provide enough context to tell.
   
   In addition, the Iceberg manifest format already covers value count, lower 
bounds, upper bounds, number of nulls, and number of NaNs. The partition 
statistics files provide a way to aggregate those beyond the file level, and we 
use snapshot summaries for table-level stats. I'm not sure what value this 
would provide.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to