deniskuzZ commented on PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#issuecomment-4490253697

   > @deniskuzZ coming back to this question. Can you help me understand why we 
would want to store such sketches at the file level when we already have them 
in `NDVSketchUtil`? This PR is intentionally scoped to field-level statistics 
and replaces what we previously stored in a few Maps, so I'm not sure we want 
to store sketches of any kind in content stats.
   > 
   
   We do not want to store these sketches at the file level. The more natural 
fit would be partition-level aggregates, potentially persisted separately (e.g. 
in Parquet or another compact structure). In Hive, several optimizer features 
already rely on partition-level statistics such as NDVs and histograms, so 
there is practical value in exposing richer partition stats.
   
   That said, I am not convinced Puffin is the best fit for this use case. As 
far as I understand, it lacks a secondary index, and we have previously seen 
scalability concerns when the number of blobs becomes very large (e.g. 100k 
partitions × 100 columns). Note that there was also an idea to store column 
statistics in Puffin on a per-partition basis, with references from the 
existing partition statistics files.
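   
   To put a number on that scale, here is a rough sketch. It assumes one 
statistics blob per (partition, column) pair and models the blob listing as a 
flat Python list; `blob_count` and `find_blob` are hypothetical helpers for 
illustration, not Puffin or Iceberg APIs.

```python
# Hypothetical model: one statistics blob per (partition, column) pair.
NUM_PARTITIONS = 100_000
NUM_COLUMNS = 100


def blob_count(num_partitions: int, num_columns: int) -> int:
    """Number of blobs if every partition stores one blob per column."""
    return num_partitions * num_columns


def find_blob(blob_entries, partition, column):
    """Linear scan over a flat blob listing (assumes no secondary index).

    Returns the position of the matching entry, or None if it is absent.
    """
    for position, entry in enumerate(blob_entries):
        if entry == (partition, column):
            return position
    return None


if __name__ == "__main__":
    print(blob_count(NUM_PARTITIONS, NUM_COLUMNS))
```

   Under these assumptions the listing holds 10 million entries, and without 
an index a single lookup may have to scan all of them.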
   
   Also, well-designed queries should never need to load statistics for all 
partitions and all columns at once. In practice, queries typically touch only a 
relatively small subset of partitions and referenced columns, so metadata 
access should remain proportional to the actual query scope rather than the 
total table size.
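   
   A minimal sketch of what "proportional to the query scope" could look like, 
assuming statistics are addressable by a (partition, column) key. The store 
layout and `load_scoped_stats` are hypothetical and only illustrate the access 
pattern, not an existing Iceberg or Hive API.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical key layout: (partition, column).
StatsKey = Tuple[str, str]


def load_scoped_stats(
    store: Dict[StatsKey, dict],
    partitions: Iterable[str],
    columns: Iterable[str],
) -> Dict[StatsKey, dict]:
    """Fetch stats only for the partitions and columns a query references.

    The number of lookups is len(partitions) * len(columns), independent of
    how many entries the store holds in total.
    """
    wanted_columns = list(columns)
    scoped = {}
    for partition in partitions:
        for column in wanted_columns:
            key = (partition, column)
            if key in store:
                scoped[key] = store[key]
    return scoped


if __name__ == "__main__":
    store = {
        ("2024-01-01", "id"): {"ndv": 10},
        ("2024-01-01", "name"): {"ndv": 7},
        ("2024-01-02", "id"): {"ndv": 12},
    }
    print(load_scoped_stats(store, ["2024-01-01"], ["id"]))
```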
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

