deniskuzZ commented on PR #14234: URL: https://github.com/apache/iceberg/pull/14234#issuecomment-4490253697
> @deniskuzZ coming back to this question. Can you help me understand why we would want to store such sketches at the file level when we already have them in `NDVSketchUtil`? This PR is intentionally scoped to field-level statistics and replaces what we previously stored in a few Maps, so I'm not sure we want to store sketches of any kind in content stats.

We do not want to store these sketches at the file level. The more natural fit would be partition-level aggregates, potentially persisted separately (e.g. in Parquet or another compact structure). In Hive, several optimizer features already rely on partition-level statistics such as NDVs and histograms, so there is practical value in exposing richer partition stats.

That said, I am not convinced Puffin is the best fit for this use case. As far as I understand, it lacks a secondary index, and we have previously seen scalability concerns when the number of blobs becomes very large (e.g. 100k partitions × 100 columns). Note that there was also an idea to store column statistics in Puffin on a per-partition basis, with references from the existing partition statistics files.

Also, well-designed queries should never need to load statistics for all partitions and all columns at once. In practice, queries typically touch only a relatively small subset of partitions and referenced columns, so metadata access should remain proportional to the actual query scope rather than the total table size.
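
To make the scale and access-pattern points concrete, here is a minimal illustrative sketch. It is not Iceberg or Puffin API: `ColumnStats`, `PartitionStatsIndex`, and `total_entries` are hypothetical names, and the in-memory dictionary only stands in for whatever keyed storage would actually hold per-partition column statistics. It shows that 100k partitions × 100 columns yields 10 million entries, while a lookup keyed by (partition, column) touches only the entries a query actually references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-partition, per-column statistics entry."""
    ndv: int  # estimated number of distinct values


def total_entries(num_partitions: int, num_columns: int) -> int:
    """Number of (partition, column) entries if every pair has its own blob."""
    return num_partitions * num_columns


class PartitionStatsIndex:
    """Hypothetical store keyed by (partition, column).

    The dict is an assumption standing in for any structure that supports
    direct keyed lookup; `reads` counts how many keys a load touches.
    """

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], ColumnStats] = {}
        self.reads = 0

    def put(self, partition: str, column: str, stats: ColumnStats) -> None:
        self._entries[(partition, column)] = stats

    def load(self, partitions, columns) -> dict[tuple[str, str], ColumnStats]:
        """Fetch stats only for the partitions and columns a query references."""
        result = {}
        for partition in partitions:
            for column in columns:
                self.reads += 1
                stats = self._entries.get((partition, column))
                if stats is not None:
                    result[(partition, column)] = stats
        return result
```

With such a keyed layout, a query that references 2 partitions and 2 columns performs 4 lookups regardless of how many partitions and columns the table has in total.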
