Hi Denys,

Thanks for raising this! I think extending the Puffin spec with additional column stats would make sense.
I saw the PR for the Puffin spec at some point late last year, and I also had plans to revive it in some form. My motivation is that Impala currently uses a lot of stats from HMS, but the community is actively working on an HMS-free option for Impala that would rely purely on a REST catalog. In that use case, however, we'd lose the stats we previously got from HMS, and extending the Puffin spec with the required ones seemed like a reasonable approach.

There is something I'm hesitant about in the current proposal in the PR, though: the mentioned ColumnStatistics class lives in the Hive repo, and the Iceberg lib has no support for reading or writing it. Basically, Iceberg would standardize those stats into Puffin but would then have no functionality to use them.

I have a slightly different approach in mind: I think we could gather all the column stats needed by the different engines, standardize them in the Iceberg repo similarly to partition stats, add read/write support within the Iceberg repo as well, and then add them to the Puffin spec. What do you think?

Talking about partition stats, just thinking out loud here: aren't partition stats just a more granular form of column stats? I know partition stats don't initially have everything that is in the ColStatistics you shared with us, but wouldn't it make sense to gradually extend them with the missing pieces? That way, aggregating partition stats could give us the column stats, if that's feasible.

Again, this is just me thinking out loud here. Let me know what you think!

Gabor

On Tue, Feb 4, 2025 at 1:36 PM Denys Kuzmenko <dkuzme...@apache.org> wrote:

> sorry, valid Doc PR link:
> https://github.com/apache/iceberg-docs/pull/269