Hi Denys,

Thanks for raising this! I think extending the Puffin spec with additional
column stats would make sense.

I saw the PR for the Puffin spec at some point late last year and I also
had it in my plans to revive it in a way. My motivation is that Impala
currently uses a lot of stats from HMS, but the community is actively
working on an HMS-free option for Impala that would rely purely on a REST
catalog. In that use case, however, we'd lose the stats we got previously
from HMS and it seemed a reasonable approach to extend the Puffin spec with
the required ones.

There is something I'm hesitant about in the current proposal in the PR,
though: the mentioned ColStatistics class lives within the Hive repo and the
Iceberg lib has no support to read/write it. Basically, Iceberg would
standardize those stats into Puffin but then would have no functionality to
use them.

I have a slightly different approach in mind: I think we could gather all
the column stats needed by different engines, standardize them into the
Iceberg repo similarly to partition stats, add read/write support also
within the Iceberg repo and this way we could add them into the Puffin
spec. What do you think?

Talking about partition stats, just thinking out loud here: aren't
partition stats just a more granular form of column stats? I know that
initially partition stats don't have everything that is now in the
ColStatistics you shared with us, but wouldn't it make sense to gradually
extend them with the missing ones? That way, aggregating partition stats
could give us the column stats, if that's feasible. Again, this is just me
thinking out loud here.

Let me know what you think!
Gabor

On Tue, Feb 4, 2025 at 1:36 PM Denys Kuzmenko <dkuzme...@apache.org> wrote:

> sorry, valid Doc PR link:
> https://github.com/apache/iceberg-docs/pull/269
>
