Hey Iceberg Community,

I'd like to bump this thread to solicit some feedback on the next steps I
wrote in my previous mail. I'd also like to draw attention to the new partition
stat scan API PR <https://github.com/apache/iceberg/pull/14640> that could
be used to read, filter and project partition stats as an initial step.

Any feedback is appreciated!
Gabor

Gábor Kaszab <[email protected]> wrote (on Wed, Nov 5, 2025, 14:59):

> Hey Iceberg Community,
>
> Thank you for taking a look at the proposal
> <https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM>
> and also for the feedback! First of all I'd like to apologise for the long
> delay with my response. I went through the feedback, let me give a summary
> and possible next steps:
>
> *Partition-level column stats*
>   - As a starting point, a scan API could come in handy (with filtering,
> projection etc.) even for the existing partition stats. I've published a
> PR <https://github.com/apache/iceberg/pull/14508> to introduce such an
> API.
>   - There was an ask for this recently on Slack
> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1760925647880099>,
> and also there is a GH issue
> <https://github.com/apache/iceberg/issues/11083> opened earlier.
>   - It would make sense for the partition-level column stats to follow the
> new design of column stats coming with the V4 column stats
> <https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I>
> proposal. Doing that will allow us to project partition-level column stats
> by field and by particular stat too. I should follow up and coordinate with
> that proposal.
>   - If I have missed anything else here, let me know.
>
> *Table-level KLL sketch*
> So far no feedback on this. There is a PR
> <https://github.com/apache/iceberg/pull/8202> for the spec changes for
> this already. This could be a nice addition, and I can cover the
> implementation if there are no objections.
>
> *Table-level column stats (like min/max etc.)*
> So far not much feedback on this. There are open questions about how to
> implement this. I will wait for further feedback and put it on hold for now
> in favor of the two items above.
>
> *File-level avg length and max length*
> These will be included in the V4 stats improvements.
>
> *Partition-level Theta sketches for NDV*
> These seem to consume too much space even with low precision and seem to
> have limited benefits. In case there is a particular use-case for this, let
> me know! Putting it on hold for now.
>
> Any further feedback is appreciated! Thanks!
> Gabor
>
> Jacky Lee <[email protected]> wrote (on Thu, Aug 28, 2025, 15:54):
>
>> Excellent proposal!
>>
>> We’ve internally augmented both table-level and partition-level
>> ColumnStatistics, and observed a 30%+ performance gain in Spark and
>> Trino query execution—largely due to improved Cost-Based Optimization
>> (CBO) effectiveness.
>> However, leveraging the v3 format presented numerous challenges (such
>> as column-type evolution and the way to save min/max values). We
>> believe adopting the v4 format would be a more robust solution.
>>
>>
>> I’ve researched this extensively and applied it in production. I’d be
>> glad to collaborate on implementing this feature if needed.
>>
>>
>> Best wishes.
>>
>> Gábor Kaszab <[email protected]> wrote on Thu, Aug 28, 2025 at 21:23:
>> >
>> > Hey Iceberg Community,
>> >
>> > I've been working on a proposal to extend the currently standardized
>> statistics in Iceberg, by looking into what statistics are used by some
>> query engines and trying to fill the gaps (credit also goes to Denys K for
>> laying the groundwork). The motivation is to use Iceberg as the source of
>> truth when it comes to statistics across all the engines.
>> > Meanwhile, there have been movements on other proposals (Restructuring
>> col-stats, Restructuring metadata) that might overlap with mine. Let’s see
>> how much of my proposal still holds up in light of these developments.
>> >
>> > Any feedback is appreciated!
>> > Gabor
>>
>
