Hi Gabor,

Thanks for the update.
I will take a look at the partition stat scan API PR.

Regards,
JB

On Mon, Dec 1, 2025 at 4:02 PM Gábor Kaszab <[email protected]> wrote:

> Hey Iceberg Community,
>
> I'd like to bump this thread to solicit some feedback on the next steps I
> wrote in my previous mail. I'd also like to draw attention to the new
> partition stat scan API PR <https://github.com/apache/iceberg/pull/14640>
> that could be used to read, filter and project partition stats as an
> initial step.
>
> Any feedback is appreciated!
> Gabor
>
> Gábor Kaszab <[email protected]> wrote (on Wed, Nov 5, 2025, 14:59):
>
>> Hey Iceberg Community,
>>
>> Thank you for taking a look at the proposal
>> <https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM>
>> and also for the feedback! First of all I'd like to apologise for the long
>> delay with my response. I went through the feedback, let me give a summary
>> and possible next steps:
>>
>> *Partition-level column stats*
>> - As a starting point a scan API could come in handy (with filtering,
>> projection etc.) even for the existing partition stats. I've published a
>> PR <https://github.com/apache/iceberg/pull/14508> to introduce such an
>> API.
>> - There was an ask for this recently on Slack
>> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1760925647880099>,
>> and also there is a GH issue
>> <https://github.com/apache/iceberg/issues/11083> opened earlier.
>> - It would make sense for the partition-level column stats to follow
>> the new design of column stats coming with the V4 column stats
>> <https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I>
>> proposal. Doing that will allow us to project partition-level column stats
>> by field and by particular stat too. We should follow up and coordinate
>> with that proposal.
>> - If there is anything else I missed here, let me know.
>>
>> *Table-level KLL sketch*
>> So far no feedback on this.
>> There is a PR
>> <https://github.com/apache/iceberg/pull/8202> for the spec changes for
>> this already. This could be a nice addition, I can cover the implementation
>> if there are no objections.
>>
>> *Table-level column stats (like min/max etc.)*
>> So far not much feedback on this. There are open questions wrt how to
>> implement this. Will wait for further feedback, putting it on hold for now
>> in favor of the above 2 items.
>>
>> *File-level avg length and max length*
>> These will be included in the V4 stats improvements.
>>
>> *Partition-level Theta sketches for NDV*
>> These seem to consume too much space even with low precision and seem to
>> have limited benefits. In case there is a particular use-case for this, let
>> me know! Putting it on hold for now.
>>
>> Any further feedback is appreciated! Thanks!
>> Gabor
>>
>> Jacky Lee <[email protected]> wrote (on Thu, Aug 28, 2025, 15:54):
>>
>>> Excellent proposal!
>>>
>>> We’ve internally augmented both table-level and partition-level
>>> ColumnStatistics, and observed a 30%+ performance gain in Spark and
>>> Trino query execution, largely due to improved Cost-Based Optimization
>>> (CBO) effectiveness.
>>> However, leveraging the v3 format presented numerous challenges (such
>>> as column-type evolution and the way to save min/max values). We
>>> believe adopting the v4 format would be a more robust solution.
>>>
>>> I’ve researched this extensively and applied it in production. I’d be
>>> glad to collaborate on implementing this feature if needed.
>>>
>>> Best wishes.
>>>
>>> Gábor Kaszab <[email protected]> wrote on Thu, Aug 28, 2025 at 21:23:
>>> >
>>> > Hey Iceberg Community,
>>> >
>>> > I've been working on a proposal to extend the currently standardized
>>> > statistics in Iceberg, by looking into what statistics are used by some
>>> > query engines and trying to fill the gaps (credit also goes to Denys K
>>> > for laying the groundwork).
>>> > The motivation is to use Iceberg as the source of truth
>>> > when it comes to statistics across all the engines.
>>> >
>>> > Meanwhile, there have been movements on other proposals (Restructuring
>>> > col-stats, Restructuring metadata) that might overlap with mine. Let’s see
>>> > how much of my proposal still holds up in light of these developments.
>>> >
>>> > Any feedback is appreciated!
>>> > Gabor
