Hey Iceberg Community,

I'd like to bump this thread to solicit feedback on the next steps I outlined in my previous mail. I'd also like to draw attention to the new partition stats scan API PR <https://github.com/apache/iceberg/pull/14640>, which could be used to read, filter and project partition stats as an initial step.

Any feedback is appreciated!
Gabor

Gábor Kaszab <[email protected]> wrote (on Wed, Nov 5, 2025, 14:59):

> Hey Iceberg Community,
>
> Thank you for taking a look at the proposal
> <https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM>
> and also for the feedback! First of all, I'd like to apologise for the long
> delay with my response. I went through the feedback; let me give a summary
> and possible next steps:
>
> *Partition-level column stats*
> - As a starting point, a scan API could come in handy (with filtering,
> projection, etc.) even for the existing partition stats. I've published a
> PR <https://github.com/apache/iceberg/pull/14508> to introduce such an
> API.
> - There was an ask for this recently on Slack
> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1760925647880099>,
> and there is also a GH issue
> <https://github.com/apache/iceberg/issues/11083> opened earlier.
> - It would make sense for the partition-level column stats to follow the
> new design of column stats coming with the V4 column stats
> <https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I>
> proposal. Doing that will allow us to project partition-level column stats
> by field and by particular stat too. I should follow up and coordinate with
> that proposal.
> - If there is anything else I missed here, let me know.
>
> *Table-level KLL sketch*
> So far no feedback on this. There is already a PR
> <https://github.com/apache/iceberg/pull/8202> for the spec changes for
> this. This could be a nice addition; I can cover the implementation
> if there are no objections.
>
> *Table-level column stats (like min/max etc.)*
> So far not much feedback on this. There are open questions wrt how to
> implement this. I will wait for further feedback, putting it on hold for now
> in favor of the above 2 items.
>
> *File-level avg length and max length*
> These will be included in the V4 stats improvements.
>
> *Partition-level Theta sketches for NDV*
> These seem to consume too much space even with low precision and seem to
> have limited benefits. In case there is a particular use-case for this, let
> me know! Putting it on hold for now.
>
> Any further feedback is appreciated! Thanks!
> Gabor
>
> Jacky Lee <[email protected]> wrote (on Thu, Aug 28, 2025, 15:54):
>
>> Excellent proposal!
>>
>> We’ve internally augmented both table-level and partition-level
>> ColumnStatistics, and observed a 30%+ performance gain in Spark and
>> Trino query execution, largely due to improved Cost-Based Optimization
>> (CBO) effectiveness.
>> However, leveraging the v3 format presented numerous challenges (such
>> as column-type evolution and the way to save min/max values). We
>> believe adopting the v4 format would be a more robust solution.
>>
>> I’ve researched this extensively and applied it in production. I’d be
>> glad to collaborate on implementing this feature if needed.
>>
>> Best wishes.
>>
>> Gábor Kaszab <[email protected]> wrote (on Thu, Aug 28, 2025, 21:23):
>> >
>> > Hey Iceberg Community,
>> >
>> > I've been working on a proposal to extend the currently standardized
>> > statistics in Iceberg, by looking into what statistics are used by some
>> > query engines and trying to fill the gaps (credit also goes to Denys K for
>> > laying the groundwork). The motivation is to use Iceberg as the source of
>> > truth when it comes to statistics across all the engines.
>> > Meanwhile, there have been movements on other proposals (Restructuring
>> > col-stats, Restructuring metadata) that might overlap with mine. Let’s see
>> > how much of my proposal still holds up in light of these developments.
>> >
>> > Any feedback is appreciated!
>> > Gabor
>> >
