Re: Proposal to extend standardized statistics

Jean-Baptiste Onofré Mon, 01 Dec 2025 22:48:57 -0800

Hi Gabor,

Thanks for the update.


I will take a look at the partition stat scan API PR.

Regards,
JB

On Mon, Dec 1, 2025 at 4:02 PM Gábor Kaszab <[email protected]> wrote:

> Hey Iceberg Community,
>
> I'd like to bump this thread to solicit some feedback on the next steps I
> wrote in my previous mail. I'd also like to draw attention to the new partition
> stat scan API PR <https://github.com/apache/iceberg/pull/14640> that
> could be used to read, filter and project partition stats as an initial
> step.
>
> Any feedback is appreciated!
> Gabor
>
> Gábor Kaszab <[email protected]> wrote (on Wed, Nov 5, 2025, 14:59):
>
>> Hey Iceberg Community,
>>
>> Thank you for taking a look at the proposal
>> <https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM>
>> and also for the feedback! First of all I'd like to apologise for the long
>> delay with my response. I went through the feedback, let me give a summary
>> and possible next steps:
>>
>> *Partition-level column stats*
>>   - As a starting point, a scan API could come in handy (with filtering,
>> projection etc.) even for the existing partition stats. I've published a
>> PR <https://github.com/apache/iceberg/pull/14508> to introduce such an
>> API.
>>   - There was an ask for this recently on Slack
>> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1760925647880099>,
>> and also there is a GH issue
>> <https://github.com/apache/iceberg/issues/11083> opened earlier.
>>   - It would make sense for the partition-level column stats to follow
>> the new design of column stats coming with the V4 column stats
>> <https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I>
>> proposal. Doing that will allow us to project partition-level column stats
>> by field and by particular stat too. We should follow up and coordinate with
>> that proposal.
>>   - If I have missed anything else here, let me know.
>>
>> *Table-level KLL sketch*
>> So far no feedback on this. There is a PR
>> <https://github.com/apache/iceberg/pull/8202> for the spec changes for
>> this already. This could be a nice addition; I can cover the implementation
>> if there are no objections.
>>
>> *Table-level column stats (like min/max etc.)*
>> So far not much feedback on this. There are open questions wrt how to
>> implement this. Will wait for further feedback, putting it on hold for now
>> in favor of the above 2 items.
>>
>> *File-level avg length and max length*
>> These will be included in the V4 stats improvements.
>>
>> *Partition-level Theta sketches for NDV*
>> These seem to consume too much space even with low precision and seem to
>> have limited benefits. In case there is a particular use-case for this, let
>> me know! Putting it on hold for now.
>>
>> Any further feedback is appreciated! Thanks!
>> Gabor
>>
>> Jacky Lee <[email protected]> wrote (on Thu, Aug 28, 2025, 15:54):
>>
>>> Excellent proposal!
>>>
>>> We’ve internally augmented both table-level and partition-level
>>> ColumnStatistics, and observed a 30%+ performance gain in Spark and
>>> Trino query execution—largely due to improved Cost-Based Optimization
>>> (CBO) effectiveness.
>>> However, leveraging the v3 format presented numerous challenges (such
>>> as column-type evolution and the way to save min/max values). We
>>> believe adopting the v4 format would be a more robust solution.
>>>
>>>
>>> I’ve researched this extensively and applied it in production. I’d be
>>> glad to collaborate on implementing this feature if needed.
>>>
>>>
>>> Best wishes.
>>>
>>> Gábor Kaszab <[email protected]> wrote on Thu, Aug 28, 2025 at 21:23:
>>> >
>>> > Hey Iceberg Community,
>>> >
>>> > I've been working on a proposal to extend the currently standardized
>>> statistics in Iceberg, by looking into what statistics are used by some
>>> query engines and trying to fill the gaps (credit also goes to Denys K for
>>> laying the groundwork). The motivation is to use Iceberg as the source of truth
>>> when it comes to statistics across all the engines.
>>> > Meanwhile, there have been movements on other proposals (Restructuring
>>> col-stats, Restructuring metadata) that might overlap with mine. Let’s see
>>> how much of my proposal still holds up in light of these developments.
>>> >
>>> > Any feedback is appreciated!
>>> > Gabor
>>>
>>
