Re: [DISCUSS] v4 - Improved column statistics

Eduard Tudenhöfner Wed, 13 May 2026 07:36:52 -0700

Hey everyone,

It's been a while since the last update, but I wanted to raise awareness
that we've been iterating on the Content Stats Spec changes in
#14234 <https://github.com/apache/iceberg/pull/14234>.
Please take a look and let me know your thoughts.


Thanks
Eduard

On Fri, Sep 19, 2025 at 9:31 AM Eduard Tudenhöfner <[email protected]>
wrote:

> Hey everyone,
>
> I have updated the proposal
> <https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0#heading=h.hs6r9d26w1y2>
> with the following changes:
>
>    - removed *column_size*, since this hasn't been used anywhere in
>    earlier versions. Please shout if you think we should keep this going
>    forward.
>    - added *avg_value_size* and *max_value_size* for the average and
>    maximum value sizes of variable-length types (string/binary)
>    - the examples in the proposal were using *1_417_000_000* as the
>    starting stats ID for the reserved field ID space, but that should have
>    been *2_147_000_000*: we have 200 reserved IDs * 200 stats types =
>    40k stats IDs, and starting at *2_147_000_000* leaves enough room in
>    case we decide to add other ID spaces
>
> If people are OK with this, I think we should be able to vote on the
> design proposal so that we can get the first portions of the code
> <https://github.com/apache/iceberg/pull/13933> in, which would allow
> parallelizing the downstream work.
>
>
> Thanks
> Eduard
>
> On Wed, Aug 20, 2025 at 3:05 PM Eduard Tudenhöfner <
> [email protected]> wrote:
>
>> Hey everyone,
>>
>> We met yesterday and talked about some details around the stats proposal.
>>
>> Please find the notes here
>> <https://docs.google.com/document/d/1ZK5g8_bA1Y9SQ4UA5jAREX9iNX56xLWA5vAuKpQC4L8/edit?usp=sharing>
>> and the recording here
>> <https://drive.google.com/file/d/1YIILCIhDbgu3OYlMn5KNChsYFP8rGPPX/view?usp=sharing>.
>>
>> I have updated the proposal <https://s.apache.org/iceberg-column-stats>
>> with the following points:
>>
>>    - added a table schema example with a detailed stats schema
>>    - updated wording to make it clear that projection is always by ID
>>    and the field name of a stats field should not be relied on
>>    - added a table that defines current field stats types with their
>>    respective offsets from the field ID of the base stats struct
>>    - updated wording to make it clear that stats are calculated for
>>    assigned field IDs that are
>>       - defined in the table ID space (Amogh is working on a separate
>>       proposal to unify ID spaces)
>>       - defined in the reserved field ID
>>       <https://iceberg.apache.org/spec/#reserved-field-ids> space
>>    - added some examples showing table ID -> stats ID of stats struct
>>    and also the stats ID of individual stats fields
>>    - updated wording to explain how variant stats would look in the new
>>    stats structure
>>    - updated wording to make it clear that custom stats are not
>>    supported and that expressions are the preferred way
>>
>> Please let me know if I missed anything that should be included.
>>
>> Thanks everyone for participating,
>>
>> Eduard
>>
>>
>>

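The ID-space arithmetic from the Sep 19 update (200 reserved IDs * 200 stats types = 40k, starting at 2_147_000_000) can be checked with a quick sketch. This is only an illustration, not code from the proposal or the PRs. It assumes that field IDs are bounded by a signed 32-bit integer and that the stats IDs for the reserved space form one contiguous block; all variable names are made up for this example.

```python
# Sanity check of the reserved stats ID range described in the thread.
# Assumption (not stated in the thread): field IDs are signed 32-bit ints,
# so the largest usable ID is 2**31 - 1.
INT_MAX = 2**31 - 1

RESERVED_IDS = 200               # number of reserved field IDs, per the email
STATS_TYPES_PER_FIELD = 200      # stats types per field, per the email
STATS_ID_START = 2_147_000_000   # corrected starting stats ID, per the email

# Assumption: one contiguous block of stats IDs covers the reserved space.
ids_needed = RESERVED_IDS * STATS_TYPES_PER_FIELD
last_stats_id = STATS_ID_START + ids_needed - 1
headroom = INT_MAX - last_stats_id

print(f"stats IDs needed: {ids_needed:_}")
print(f"last stats ID:    {last_stats_id:_}")
print(f"IDs left above:   {headroom:_}")
```

Under these assumptions the block needs 40,000 IDs and ends well below the 32-bit limit, which matches the "leaves enough room" remark in the update.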