Thank you all for the insightful clarifications on the Column Stats
Improvements proposal. Your explanations really helped me understand the
key aspects of the v4 design; it is both elegant and practical. I have
one follow-up question: will Iceberg follow the SQL:2011 standard for
decimal type evolution?

On Fri, Sep 12, 2025 at 4:18 AM Ryan Blue <rdb...@gmail.com> wrote:

> I think the missing piece is that the lower and upper bounds for a decimal
> value will be promoted when read from the stats struct. Say I have a stats
> struct for column dec: decimal(7,2). Initially, both lower and upper bounds
> will be stored as decimal(7,2) fields in that struct. When the type of the
> column changes, for example, decimal(7,2) to decimal(9,4), values will be
> promoted at read time in both data files and in metadata files. When new
> metadata files are written, they will use the current type. That avoids the
> need to store the scale with every value.
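>
> As a minimal sketch of the idea in plain Java (a hypothetical helper,
> not the actual Iceberg code), promotion is exact because the scale
> only widens:
>
>   import java.math.BigDecimal;
>   import java.math.BigInteger;
>
>   public class PromoteBoundSketch {
>     // Promote a bound stored with its old scale to the column's new
>     // scale. Widening the scale is exact, so setScale never rounds.
>     static BigDecimal promoteBound(BigInteger unscaled, int oldScale, int newScale) {
>       return new BigDecimal(unscaled, oldScale).setScale(newScale);
>     }
>
>     public static void main(String[] args) {
>       // unscaled 12345 stored as decimal(7,2) reads back as 123.4500
>       // under decimal(9,4)
>       System.out.println(promoteBound(BigInteger.valueOf(12345), 2, 4));
>     }
>   }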
>
> On Thu, Sep 11, 2025 at 10:19 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> Micah has it correct: each metric is no longer stored as serialized
>> binary. Instead, each metric will be strongly typed.
>>
>> On Thu, Sep 11, 2025 at 10:55 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Hi Minglei,
>>>
>>> https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>> is the original design doc, which is probably more useful than the sync
>>> meeting minutes.
>>>
>>> but it doesn't seem to solve the fundamental issue we've been discussing
>>>> - the need to serialize the scale information alongside the unscaled value
>>>> to support safe decimal type evolution.
>>>
>>>
>>> As mentioned above in the thread the reorganization is getting rid of
>>> the Map<column ID, Serialized value> in favor of a shredded version that
>>> has a schema for every min/max bounds.  The example in the design doc shows
>>> only int and string, but for decimal it would have the exact precision and
>>> scale for the min/max bounds making the conversion doable.
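>>>
>>> For illustration only (field IDs and names below are made up, not
>>> taken from the design doc), the shredded bounds for a decimal(7,2)
>>> column could be described with Iceberg's type API roughly like:
>>>
>>>   import org.apache.iceberg.types.Types;
>>>   import static org.apache.iceberg.types.Types.NestedField.optional;
>>>
>>>   public class DecimalStatsSketch {
>>>     public static void main(String[] args) {
>>>       // Hypothetical per-column stats struct: the point is that the
>>>       // bounds' schema carries the column's exact precision and scale.
>>>       Types.StructType decBounds = Types.StructType.of(
>>>           optional(1, "lower_bound", Types.DecimalType.of(7, 2)),
>>>           optional(2, "upper_bound", Types.DecimalType.of(7, 2)));
>>>       System.out.println(decBounds);
>>>     }
>>>   }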
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Sep 11, 2025 at 5:30 AM rice Zhang <minglei...@gmail.com> wrote:
>>>
>>>> Hi Russell,
>>>>
>>>> Thanks for pointing me to Eduard's proposal. I think I found the
>>>> document here: 
>>>> https://docs.google.com/document/d/1ZK5g8_bA1Y9SQ4UA5jAREX9iNX56xLWA5vAuKpQC4L8/edit?pli=1&tab=t.v6wlpv1dix8h
>>>>
>>>> After reviewing the meeting notes and discussions, it appears this
>>>> proposal primarily focuses on restructuring the current column statistics
>>>> format (moving from multiple maps to a struct-based structure). However, I
>>>> couldn't find any specific discussion about handling decimal type scale
>>>> evolution. The proposal does make important improvements to the statistics
>>>> structure, but it doesn't seem to solve the fundamental issue we've been
>>>> discussing - the need to serialize the scale information alongside the
>>>> unscaled value to support safe decimal type evolution. Given this, I think
>>>> we need to continue discussing potential solutions for decimal scale
>>>> changes. The core problem remains: without serializing the scale, we cannot
>>>> correctly interpret historical statistics when the decimal type evolves.
>>>>
>>>> Would love to hear your thoughts on how we should proceed with
>>>> addressing this specific issue.
>>>>
>>>> Minglei
>>>>
>>>> On Thu, Sep 11, 2025 at 7:47 PM rice Zhang <minglei...@gmail.com> wrote:
>>>>
>>>>> I couldn't find it in my search - would appreciate any pointers to the
>>>>> proposal or related discussions.
>>>>>
>>>>> On Thu, Sep 11, 2025 at 7:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> This has already been proposed as part of v4; see Eduard's column
>>>>>> metrics expansion proposal.
>>>>>>
>>>>>> On Thu, Sep 11, 2025 at 4:54 AM rice Zhang <minglei...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Junwang,
>>>>>>>
>>>>>>> We're discussing the storage of lower and upper bounds for decimal
>>>>>>> values in manifest files and their compatibility after type evolution.
>>>>>>> The bounds are stored as unscaled values without their original scale,
>>>>>>> so when the decimal type changes, we can't correctly interpret these
>>>>>>> historical bounds even though we know the current type from metadata.
>>>>>>>
>>>>>>> Minglei.
>>>>>>>
>>>>>>> On Thu, Sep 11, 2025 at 5:46 PM Junwang Zhao <zhjw...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Minglei,
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2025 at 5:35 PM rice Zhang <minglei...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Hi Ryan,
>>>>>>>> >
>>>>>>>> > Thank you for your detailed response. I've discussed this issue
>>>>>>>> > offline with my team lead, and we've done some deeper
>>>>>>>> > investigation into the problem. After reviewing the Decimal Type
>>>>>>>> > serialization code in Iceberg, we confirmed that currently only
>>>>>>>> > the unscaled value is serialized without storing the scale value.
>>>>>>>> > This indeed makes type evolution more complex than initially
>>>>>>>> > anticipated. Regarding your mention of v4 adopting columnar
>>>>>>>> > metadata for manifests, while I'm not certain which specific
>>>>>>>> > format Iceberg will use (perhaps Parquet?), I agree this is a
>>>>>>>> > positive direction. However, to properly support decimal scale
>>>>>>>> > evolution, I believe Iceberg would need to fundamentally change
>>>>>>>> > how decimal types are serialized, regardless of whether using
>>>>>>>> > Avro or Parquet. Specifically, we'd need to serialize both the
>>>>>>>> > unscaled value AND the scale, not just the unscaled value.
>>>>>>>> >
>>>>>>>> > Here's an example: Consider a field initially defined as
>>>>>>>> > DECIMAL(5,2) with value 123.45 (the serialized unscaled value is
>>>>>>>> > 12345). If a user later changes the type to DECIMAL(6,3) - which
>>>>>>>> > follows SQL:2011 rules since (p-s) doesn't decrease - reading the
>>>>>>>> > old data with the new type would be problematic. Without the
>>>>>>>> > original scale being serialized, we can't distinguish whether
>>>>>>>> > 12345 represents 123.45 (scale=2) or 12.345 (scale=3),
>>>>>>>> > potentially leading to incorrect data interpretation. By
>>>>>>>> > serializing the scale alongside the unscaled value, we could
>>>>>>>> > correctly read 12345 with scale=2 as 123.450 under the new
>>>>>>>> > DECIMAL(6,3) type, avoiding data corruption.
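>>>>>>>> >
>>>>>>>> > A tiny sketch of that ambiguity in plain Java:
>>>>>>>> >
>>>>>>>> >   import java.math.BigDecimal;
>>>>>>>> >   import java.math.BigInteger;
>>>>>>>> >
>>>>>>>> >   public class ScaleAmbiguitySketch {
>>>>>>>> >     public static void main(String[] args) {
>>>>>>>> >       BigInteger unscaled = BigInteger.valueOf(12345);
>>>>>>>> >       // The same unscaled value decodes to different numbers
>>>>>>>> >       // depending on which scale the reader assumes:
>>>>>>>> >       System.out.println(new BigDecimal(unscaled, 2)); // 123.45
>>>>>>>> >       System.out.println(new BigDecimal(unscaled, 3)); // 12.345
>>>>>>>> >     }
>>>>>>>> >   }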
>>>>>>>>
>>>>>>>> The metadata should have the data type, which includes the scale and
>>>>>>>> precision; isn't that enough to describe the decimal? Correct me if
>>>>>>>> I'm wrong :)
>>>>>>>>
>>>>>>>> >
>>>>>>>> > I'd like to confirm: is serializing the scale value an approach
>>>>>>>> > you consider viable? Or does the community have better solutions
>>>>>>>> > for supporting decimal scale evolution? Also, I'm wondering
>>>>>>>> > whether you've already discussed specific implementation
>>>>>>>> > approaches for decimal type changes. I'm very interested in
>>>>>>>> > understanding how v4 plans to address this issue.
>>>>>>>> >
>>>>>>>> > Minglei
>>>>>>>> >
>>>>>>>> > On Thu, Sep 11, 2025 at 3:53 AM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> Hi Minglei, thanks for the proposal.
>>>>>>>> >>
>>>>>>>> >> v3 is now closed, so we can't introduce a breaking change like
>>>>>>>> >> this until v4. We looked into decimal type evolution in v3 and
>>>>>>>> >> found that due to the way that we currently store lower and upper
>>>>>>>> >> bounds for decimal values, we can't safely support this in v3
>>>>>>>> >> Iceberg manifests. We will need to wait until v4 manifests are
>>>>>>>> >> introduced with columnar metadata to make this change.
>>>>>>>> >>
>>>>>>>> >> Ryan
>>>>>>>> >>
>>>>>>>> >>> On Wed, Sep 10, 2025 at 12:28 AM rice Zhang <minglei...@gmail.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Hi Iceberg Community,
>>>>>>>> >>>
>>>>>>>> >>> I'd like to propose extending Iceberg's type promotion rules to
>>>>>>>> >>> support DECIMAL type evolution with scale changes, aligning with
>>>>>>>> >>> the SQL:2011 standard.
>>>>>>>> >>>
>>>>>>>> >>> Current Limitation
>>>>>>>> >>>   Currently, Iceberg only supports DECIMAL type promotion when:
>>>>>>>> >>>   - Scale remains the same
>>>>>>>> >>>   - Precision can be increased
>>>>>>>> >>>
>>>>>>>> >>>   This means DECIMAL(10,2) can evolve to DECIMAL(12,2), but not
>>>>>>>> >>>   to DECIMAL(12,4).
>>>>>>>> >>>
>>>>>>>> >>> Proposed Change
>>>>>>>> >>>   Allow DECIMAL type evolution when:
>>>>>>>> >>>   1. Target scale >= source scale
>>>>>>>> >>>   2. Target precision >= source precision
>>>>>>>> >>>   3. Integer part capacity is preserved:
>>>>>>>> >>>      (target_precision - target_scale) >= (source_precision - source_scale)
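>>>>>>>> >>>
>>>>>>>> >>>   As a minimal sketch (a hypothetical helper, not an existing
>>>>>>>> >>>   Iceberg API), the check could look like:
>>>>>>>> >>>
>>>>>>>> >>>     // Implements the three rules above.
>>>>>>>> >>>     // e.g. canPromoteDecimal(10,2,12,4) -> true;
>>>>>>>> >>>     //      canPromoteDecimal(10,2,10,4) -> false
>>>>>>>> >>>     static boolean canPromoteDecimal(int srcP, int srcS, int dstP, int dstS) {
>>>>>>>> >>>       return dstS >= srcS                     // 1. scale may only grow
>>>>>>>> >>>           && dstP >= srcP                     // 2. precision may only grow
>>>>>>>> >>>           && (dstP - dstS) >= (srcP - srcS);  // 3. integer capacity preserved
>>>>>>>> >>>     }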
>>>>>>>> >>>
>>>>>>>> >>> Examples
>>>>>>>> >>>   With this change:
>>>>>>>> >>>   - DECIMAL(10,2) → DECIMAL(12,4) ✓ (integer part: 8 → 8,
>>>>>>>> >>>     scale: 2 → 4)
>>>>>>>> >>>   - DECIMAL(10,2) → DECIMAL(15,5) ✓ (integer part: 8 → 10,
>>>>>>>> >>>     scale: 2 → 5)
>>>>>>>> >>>   - DECIMAL(10,2) → DECIMAL(10,4) ✗ (integer part: 8 → 6,
>>>>>>>> >>>     would lose integer capacity)
>>>>>>>> >>>
>>>>>>>> >>> Rationale
>>>>>>>> >>>   1. SQL:2011 Compliance: This behavior aligns with SQL:2011
>>>>>>>> >>>      standard expectations
>>>>>>>> >>>   2. User Experience: Many users coming from traditional
>>>>>>>> >>>      databases expect this type evolution to work
>>>>>>>> >>>   3. Data Safety: The proposed rules ensure no data loss -
>>>>>>>> >>>      existing values can always be represented in the new type
>>>>>>>> >>>   4. Real-world Use Cases: Common scenarios like adding more
>>>>>>>> >>>      decimal precision for currency calculations would be
>>>>>>>> >>>      supported
>>>>>>>> >>>
>>>>>>>> >>> Implementation
>>>>>>>> >>>   I've created a proof-of-concept implementation:
>>>>>>>> >>>   https://github.com/apache/iceberg/issues/14037
>>>>>>>> >>>
>>>>>>>> >>> Questions for Discussion
>>>>>>>> >>>   1. Should this be part of the spec v3, or wait for a future
>>>>>>>> >>>      version?
>>>>>>>> >>>   2. Are there any backward compatibility concerns we should
>>>>>>>> >>>      address?
>>>>>>>> >>>
>>>>>>>> >>> Looking forward to your feedback and thoughts on this proposal.
>>>>>>>> >>>
>>>>>>>> >>> Best regards,
>>>>>>>> >>> Minglei
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards
>>>>>>>> Junwang Zhao
>>>>>>>>
>>>>>>>
