Thank you all for the insightful clarifications on the Column Stats Improvements proposal. Your explanations really helped me understand the key aspects of the v4 design. This design is both elegant and practical. I have one follow-up question: will Iceberg follow the SQL:2011 standard for decimal type evolution?
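To make sure I understand the read-time promotion correctly, here is a quick sketch of how a stored bound would be rescaled when the scale grows. This is plain Python for illustration, not Iceberg code; `promote_unscaled` is just an illustrative name. Because the stats struct's schema records the type (and thus the scale) each bound was written with, widening the scale is a lossless rescale of the unscaled value:

```python
from decimal import Decimal

def promote_unscaled(unscaled: int, old_scale: int, new_scale: int) -> int:
    """Rescale a stored unscaled decimal bound when the column's scale grows.

    Widening the scale is lossless: multiply by 10 ** (new_scale - old_scale).
    (Hypothetical helper for illustration; not an Iceberg API.)
    """
    if new_scale < old_scale:
        raise ValueError("narrowing the scale could lose data")
    return unscaled * 10 ** (new_scale - old_scale)

# A lower bound stored for a decimal(7,2) column: unscaled 12345 means 123.45.
old_bound = 12345

# After the column evolves to decimal(9,4), the bound is promoted at read
# time; newly written metadata would then carry the promoted value directly.
new_bound = promote_unscaled(old_bound, old_scale=2, new_scale=4)
print(new_bound)                      # 1234500
print(Decimal(new_bound).scaleb(-4))  # 123.4500, the same numeric value
```

The key point is that the rescale is only well defined when the original scale is known, which is exactly what the per-column typed bounds in the v4 stats struct provide.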
Ryan Blue <rdb...@gmail.com> wrote on Fri, Sep 12, 2025 at 04:18:

> I think the missing piece is that the lower and upper bounds for a decimal
> value will be promoted when read from the stats struct. Say I have a stats
> struct for column dec: decimal(7,2). Initially, both lower and upper bounds
> will be stored as decimal(7,2) fields in that struct. When the type of the
> column changes, for example, decimal(7,2) to decimal(9,4), values will be
> promoted at read time in both data files and in metadata files. When new
> metadata files are written, they will use the current type. That avoids the
> need to store the scale with every value.
>
> On Thu, Sep 11, 2025 at 10:19 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> Micah has it correct: since each metric is no longer stored as serialized
>> binary, each metric will instead be strongly typed.
>>
>> On Thu, Sep 11, 2025 at 10:55 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>>> Hi Minglei,
>>>
>>> https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>> is the original design doc, which is probably more useful than the sync
>>> meeting minutes.
>>>
>>>> but it doesn't seem to solve the fundamental issue we've been discussing
>>>> - the need to serialize the scale information alongside the unscaled value
>>>> to support safe decimal type evolution.
>>>
>>> As mentioned above in the thread, the reorganization is getting rid of
>>> the Map<column ID, serialized value> in favor of a shredded version that
>>> has a schema for every min/max bound. The example in the design doc shows
>>> only int and string, but for decimal it would have the exact precision
>>> and scale for the min/max bounds, making the conversion doable.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Sep 11, 2025 at 5:30 AM rice Zhang <minglei...@gmail.com> wrote:
>>>
>>>> Hi Russell,
>>>>
>>>> Thanks for pointing me to Eduard's proposal.
>>>> I think I found the document here:
>>>> https://docs.google.com/document/d/1ZK5g8_bA1Y9SQ4UA5jAREX9iNX56xLWA5vAuKpQC4L8/edit?pli=1&tab=t.v6wlpv1dix8h
>>>>
>>>> After reviewing the meeting notes and discussions, it appears this
>>>> proposal primarily focuses on restructuring the current column statistics
>>>> format (moving from multiple maps to a struct-based structure). However,
>>>> I couldn't find any specific discussion about handling decimal type scale
>>>> evolution. The proposal does make important improvements to the
>>>> statistics structure, but it doesn't seem to solve the fundamental issue
>>>> we've been discussing - the need to serialize the scale information
>>>> alongside the unscaled value to support safe decimal type evolution.
>>>> Given this, I think we need to continue discussing potential solutions
>>>> for decimal scale changes. The core problem remains: without serializing
>>>> the scale, we cannot correctly interpret historical statistics when the
>>>> decimal type evolves.
>>>>
>>>> Would love to hear your thoughts on how we should proceed with
>>>> addressing this specific issue.
>>>>
>>>> Minglei
>>>>
>>>> rice Zhang <minglei...@gmail.com> wrote on Thu, Sep 11, 2025 at 19:47:
>>>>
>>>>> I couldn't find it in my search - would appreciate any pointers to the
>>>>> proposal or related discussions.
>>>>>
>>>>> Russell Spitzer <russell.spit...@gmail.com> wrote on Thu, Sep 11, 2025 at 19:32:
>>>>>
>>>>>> This has already been proposed as part of v4; see Eduard's column
>>>>>> metrics expansion proposal.
>>>>>>
>>>>>> On Thu, Sep 11, 2025 at 4:54 AM rice Zhang <minglei...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, Junwang
>>>>>>>
>>>>>>> We're discussing the storage of lower and upper bounds for decimal
>>>>>>> values in manifest files and their compatibility after type evolution.
>>>>>>> The bounds are stored as unscaled values without their original
>>>>>>> scale, so when the decimal type changes, we can't correctly interpret
>>>>>>> these historical bounds even though we know the current type from
>>>>>>> metadata.
>>>>>>>
>>>>>>> Minglei
>>>>>>>
>>>>>>> Junwang Zhao <zhjw...@gmail.com> wrote on Thu, Sep 11, 2025 at 17:46:
>>>>>>>
>>>>>>>> Hi Minglei,
>>>>>>>>
>>>>>>>> On Thu, Sep 11, 2025 at 5:35 PM rice Zhang <minglei...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi Ryan,
>>>>>>>> >
>>>>>>>> > Thank you for your detailed response. I've discussed this issue
>>>>>>>> > offline with my team lead, and we've done some deeper investigation
>>>>>>>> > into the problem. After reviewing the decimal type serialization
>>>>>>>> > code in Iceberg, we confirmed that currently only the unscaled
>>>>>>>> > value is serialized, without the scale. This indeed makes type
>>>>>>>> > evolution more complex than initially anticipated. Regarding your
>>>>>>>> > mention of v4 adopting columnar metadata for manifests, while I'm
>>>>>>>> > not certain which specific format Iceberg will use (perhaps
>>>>>>>> > Parquet?), I agree this is a positive direction. However, to
>>>>>>>> > properly support decimal scale evolution, I believe Iceberg would
>>>>>>>> > need to fundamentally change how decimal types are serialized,
>>>>>>>> > regardless of whether it uses Avro or Parquet. Specifically, we'd
>>>>>>>> > need to serialize both the unscaled value AND the scale, not just
>>>>>>>> > the unscaled value.
>>>>>>>> >
>>>>>>>> > Here's an example: consider a field initially defined as
>>>>>>>> > DECIMAL(5,2) with value 123.45 (the serialized unscaled value is
>>>>>>>> > 12345). If a user later changes the type to DECIMAL(6,3) - which
>>>>>>>> > follows SQL:2011 rules since (p-s) doesn't decrease - reading the
>>>>>>>> > old data with the new type would be problematic.
>>>>>>>> > Without the original scale being serialized, we can't distinguish
>>>>>>>> > whether 12345 represents 123.45 (scale=2) or 12.345 (scale=3),
>>>>>>>> > potentially leading to incorrect data interpretation. By
>>>>>>>> > serializing the scale alongside the unscaled value, we could
>>>>>>>> > correctly read 12345 with scale=2 as 123.450 under the new
>>>>>>>> > DECIMAL(6,3) type, avoiding data corruption.
>>>>>>>>
>>>>>>>> The metadata should have the data type, which includes the scale and
>>>>>>>> precision; isn't that enough to describe the decimal? Correct me if
>>>>>>>> I'm wrong :)
>>>>>>>>
>>>>>>>> > I'd like to confirm whether this approach of serializing the scale
>>>>>>>> > value is something you consider viable. Or does the community have
>>>>>>>> > other, better solutions for supporting decimal scale evolution?
>>>>>>>> > Also, I'm wondering if you've already discussed specific
>>>>>>>> > implementation approaches for decimal type changes. I'm very
>>>>>>>> > interested in understanding how v4 plans to address this issue.
>>>>>>>> >
>>>>>>>> > Minglei
>>>>>>>> >
>>>>>>>> > Ryan Blue <rdb...@gmail.com> wrote on Thu, Sep 11, 2025 at 03:53:
>>>>>>>> >>
>>>>>>>> >> Hi Minglei, thanks for the proposal.
>>>>>>>> >>
>>>>>>>> >> v3 is now closed, so we can't introduce a breaking change like
>>>>>>>> >> this until v4. We looked into decimal type evolution in v3 and
>>>>>>>> >> found that, due to the way we currently store lower and upper
>>>>>>>> >> bounds for decimal values, we can't safely support this in v3
>>>>>>>> >> Iceberg manifests. We will need to wait until v4 manifests are
>>>>>>>> >> introduced with columnar metadata to make this change.
>>>>>>>> >>
>>>>>>>> >> Ryan
>>>>>>>> >>
>>>>>>>> >> On Wed, Sep 10, 2025 at 12:28 AM rice Zhang <minglei...@gmail.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Hi Iceberg Community,
>>>>>>>> >>>
>>>>>>>> >>> I'd like to propose extending Iceberg's type promotion rules to
>>>>>>>> >>> support DECIMAL type evolution with scale changes, aligning with
>>>>>>>> >>> the SQL:2011 standard.
>>>>>>>> >>>
>>>>>>>> >>> Current Limitation
>>>>>>>> >>> Currently, Iceberg only supports DECIMAL type promotion when:
>>>>>>>> >>> - Scale remains the same
>>>>>>>> >>> - Precision can be increased
>>>>>>>> >>>
>>>>>>>> >>> This means DECIMAL(10,2) can evolve to DECIMAL(12,2), but not to
>>>>>>>> >>> DECIMAL(12,4).
>>>>>>>> >>>
>>>>>>>> >>> Proposed Change
>>>>>>>> >>> Allow DECIMAL type evolution when:
>>>>>>>> >>> 1. Target scale >= source scale
>>>>>>>> >>> 2. Target precision >= source precision
>>>>>>>> >>> 3. Integer part capacity is preserved: (target_precision -
>>>>>>>> >>>    target_scale) >= (source_precision - source_scale)
>>>>>>>> >>>
>>>>>>>> >>> Examples
>>>>>>>> >>> With this change:
>>>>>>>> >>> - DECIMAL(10,2) → DECIMAL(12,4) ✓ (integer part: 8 → 8, scale: 2 → 4)
>>>>>>>> >>> - DECIMAL(10,2) → DECIMAL(15,5) ✓ (integer part: 8 → 10, scale: 2 → 5)
>>>>>>>> >>> - DECIMAL(10,2) → DECIMAL(10,4) ✗ (integer part: 8 → 6, would
>>>>>>>> >>>   lose integer capacity)
>>>>>>>> >>>
>>>>>>>> >>> Rationale
>>>>>>>> >>> 1. SQL:2011 Compliance: This behavior aligns with SQL:2011
>>>>>>>> >>>    standard expectations
>>>>>>>> >>> 2. User Experience: Many users coming from traditional databases
>>>>>>>> >>>    expect this type evolution to work
>>>>>>>> >>> 3. Data Safety: The proposed rules ensure no data loss - existing
>>>>>>>> >>>    values can always be represented in the new type
>>>>>>>> >>> 4. Real-world Use Cases: Common scenarios like adding more
>>>>>>>> >>>    decimal precision for currency calculations would be supported
>>>>>>>> >>>
>>>>>>>> >>> Implementation
>>>>>>>> >>> I've created a proof-of-concept implementation:
>>>>>>>> >>> https://github.com/apache/iceberg/issues/14037
>>>>>>>> >>>
>>>>>>>> >>> Questions for Discussion
>>>>>>>> >>> 1. Should this be part of spec v3, or wait for a future version?
>>>>>>>> >>> 2. Are there any backward compatibility concerns we should
>>>>>>>> >>>    address?
>>>>>>>> >>>
>>>>>>>> >>> Looking forward to your feedback and thoughts on this proposal.
>>>>>>>> >>>
>>>>>>>> >>> Best regards,
>>>>>>>> >>> Minglei
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards
>>>>>>>> Junwang Zhao
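For reference, the three promotion conditions in the original proposal quoted above (scale may not shrink, precision may not shrink, and integer-digit capacity must be preserved) can be expressed as a small predicate. This is a sketch in Python; `can_promote` is an illustrative name, not an Iceberg API:

```python
def can_promote(src_precision: int, src_scale: int,
                dst_precision: int, dst_scale: int) -> bool:
    """Check the proposed SQL:2011-style decimal promotion rules:
    the scale may not shrink, the precision may not shrink, and the
    integer-digit capacity (precision - scale) must be preserved."""
    return (dst_scale >= src_scale
            and dst_precision >= src_precision
            and (dst_precision - dst_scale) >= (src_precision - src_scale))

# The examples from the proposal:
print(can_promote(10, 2, 12, 4))  # True: integer digits 8 -> 8
print(can_promote(10, 2, 15, 5))  # True: integer digits 8 -> 10
print(can_promote(10, 2, 10, 4))  # False: integer digits 8 -> 6, lossy
```

Condition 3 is what guarantees no data loss: any value representable with `src_precision - src_scale` integer digits still fits after promotion, and the extra fractional digits are filled with trailing zeros.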