Hi Junwang,

We're discussing how lower and upper bounds for decimal values are stored in manifest files, and whether they remain compatible after type evolution. The bounds are stored as unscaled values without their original scale. Once the decimal type changes, we can no longer interpret these historical bounds correctly: the current type in the metadata tells us only the new scale, not the scale the bounds were written with.

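To make this concrete, here is a minimal Python sketch of the problem. It is not Iceberg's code: the helper names are made up for illustration, and the byte layout (big-endian two's complement of the unscaled value) is only assumed to resemble how the bounds are serialized.

```python
from decimal import Decimal


def to_unscaled_bytes(value: Decimal, scale: int) -> bytes:
    """Encode only the unscaled integer; the scale itself is not written."""
    unscaled = int(value.scaleb(scale))        # 123.45 at scale 2 -> 12345
    length = unscaled.bit_length() // 8 + 1    # leaves room for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)


def from_unscaled_bytes(raw: bytes, scale: int) -> Decimal:
    """Decode the unscaled integer and apply whatever scale the reader supplies."""
    return Decimal(int.from_bytes(raw, "big", signed=True)).scaleb(-scale)


def rescale(value: Decimal, new_scale: int) -> Decimal:
    """Pad a value to a larger scale without changing its numeric value."""
    return value.quantize(Decimal(1).scaleb(-new_scale))


# Bound written while the column was DECIMAL(5,2).
stored = to_unscaled_bytes(Decimal("123.45"), scale=2)

# After evolving to DECIMAL(6,3), the reader only knows the current scale (3).
misread = from_unscaled_bytes(stored, scale=3)

# If the writer's scale (2) were known, the bound could be decoded and rescaled.
correct = rescale(from_unscaled_bytes(stored, scale=2), new_scale=3)

print(misread)   # 12.345
print(correct)   # 123.450
```

The stored bytes are identical in both cases; only the scale supplied by the reader differs, which is why the current type alone is not enough for bounds written before the change.
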
Minglei

Junwang Zhao <zhjw...@gmail.com> wrote on Thu, Sep 11, 2025 at 17:46:

> Hi Minglei,
>
> On Thu, Sep 11, 2025 at 5:35 PM rice Zhang <minglei...@gmail.com> wrote:
> >
> > Hi Ryan,
> >
> > Thank you for your detailed response. I've discussed this issue offline with my team lead, and we've done some deeper investigation into the problem. After reviewing the Decimal Type serialization code in Iceberg, we confirmed that currently only the unscaled value is serialized without storing the scale value. This indeed makes type evolution more complex than initially anticipated. Regarding your mention of v4 adopting columnar metadata for manifests, while I'm not certain which specific format Iceberg will use (perhaps Parquet?), I agree this is a positive direction. However, to properly support decimal scale evolution, I believe Iceberg would need to fundamentally change how decimal types are serialized, regardless of whether using Avro or Parquet. Specifically, we'd need to serialize both the unscaled value AND the scale, not just the unscaled value.
> >
> > Here's an example: Consider a field initially defined as DECIMAL(5,2) with value 123.45 (the serialized unscaled value is 12345). If a user later changes the type to DECIMAL(6,3) - which follows SQL:2011 rules since (p-s) doesn't decrease - reading the old data with the new type would be problematic. Without the original scale being serialized, we can't distinguish whether 12345 represents 123.45 (scale=2) or 12.345 (scale=3), potentially leading to incorrect data interpretation. By serializing the scale alongside the unscaled value, we could correctly read 12345 with scale=2 as 123.450 under the new DECIMAL(6,3) type, avoiding data corruption.
>
> The metadata should have the data type, which includes the scale and precision. Isn't that enough to describe the decimal? Correct me if I'm wrong :)
>
> >
> > I'd like to confirm whether this approach of serializing the scale value is something you consider viable? Or does the community have other better solutions for supporting decimal scale evolution? Also, I'm wondering if you've already discussed specific implementation approaches for decimal type changes? I'm very interested in understanding how v4 plans to address this issue.
> >
> > Minglei
> >
> > Ryan Blue <rdb...@gmail.com> wrote on Thu, Sep 11, 2025 at 03:53:
> >>
> >> Hi Minglei, thanks for the proposal.
> >>
> >> v3 is now closed, so we can't introduce a breaking change like this until v4. We looked into decimal type evolution in v3 and found that due to the way that we currently store lower and upper bounds for decimal values, we can't safely support this in v3 Iceberg manifests. We will need to wait until v4 manifests are introduced with columnar metadata to make this change.
> >>
> >> Ryan
> >>
> >> On Wed, Sep 10, 2025 at 12:28 AM rice Zhang <minglei...@gmail.com> wrote:
> >>>
> >>> Hi Iceberg Community,
> >>>
> >>> I'd like to propose extending Iceberg's type promotion rules to support DECIMAL type evolution with scale changes, aligning with the SQL:2011 standard.
> >>>
> >>> Current Limitation
> >>> Currently, Iceberg only supports DECIMAL type promotion when:
> >>> - Scale remains the same
> >>> - Precision can be increased
> >>>
> >>> This means DECIMAL(10,2) can evolve to DECIMAL(12,2), but not to DECIMAL(12,4).
> >>>
> >>> Proposed Change
> >>> Allow DECIMAL type evolution when:
> >>> 1. Target scale >= source scale
> >>> 2. Target precision >= source precision
> >>> 3. Integer part capacity is preserved: (target_precision - target_scale) >= (source_precision - source_scale)
> >>>
> >>> Examples
> >>> With this change:
> >>> - DECIMAL(10,2) → DECIMAL(12,4) ✓ (integer part: 8 → 8, scale: 2 → 4)
> >>> - DECIMAL(10,2) → DECIMAL(15,5) ✓ (integer part: 8 → 10, scale: 2 → 5)
> >>> - DECIMAL(10,2) → DECIMAL(10,4) ✗ (integer part: 8 → 6, would lose integer capacity)
> >>>
> >>> Rationale
> >>> 1. SQL:2011 Compliance: This behavior aligns with SQL:2011 standard expectations
> >>> 2. User Experience: Many users coming from traditional databases expect this type evolution to work
> >>> 3. Data Safety: The proposed rules ensure no data loss - existing values can always be represented in the new type
> >>> 4. Real-world Use Cases: Common scenarios like adding more decimal precision for currency calculations would be supported
> >>>
> >>> Implementation
> >>> I've created a proof-of-concept implementation: https://github.com/apache/iceberg/issues/14037
> >>>
> >>> Questions for Discussion
> >>> 1. Should this be part of the spec v3, or wait for a future version?
> >>> 2. Are there any backward compatibility concerns we should address?
> >>>
> >>> Looking forward to your feedback and thoughts on this proposal.
> >>>
> >>> Best regards,
> >>> Minglei
>
> --
> Regards
> Junwang Zhao