Re: [DISCUSS] Proposal: Delta-Encoded Schemas in v4, to Address Metadata Bloat

Russell Spitzer Thu, 12 Feb 2026 10:45:33 -0800

For very wide tables, I think this becomes a problem after a single-digit
number of schema changes. My theoretical example is a table with 1000
columns that we add new columns to every hour or so. Unless I want to keep
my history locked to 24 hours (or less), schema bloat is gonna be a pretty
big issue.
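
To put rough numbers on that example, here is a minimal back-of-the-envelope
sketch. It assumes exactly one column is added per schema change and counts
stored field entries rather than bytes; the function names are made up for
illustration and are not part of Iceberg.

```python
def full_copy_field_entries(base_columns: int, versions: int, added_per_change: int = 1) -> int:
    """Field entries stored when every retained schema version is a full copy."""
    # Version i (0-based) carries the base columns plus everything added so far.
    return sum(base_columns + i * added_per_change for i in range(versions))


def delta_field_entries(base_columns: int, versions: int, added_per_change: int = 1) -> int:
    """Field entries stored with one full base schema plus one delta per later version."""
    return base_columns + (versions - 1) * added_per_change


# 1000 columns, one change per hour, 24 hours of retained history (assumed).
full = full_copy_field_entries(1000, 24)
delta = delta_field_entries(1000, 24)
print(full, delta)
```

Under those assumptions, 24 retained versions hold 24,276 field entries as
full copies versus 1,023 with a single base plus deltas, and the gap widens
linearly with every extra hour of history kept.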


On Thu, Feb 12, 2026 at 10:37 AM Ryan Blue <[email protected]> wrote:

> For tables where this is a problem, how are you currently managing older
> schemas? Older schemas do not need to be kept if there aren't any snapshots
> that reference them.
>
> On Thu, Feb 12, 2026 at 10:24 AM Russell Spitzer <
> [email protected]> wrote:
>
>> My gut instinct on this is that it's a great idea. I think we probably
>> need to think a bit more about how to decide on "base" schema promotion but
>> theoretically this seems like it should be a huge benefit for wide tables.
>>
>> On Thu, Feb 12, 2026 at 7:55 AM Talat Uyarer via dev <
>> [email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I am sharing a new proposal for Iceberg Spec v4: *Delta-Encoded Schemas*.
>>> We propose moving away from monolithic schema storage to address a
>>> growing scalability bottleneck in high-velocity and ultra-wide table
>>> environments.
>>>
>>> The current Iceberg Spec re-serializes and appends the entire schema
>>> object to metadata.json for every schema operation, which leads to
>>> massive schema data replication. For a large table with 5,000+ columns
>>> and frequent schema updates, this can result in metadata files that grow
>>> to gigabytes, causing significant query planning latency and driver-side
>>> OOMs.
>>>
>>> *Proposal Summary:*
>>>
>>> We propose implementing *Delta-Encoded Schema Evolution for Spec v4* using
>>> a *"Merge-on-Read" (MoR) approach for metadata*. This approach involves
>>> transitioning the schemas field from "Full Snapshots" to a sequence of
>>> *Base Schemas* (type full) and *Schema Deltas* (type delta) that store
>>> differential mutations relative to a base ID.
>>>
>>> *Key Goals:*
>>>
>>>    - Achieve a *99.4% reduction in the size of schema-related metadata*.
>>>    - Drastically lower the storage and IO requirements for metadata.json.
>>>    - Accelerate query planning by reducing the JSON payload size.
>>>    - Preserve self-containment by keeping the schema in the metadata
>>>    file, avoiding external sidecar files.
>>>
>>> The full proposal, including the flat resolution model (no delta
>>> chaining), the defined set of atomic delta operations (add, update,
>>> delete), and the lifecycle/compaction mechanics, is available for
>>> review:
>>>
>>> https://s.apache.org/iceberg-delta-schemas
>>>
>>> I look forward to your feedback and discussion on the dev list.
>>>
>>> Thanks
>>> Talat
>>>
>>
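
For readers following the thread, the flat resolution model summarized in the
proposal (each delta applies directly to a full base schema, with no delta
chaining, using add, update, and delete operations) might look roughly like
the sketch below. The JSON keys used here ("type", "base-schema-id",
"operations", "op", "field", "field-id") are hypothetical placeholders, not
the names from the proposal document, and treating "update" as a merge of
field attributes is also an assumption.

```python
from copy import deepcopy


def resolve_schema(schemas, schema_id):
    """Resolve a schema entry to its full field list (hypothetical layout)."""
    by_id = {s["schema-id"]: s for s in schemas}
    entry = by_id[schema_id]
    if entry["type"] == "full":
        return deepcopy(entry["fields"])

    # Flat resolution: a delta must point straight at a full base schema.
    base = by_id[entry["base-schema-id"]]
    if base["type"] != "full":
        raise ValueError("delta must reference a full base schema (no chaining)")

    # Key fields by field ID; dicts keep insertion order, so added fields go last.
    fields = {f["id"]: deepcopy(f) for f in base["fields"]}
    for op in entry["operations"]:
        kind = op["op"]
        if kind == "add":
            field = op["field"]
            if field["id"] in fields:
                raise ValueError(f"field {field['id']} already exists")
            fields[field["id"]] = deepcopy(field)
        elif kind == "update":
            # Assumed semantics: merge the given attributes into the existing field.
            fields[op["field"]["id"]].update(deepcopy(op["field"]))
        elif kind == "delete":
            del fields[op["field-id"]]
        else:
            raise ValueError(f"unknown delta operation: {kind}")
    return list(fields.values())


schemas = [
    {"schema-id": 0, "type": "full", "fields": [
        {"id": 1, "name": "id", "type": "long"},
        {"id": 2, "name": "ts", "type": "timestamp"},
    ]},
    {"schema-id": 1, "type": "delta", "base-schema-id": 0, "operations": [
        {"op": "add", "field": {"id": 3, "name": "region", "type": "string"}},
        {"op": "update", "field": {"id": 1, "doc": "primary key"}},
        {"op": "delete", "field-id": 2},
    ]},
]

resolved = resolve_schema(schemas, 1)
print([f["name"] for f in resolved])
```

Because every delta resolves against exactly one full schema, a reader never
walks a chain of deltas; the cost is that each delta grows until a new base
is promoted, which is the "base" schema promotion question raised earlier in
the thread.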
