> 1) When deleting with eq-deletes: If there is a column update on the
equality-field ID we use for the delete, reject deletion
> 2) When adding a column update on a column that is part of the equality
field IDs in some delete, we reject the column update

Gabor, this is a good scenario. The 2nd option makes sense to me, since
equality IDs are like primary key fields. If we enforce the 2nd rule, the
first option is no longer needed.
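
To make the 2nd rule concrete, here is a minimal sketch of the kind of
pre-commit check a writer could run (validateColumnUpdateAllowed is a
hypothetical helper; the scan and DeleteFile APIs used below are existing
Iceberg APIs, and a real implementation would run this as part of commit
validation to cover concurrent operations):

  import java.io.IOException;
  import java.io.UncheckedIOException;
  import java.util.Collections;
  import java.util.Set;
  import org.apache.iceberg.DeleteFile;
  import org.apache.iceberg.FileContent;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.exceptions.ValidationException;
  import org.apache.iceberg.io.CloseableIterable;

  class ColumnUpdateValidation {
    // Rule 2 sketch: reject a column update if any updated field id is used as
    // an equality field id by a live equality delete file in the table.
    static void validateColumnUpdateAllowed(Table table, Set<Integer> updatedFieldIds) {
      try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
        for (FileScanTask task : tasks) {
          for (DeleteFile delete : task.deletes()) {
            if (delete.content() == FileContent.EQUALITY_DELETES
                && !Collections.disjoint(delete.equalityFieldIds(), updatedFieldIds)) {
              throw new ValidationException(
                  "Column update touches equality field ids %s of delete file %s",
                  delete.equalityFieldIds(), delete.path());
            }
          }
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }

The mirror-image check on the delete path would implement the 1st rule, which
indeed becomes redundant once the 2nd rule is enforced.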

On Fri, Feb 6, 2026 at 3:13 AM Gábor Kaszab <[email protected]> wrote:

> Hey,
>
> Thank you for the proposal, Anurag! I made a pass recently and I think
> there is some interference between column updates and equality deletes. Let
> me describe below:
>
> Steps:
>
> CREATE TABLE tbl (a int, b int);
>
> INSERT INTO tbl VALUES (1, 11), (2, 22);  -- creates the base data file
>
> DELETE FROM tbl WHERE b=11;               -- creates an equality delete file
>
> UPDATE tbl SET b=11;                      -- writes a column update
>
>
>
> SELECT * FROM tbl;
>
> Expected result:
>
> (2, 11)
>
>
>
> Data and metadata created after the above steps:
>
> Base file: (1, 11), (2, 22), seqnum=1
>
> EQ-delete: b=11, seqnum=2
>
> Column update: field ids [field_id_for_col_b], seqnum=3,
> data file content: (dummy_value), (11)
>
>
>
> Read steps:
>
>    1. Stitch base file with column updates in reader:
>
> Rows: (1,dummy_value), (2,11) (Note, dummy value can be either null, or
> 11, see the proposal for more details)
>
> Seqnum for base file=1
>
> Seqnum for column update=3
>
>    2. Apply eq-delete b=11, seqnum=2 on the stitched result
>    3. Query result depends on which seqnum we carry forward to compare
>    with the eq-delete's seqnum, but it's not correct in any of the cases
>       1. Use seqnum from base file: we get either an empty result if
>       'dummy_value' is 11 or we get (1, null) otherwise
>       2. Use seqnum from last update file: don't delete any rows, result
>       set is (1, dummy_value),(2,11)
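>
> (For reference: under the usual MoR rule, an equality delete applies to a row
> whose data sequence number is strictly less than the delete's sequence number.
> With the delete at seqnum=2, carrying the base file's seqnum gives 1 < 2, so
> the delete hits the already-stitched rows; carrying the update file's seqnum
> gives 3 < 2 = false, so it deletes nothing. Neither yields the expected
> (2, 11), because the delete logically belongs between the base file and the
> column update.)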
>
>
>
> Problem:
>
> Based on sequence numbers, the EQ-delete should be applied midway through
> stitching the column updates onto the base file. If I'm not mistaken, this is
> not feasible with the way readers currently work.
>
>
> Proposal:
>
> Don't allow equality deletes together with column updates.
>
>   1) When deleting with eq-deletes: If there is a column update on the
> equality-field ID we use for the delete, reject deletion
>
>   2) When adding a column update on a column that is part of the equality
> field IDs in some delete, we reject the column update
>
> Alternatively, column updates could be controlled by an (immutable) table
> property, and eq-deletes could be rejected if the property indicates that
> column updates are turned on for the table.
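>
> A rough sketch of that alternative check, with a hypothetical property name
> (PropertyUtil, Table and ValidationException are existing Iceberg classes;
> imports omitted):
>
>   // Hypothetical property name, not an existing Iceberg property. While it is
>   // set, engines would reject equality deletes against the table.
>   static void rejectEqDeletesIfColumnUpdatesEnabled(Table table) {
>     boolean columnUpdatesEnabled = PropertyUtil.propertyAsBoolean(
>         table.properties(), "write.column-updates.enabled", false);
>     if (columnUpdatesEnabled) {
>       throw new ValidationException(
>           "Equality deletes are not allowed when column updates are enabled");
>     }
>   }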
>
>
> Let me know what you think!
>
> Best Regards,
>
> Gabor
>
> Anurag Mantripragada <[email protected]> ezt írta (időpont: 2026.
> jan. 28., Sze, 3:31):
>
>> Thank you everyone for the initial review comments. It is exciting to see
>> so much interest in this proposal.
>>
>> I am currently reviewing and responding to each comment. The general
>> themes of the feedback so far include:
>> - Including partial updates (column updates on a subset of rows in a
>> table).
>> - Adding details on how SQL engines will write the update files.
>> - Adding details on split planning and row alignment for update files.
>>
>> I will think through these points and update the design accordingly.
>>
>> Best
>> Anurag
>>
>> On Tue, Jan 27, 2026 at 6:25 PM Anurag Mantripragada <
>> [email protected]> wrote:
>>
>>> Hi Xianjin,
>>>
>>> Happy to learn from your experience in supporting backfill use-cases.
>>> Please feel free to review the proposal and add your comments. I will wait
>>> for a couple of days more to ensure everyone has a chance to review the
>>> proposal.
>>>
>>> ~ Anurag
>>>
>>> On Tue, Jan 27, 2026 at 6:42 AM Xianjin Ye <[email protected]> wrote:
>>>
>>>> Hi Anurag and Peter,
>>>>
>>>> It’s great to see that partial column updates have gained so much interest
>>>> in the community. I internally built a BackfillColumns action to
>>>> efficiently backfill columns (by writing only the partial columns and copying
>>>> the binary data of the other columns into a new DataFile). The speedup can be
>>>> 10x for wide tables, but the write amplification is still there. I would be
>>>> happy to collaborate on this work and eliminate the write amplification.
>>>>
>>>> On 2026/01/27 10:12:54 Péter Váry wrote:
>>>> > Hi Anurag,
>>>> >
>>>> > It’s great to see how much interest there is in the community around
>>>> this
>>>> > potential new feature. Gábor and I have actually submitted an Iceberg
>>>> > Summit talk proposal on this topic, and we would be very happy to
>>>> > collaborate on the work. I was mainly waiting for the File Format API
>>>> to be
>>>> > finalized, as I believe this feature should build on top of it.
>>>> >
>>>> > For reference, our related work includes:
>>>> >
>>>> >    - *Dev list thread:*
>>>> >    https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
>>>> >    - *Proposal document:*
>>>> >
>>>> https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww
>>>> >    (not shared widely yet)
>>>> >    - *Performance testing PR for readers and writers:*
>>>> >    https://github.com/apache/iceberg/pull/13306
>>>> >
>>>> > During earlier discussions about possible metadata changes, another
>>>> option
>>>> > came up that hasn’t been documented yet: separating planner metadata
>>>> from
>>>> > reader metadata. Since the planner does not need to know about the
>>>> actual
>>>> > files, we could store the file composition in a separate file
>>>> (potentially
>>>> > a Puffin file). This file could hold the column_files metadata, while
>>>> the
>>>> > manifest would reference the Puffin file and blob position instead of
>>>> the
>>>> > data filename.
>>>> > This approach has the advantage of keeping the existing metadata
>>>> largely
>>>> > intact, and it could also give us a natural place later to add
>>>> file-level
>>>> > indexes or Bloom filters for use during reads or secondary filtering.
>>>> The
>>>> > downsides are the additional files and the increased complexity of
>>>> > identifying files that are no longer referenced by the table, so this
>>>> may
>>>> > not be an ideal solution.
>>>> >
>>>> > I do have some concerns about the MoR metadata proposal described in
>>>> the
>>>> > document. At first glance, it seems to complicate distributed
>>>> planning, as
>>>> > all entries for a given file would need to be collected and merged to
>>>> > provide the information required by both the planner and the reader.
>>>> > Additionally, when a new column is added or updated, we would still
>>>> need to
>>>> > add a new metadata entry for every existing data file. If we
>>>> immediately
>>>> > write out the merged metadata, the total number of entries remains the
>>>> > same. The main benefit is avoiding rewriting statistics, which can be
>>>> > significant, but this comes at the cost of increased planning
>>>> complexity.
>>>> > If we choose to store the merged statistics in the column_families
>>>> entry, I
>>>> > don’t see much benefit in excluding the rest of the metadata,
>>>> especially
>>>> > since including it would simplify the planning process.
>>>> >
>>>> > As Anton already pointed out, we should also discuss how this change
>>>> would
>>>> > affect split handling, particularly how to avoid double reads when row
>>>> > groups are not aligned between the original data files and the new
>>>> column
>>>> > files.
>>>> >
>>>> > Finally, I’d like to see some discussion around the Java API
>>>> implications.
>>>> > In particular, what API changes are required, and how SQL engines
>>>> would
>>>> > perform updates. Since the new column files must have the same number
>>>> of
>>>> > rows as the original data files, with a strict one-to-one
>>>> relationship, SQL
>>>> > engines would need access to the source filename, position, and
>>>> deletion
>>>> > status in the DataFrame in order to generate the new files. This is
>>>> more
>>>> > involved than a simple update and deserves some explicit
>>>> consideration.
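>>>> >
>>>> > For illustration, the Spark integration already exposes the _file, _pos and
>>>> > _deleted metadata columns, which is roughly the information an engine would
>>>> > need in order to produce row-aligned column files (a sketch, not a proposed
>>>> > API; Spark SQL imports omitted):
>>>> >
>>>> >   // Read source rows with their file, position and deleted status so that
>>>> >   // new column files can be written 1:1 against each source data file.
>>>> >   Dataset<Row> source = spark.read().table("db.tbl")
>>>> >       .select(col("_file"), col("_pos"), col("_deleted"), col("a"));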
>>>> >
>>>> > Looking forward to your thoughts.
>>>> > Best regards,
>>>> > Peter
>>>> >
>>>> > On Tue, Jan 27, 2026, 03:58 Anurag Mantripragada <
>>>> [email protected]>
>>>> > wrote:
>>>> >
>>>> > > Thanks Anton and others, for providing some initial feedback. I will
>>>> > > address all your comments soon.
>>>> > >
>>>> > > On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <
>>>> [email protected]>
>>>> > > wrote:
>>>> > >
>>>> > >> I had a chance to see the proposal before it landed and I think it
>>>> is a
>>>> > >> cool idea and both presented approaches would likely work. I am
>>>> looking
>>>> > >> forward to discussing the tradeoffs and would encourage everyone to
>>>> > >> push/polish each approach to see what issues can be mitigated and
>>>> what are
>>>> > >> fundamental.
>>>> > >>
>>>> > >> [1] Iceberg-native approach: better visibility into column files
>>>> from the
>>>> > >> metadata, potentially better concurrency for non-overlapping column
>>>> > >> updates, no dep on Parquet.
>>>> > >> [2] Parquet-native approach: almost no changes to the table format
>>>> > >> metadata beyond tracking of base files.
>>>> > >>
>>>> > >> I think [1] sounds a bit better on paper but I am worried about the
>>>> > >> complexity in writers and readers (especially around keeping row
>>>> groups
>>>> > >> aligned and split planning). It would be great to cover this in
>>>> detail in
>>>> > >> the proposal.
>>>> > >>
>>>> > >> On Mon, Jan 26, 2026 at 09:00 Anurag Mantripragada <
>>>> > >> [email protected]> wrote:
>>>> > >>
>>>> > >>> Hi all,
>>>> > >>>
>>>> > >>> "Wide tables" with thousands of columns present significant
>>>> challenges
>>>> > >>> for AI/ML workloads, particularly when only a subset of columns
>>>> needs to be
>>>> > >>> added or updated. Current Copy-on-Write (COW) and Merge-on-Read
>>>> (MOR)
>>>> > >>> operations in Iceberg apply at the row level, which leads to
>>>> substantial
>>>> > >>> write amplification in scenarios such as:
>>>> > >>>
>>>> > >>>    - Feature Backfilling & Column Updates: Adding new feature
>>>> columns
>>>> > >>>    (e.g., model embeddings) to petabyte-scale tables.
>>>> > >>>    - Model Score Updates: Refresh prediction scores after
>>>> retraining.
>>>> > >>>    - Embedding Refresh: Updating vector embeddings, which
>>>> currently
>>>> > >>>    triggers a rewrite of the entire row.
>>>> > >>>    - Incremental Feature Computation: Daily updates to a small
>>>> fraction
>>>> > >>>    of features in wide tables.
>>>> > >>>
>>>> > >>> With the Iceberg V4 proposal introducing single-file commits and
>>>> column
>>>> > >>> stats improvements, this is an ideal time to address column-level
>>>> updates
>>>> > >>> to better support these use cases.
>>>> > >>>
>>>> > >>> I have drafted a proposal that explores both table-format
>>>> enhancements
>>>> > >>> and file-format (Parquet) changes to enable more efficient
>>>> updates.
>>>> > >>>
>>>> > >>> Proposal Details:
>>>> > >>> - GitHub Issue: #15146 <
>>>> https://github.com/apache/iceberg/issues/15146>
>>>> > >>> - Design Document: Efficient Column Updates in Iceberg
>>>> > >>> <
>>>> https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0
>>>> >
>>>> > >>>
>>>> > >>> Next Steps:
>>>> > >>> I plan to create POCs to benchmark the approaches described in the
>>>> > >>> document.
>>>> > >>>
>>>> > >>> Please review the proposal and share your feedback.
>>>> > >>>
>>>> > >>> Thanks,
>>>> > >>> Anurag
>>>> > >>>
>>>> > >>
>>>> >
>>>>
>>>
