Hi all,

This design <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0> will be discussed tomorrow in a dedicated sync.
Efficient column updates sync
Tuesday, February 10 · 9:00 – 10:00am
Time zone: America/Los_Angeles
Google Meet joining info
Video call link: https://meet.google.com/xsd-exug-tcd

~ Anurag

On Fri, Feb 6, 2026 at 8:30 AM Anurag Mantripragada <[email protected]> wrote:

> Hi Gabor,
>
> Thanks for the detailed example.
>
> I agree with Steven that Option 2 seems reasonable. I will add a section to the design doc regarding equality delete handling, and we can discuss this further during our meeting on Tuesday.
>
> ~Anurag
>
> On Fri, Feb 6, 2026 at 7:08 AM Steven Wu <[email protected]> wrote:
>
>> > 1) When deleting with eq-deletes: If there is a column update on the equality-field ID we use for the delete, reject the deletion
>> > 2) When adding a column update on a column that is part of the equality field IDs in some delete, we reject the column update
>>
>> Gabor, this is a good scenario. The 2nd option makes sense to me, since equality ids are like primary key fields. If we have the 2nd rule enforced, the first option is not applicable anymore.
>>
>> On Fri, Feb 6, 2026 at 3:13 AM Gábor Kaszab <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> Thank you for the proposal, Anurag! I made a pass recently and I think there is some interference between column updates and equality deletes. Let me describe it below:
>>>
>>> Steps:
>>>
>>> CREATE TABLE tbl (int a, int b);
>>> INSERT INTO tbl VALUES (1, 11), (2, 22);  -- creates the base data file
>>> DELETE FROM tbl WHERE b=11;               -- creates an equality delete file
>>> UPDATE tbl SET b=11;                      -- writes a column update
>>>
>>> SELECT * FROM tbl;
>>>
>>> Expected result: (2, 11)
>>>
>>> Data and metadata created after the above steps:
>>>
>>> Base file
>>> (1, 11), (2, 22)
>>> seqnum=1
>>>
>>> EQ-delete
>>> b=11
>>> seqnum=2
>>>
>>> Column update
>>> Field ids: [field_id_for_col_b]
>>> seqnum=3
>>> Data file content: (dummy_value), (11)
>>>
>>> Read steps:
>>>
>>> 1. Stitch the base file with the column update in the reader:
>>>    Rows: (1, dummy_value), (2, 11) (Note: dummy_value can be either null or 11; see the proposal for more details)
>>>    Seqnum for base file=1
>>>    Seqnum for column update=3
>>> 2. Apply the eq-delete b=11 (seqnum=2) on the stitched result
>>> 3. The query result depends on which seqnum we carry forward to compare with the eq-delete's seqnum, but it is not correct in either case:
>>>    1. Use the seqnum from the base file: we get either an empty result if 'dummy_value' is 11, or (1, null) otherwise
>>>    2. Use the seqnum from the last update file: no rows are deleted, and the result set is (1, dummy_value), (2, 11)
>>>
>>> Problem:
>>>
>>> The eq-delete should be applied partway through applying the column updates to the base file, based on sequence number, during the stitching process. If I'm not mistaken, this is not feasible with the way readers work.
>>>
>>> Proposal:
>>>
>>> Don't allow equality deletes together with column updates.
>>>
>>> 1) When deleting with eq-deletes: If there is a column update on the equality-field ID we use for the delete, reject the deletion
>>> 2) When adding a column update on a column that is part of the equality field IDs in some delete, we reject the column update
>>>
>>> Alternatively, column updates could be controlled by an immutable table property, and eq-deletes could be rejected when the property indicates that column updates are turned on for the table.
>>>
>>> Let me know what you think!
>>>
>>> Best Regards,
>>> Gabor
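To make the ambiguity concrete, here is a minimal, self-contained sketch (plain Java; hypothetical names, not Iceberg reader code) of the two sequence-number choices for the example above and the incorrect result each one produces:

import java.util.List;
import java.util.Objects;

public class EqDeleteSeqDemo {
  record Row(int a, Integer b) {}

  // An equality delete is assumed to apply to rows whose data sequence number is lower than
  // the delete's sequence number and whose current value matches the delete predicate.
  static List<Row> applyEqDelete(List<Row> rows, long rowSeq, long deleteSeq, Integer deletedB) {
    return rows.stream()
        .filter(r -> !(rowSeq < deleteSeq && Objects.equals(r.b(), deletedB)))
        .toList();
  }

  public static void main(String[] args) {
    Integer dummy = null;  // placeholder value stored for the already-deleted row
    // Base file (seqnum 1) stitched with the column update (seqnum 3).
    List<Row> stitched = List.of(new Row(1, dummy), new Row(2, 11));

    // Option 1: carry the base file's sequence number (1) forward.
    // The delete (seqnum 2) applies to the stitched values and wrongly removes (2, 11).
    System.out.println(applyEqDelete(stitched, 1, 2, 11));  // [Row[a=1, b=null]]

    // Option 2: carry the column update's sequence number (3) forward.
    // The delete is now "older" than every row, so nothing is removed at all.
    System.out.println(applyEqDelete(stitched, 3, 2, 11));  // [Row[a=1, b=null], Row[a=2, b=11]]
  }
}

Under that rule, producing the expected result (2, 11) would require applying the delete against the pre-update value of b from the base file before stitching in the update, which is exactly the mid-stitch application Gabor points out is not feasible with the way readers work today.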
>>>
>>> Anurag Mantripragada <[email protected]> wrote (on Wed, Jan 28, 2026, 3:31):
>>>
>>>> Thank you everyone for the initial review comments. It is exciting to see so much interest in this proposal.
>>>>
>>>> I am currently reviewing and responding to each comment. The general themes of the feedback so far include:
>>>> - Including partial updates (column updates on a subset of rows in a table).
>>>> - Adding details on how SQL engines will write the update files.
>>>> - Adding details on split planning and row alignment for update files.
>>>>
>>>> I will think through these points and update the design accordingly.
>>>>
>>>> Best,
>>>> Anurag
>>>>
>>>> On Tue, Jan 27, 2026 at 6:25 PM Anurag Mantripragada <[email protected]> wrote:
>>>>
>>>>> Hi Xianjin,
>>>>>
>>>>> Happy to learn from your experience in supporting backfill use cases. Please feel free to review the proposal and add your comments. I will wait a couple more days to ensure everyone has a chance to review the proposal.
>>>>>
>>>>> ~ Anurag
>>>>>
>>>>> On Tue, Jan 27, 2026 at 6:42 AM Xianjin Ye <[email protected]> wrote:
>>>>>
>>>>>> Hi Anurag and Peter,
>>>>>>
>>>>>> It’s great to see that the partial column update has gained so much interest in the community. I internally built a BackfillColumns action to backfill columns efficiently (by writing only the partial columns and copying the binary data of the other columns into a new DataFile). The speedup could be 10x for wide tables, but the write amplification is still there. I would be happy to collaborate on the work and eliminate the write amplification.
>>>>>>
>>>>>> On 2026/01/27 10:12:54 Péter Váry wrote:
>>>>>> > Hi Anurag,
>>>>>> >
>>>>>> > It’s great to see how much interest there is in the community around this potential new feature. Gábor and I have actually submitted an Iceberg Summit talk proposal on this topic, and we would be very happy to collaborate on the work. I was mainly waiting for the File Format API to be finalized, as I believe this feature should build on top of it.
>>>>>> >
>>>>>> > For reference, our related work includes:
>>>>>> > - *Dev list thread:* https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
>>>>>> > - *Proposal document:* https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww (not shared widely yet)
>>>>>> > - *Performance testing PR for readers and writers:* https://github.com/apache/iceberg/pull/13306
>>>>>> >
>>>>>> > During earlier discussions about possible metadata changes, another option came up that hasn’t been documented yet: separating planner metadata from reader metadata. Since the planner does not need to know about the actual files, we could store the file composition in a separate file (potentially a Puffin file). This file could hold the column_files metadata, while the manifest would reference the Puffin file and blob position instead of the data filename.
>>>>>> > This approach has the advantage of keeping the existing metadata largely intact, and it could also give us a natural place later to add file-level indexes or Bloom filters for use during reads or secondary filtering.
>>>>>> > The downsides are the additional files and the increased complexity of identifying files that are no longer referenced by the table, so this may not be an ideal solution.
>>>>>> >
>>>>>> > I do have some concerns about the MoR metadata proposal described in the document. At first glance, it seems to complicate distributed planning, as all entries for a given file would need to be collected and merged to provide the information required by both the planner and the reader. Additionally, when a new column is added or updated, we would still need to add a new metadata entry for every existing data file. If we immediately write out the merged metadata, the total number of entries remains the same. The main benefit is avoiding rewriting statistics, which can be significant, but this comes at the cost of increased planning complexity. If we choose to store the merged statistics in the column_families entry, I don’t see much benefit in excluding the rest of the metadata, especially since including it would simplify the planning process.
>>>>>> >
>>>>>> > As Anton already pointed out, we should also discuss how this change would affect split handling, particularly how to avoid double reads when row groups are not aligned between the original data files and the new column files.
>>>>>> >
>>>>>> > Finally, I’d like to see some discussion around the Java API implications: in particular, what API changes are required and how SQL engines would perform updates. Since the new column files must have the same number of rows as the original data files, with a strict one-to-one relationship, SQL engines would need access to the source filename, position, and deletion status in the DataFrame in order to generate the new files. This is more involved than a simple update and deserves some explicit consideration.
>>>>>> >
>>>>>> > Looking forward to your thoughts.
>>>>>> > Best regards,
>>>>>> > Peter
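To make the one-to-one row alignment requirement above concrete, here is a small, self-contained sketch (plain Java; hypothetical names, not a proposed Iceberg API) of how an engine-side writer might lay out the values for one new column file so that they line up position-for-position with the base data file:

import java.util.ArrayList;
import java.util.List;

public class ColumnUpdateAlignmentSketch {
  // One source row as an engine might see it: base file name, position within that file,
  // whether the row is already masked by deletes, and the recomputed value for the column.
  record SourceRow(String sourceFile, long position, boolean deleted, Integer newValue) {}

  // Produces the values for one new column file, aligned one-to-one with its base data file.
  static List<Integer> columnFileValues(List<SourceRow> rowsForFile) {
    List<Integer> values = new ArrayList<>();
    long expectedPos = 0;
    for (SourceRow row : rowsForFile) {
      if (row.position() != expectedPos++) {
        throw new IllegalStateException("rows must cover every base-file position, in order");
      }
      // A deleted row still occupies its slot; a placeholder keeps the row counts identical.
      values.add(row.deleted() ? null : row.newValue());
    }
    return values;
  }

  public static void main(String[] args) {
    // Two rows from a base file "f1.parquet"; position 0 is already masked by a delete.
    List<SourceRow> rows = List.of(
        new SourceRow("f1.parquet", 0, true, 42),
        new SourceRow("f1.parquet", 1, false, 99));
    System.out.println(columnFileValues(rows)); // prints [null, 99]
  }
}

The only point of the sketch is the alignment rule: every base-file position, including positions already masked by deletes, must receive exactly one slot in the new column file, which is why the engine needs the source filename, position, and deletion status that Peter mentions.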
>>>>>> >
>>>>>> > On Tue, Jan 27, 2026, 03:58 Anurag Mantripragada <[email protected]> wrote:
>>>>>> >
>>>>>> > > Thanks Anton and others for providing some initial feedback. I will address all your comments soon.
>>>>>> > >
>>>>>> > > On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]> wrote:
>>>>>> > >
>>>>>> > >> I had a chance to see the proposal before it landed, and I think it is a cool idea and both presented approaches would likely work. I am looking forward to discussing the tradeoffs and would encourage everyone to push/polish each approach to see which issues can be mitigated and which are fundamental.
>>>>>> > >>
>>>>>> > >> [1] Iceberg-native approach: better visibility into column files from the metadata, potentially better concurrency for non-overlapping column updates, no dependency on Parquet.
>>>>>> > >> [2] Parquet-native approach: almost no changes to the table format metadata beyond tracking of base files.
>>>>>> > >>
>>>>>> > >> I think [1] sounds a bit better on paper, but I am worried about the complexity in writers and readers (especially around keeping row groups aligned and split planning). It would be great to cover this in detail in the proposal.
>>>>>> > >>
>>>>>> > >> On Mon, Jan 26, 2026 at 09:00, Anurag Mantripragada <[email protected]> wrote:
>>>>>> > >>
>>>>>> > >>> Hi all,
>>>>>> > >>>
>>>>>> > >>> "Wide tables" with thousands of columns present significant challenges for AI/ML workloads, particularly when only a subset of columns needs to be added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR) operations in Iceberg apply at the row level, which leads to substantial write amplification in scenarios such as:
>>>>>> > >>>
>>>>>> > >>> - Feature Backfilling & Column Updates: Adding new feature columns (e.g., model embeddings) to petabyte-scale tables.
>>>>>> > >>> - Model Score Updates: Refreshing prediction scores after retraining.
>>>>>> > >>> - Embedding Refresh: Updating vector embeddings, which currently triggers a rewrite of the entire row.
>>>>>> > >>> - Incremental Feature Computation: Daily updates to a small fraction of features in wide tables.
>>>>>> > >>>
>>>>>> > >>> With the Iceberg V4 proposal introducing single-file commits and column stats improvements, this is an ideal time to address column-level updates to better support these use cases.
>>>>>> > >>>
>>>>>> > >>> I have drafted a proposal that explores both table-format enhancements and file-format (Parquet) changes to enable more efficient updates.
>>>>>> > >>>
>>>>>> > >>> Proposal Details:
>>>>>> > >>> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
>>>>>> > >>> - Design Document: Efficient Column Updates in Iceberg <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
>>>>>> > >>>
>>>>>> > >>> Next Steps:
>>>>>> > >>> I plan to create POCs to benchmark the approaches described in the document.
>>>>>> > >>>
>>>>>> > >>> Please review the proposal and share your feedback.
>>>>>> > >>>
>>>>>> > >>> Thanks,
>>>>>> > >>> Anurag
