Hi Xianjin,

Happy to learn from your experience in supporting backfill use-cases. Please feel free to review the proposal and add your comments. I will wait a couple more days to make sure everyone has a chance to review it.
~ Anurag

On Tue, Jan 27, 2026 at 6:42 AM Xianjin Ye <[email protected]> wrote:

> Hi Anurag and Peter,
>
> It’s great to see that the partial column update has gained so much interest in the community. I internally built a BackfillColumns action to efficiently backfill columns (by writing only the partial columns and copying the binary data of the other columns into a new DataFile). The speedup can be 10x for wide tables, but the write amplification is still there. I would be happy to collaborate on the work and eliminate the write amplification.
>
> On 2026/01/27 10:12:54 Péter Váry wrote:
> > Hi Anurag,
> >
> > It’s great to see how much interest there is in the community around this potential new feature. Gábor and I have actually submitted an Iceberg Summit talk proposal on this topic, and we would be very happy to collaborate on the work. I was mainly waiting for the File Format API to be finalized, as I believe this feature should build on top of it.
> >
> > For reference, our related work includes:
> >
> > - *Dev list thread:* https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
> > - *Proposal document:* https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww (not shared widely yet)
> > - *Performance testing PR for readers and writers:* https://github.com/apache/iceberg/pull/13306
> >
> > During earlier discussions about possible metadata changes, another option came up that hasn’t been documented yet: separating planner metadata from reader metadata. Since the planner does not need to know about the actual files, we could store the file composition in a separate file (potentially a Puffin file). This file could hold the column_files metadata, while the manifest would reference the Puffin file and blob position instead of the data filename. This approach has the advantage of keeping the existing metadata largely intact, and it could also give us a natural place later to add file-level indexes or Bloom filters for use during reads or secondary filtering. The downsides are the additional files and the increased complexity of identifying files that are no longer referenced by the table, so this may not be an ideal solution.
> >
> > I do have some concerns about the MoR metadata proposal described in the document. At first glance, it seems to complicate distributed planning, as all entries for a given file would need to be collected and merged to provide the information required by both the planner and the reader. Additionally, when a new column is added or updated, we would still need to add a new metadata entry for every existing data file. If we immediately write out the merged metadata, the total number of entries remains the same. The main benefit is avoiding rewriting statistics, which can be significant, but this comes at the cost of increased planning complexity. If we choose to store the merged statistics in the column_families entry, I don’t see much benefit in excluding the rest of the metadata, especially since including it would simplify the planning process.
> >
> > As Anton already pointed out, we should also discuss how this change would affect split handling, particularly how to avoid double reads when row groups are not aligned between the original data files and the new column files.
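To make that alignment concern concrete, here is a minimal, self-contained sketch of the check a planner would need before it can pair the row groups of a base data file with those of a separately written column file. The class name and numbers are illustrative only, not something from the proposal.

// Sketch: given per-row-group row counts for a base data file and for a
// separately written column file, check whether their row-group boundaries
// land on the same row offsets. When they do, a planner can pair row groups
// one-to-one inside a split; when they do not, a split may cover only part of
// a row group in the column file, which is where double reads can creep in.
import java.util.Arrays;

public class RowGroupAlignmentSketch {

  static boolean aligned(long[] baseRowGroupRowCounts, long[] columnFileRowGroupRowCounts) {
    // Identical row-count sequences imply identical row-group boundaries.
    return Arrays.equals(baseRowGroupRowCounts, columnFileRowGroupRowCounts);
  }

  public static void main(String[] args) {
    long[] base = {100_000, 100_000, 50_000};
    long[] alignedColumnFile = {100_000, 100_000, 50_000};
    long[] misalignedColumnFile = {120_000, 80_000, 50_000};

    System.out.println(aligned(base, alignedColumnFile));    // true: row groups map 1:1
    System.out.println(aligned(base, misalignedColumnFile)); // false: some split would
                                                             // re-read part of a row group
  }
}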
> > Finally, I’d like to see some discussion around the Java API implications: in particular, what API changes are required and how SQL engines would perform updates. Since the new column files must have the same number of rows as the original data files, with a strict one-to-one relationship, SQL engines would need access to the source filename, position, and deletion status in the DataFrame in order to generate the new files. This is more involved than a simple update and deserves some explicit consideration.
> >
> > Looking forward to your thoughts.
> >
> > Best regards,
> > Peter
> >
> > On Tue, Jan 27, 2026, 03:58 Anurag Mantripragada <[email protected]> wrote:
> >
> > > Thanks Anton and others, for providing some initial feedback. I will address all your comments soon.
> > >
> > > On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]> wrote:
> > >
> > >> I had a chance to see the proposal before it landed, and I think it is a cool idea; both presented approaches would likely work. I am looking forward to discussing the tradeoffs and would encourage everyone to push/polish each approach to see which issues can be mitigated and which are fundamental.
> > >>
> > >> [1] Iceberg-native approach: better visibility into column files from the metadata, potentially better concurrency for non-overlapping column updates, no dependency on Parquet.
> > >> [2] Parquet-native approach: almost no changes to the table format metadata beyond tracking of base files.
> > >>
> > >> I think [1] sounds a bit better on paper, but I am worried about the complexity in writers and readers (especially around keeping row groups aligned and split planning). It would be great to cover this in detail in the proposal.
> > >>
> > >> On Mon, Jan 26, 2026 at 09:00 Anurag Mantripragada <[email protected]> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> "Wide tables" with thousands of columns present significant challenges for AI/ML workloads, particularly when only a subset of columns needs to be added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR) operations in Iceberg apply at the row level, which leads to substantial write amplification in scenarios such as:
> > >>>
> > >>> - Feature Backfilling & Column Updates: Adding new feature columns (e.g., model embeddings) to petabyte-scale tables.
> > >>> - Model Score Updates: Refreshing prediction scores after retraining.
> > >>> - Embedding Refresh: Updating vector embeddings, which currently triggers a rewrite of the entire row.
> > >>> - Incremental Feature Computation: Daily updates to a small fraction of features in wide tables.
> > >>>
> > >>> With the Iceberg V4 proposal introducing single-file commits and column stats improvements, this is an ideal time to address column-level updates to better support these use cases.
> > >>>
> > >>> I have drafted a proposal that explores both table-format enhancements and file-format (Parquet) changes to enable more efficient updates.
> > >>>
> > >>> Proposal Details:
> > >>> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
> > >>> - Design Document: Efficient Column Updates in Iceberg <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
> > >>>
> > >>> Next Steps:
> > >>> I plan to create POCs to benchmark the approaches described in the document.
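The write amplification that these benchmarks would measure is easy to put in rough numbers. A back-of-the-envelope sketch follows; every size below is invented for illustration and is not a benchmark result.

// Rough arithmetic only: compares the bytes rewritten by a row-level
// copy-on-write update of a single column against writing standalone column
// files, for a hypothetical wide table.
public class WriteAmplificationSketch {
  public static void main(String[] args) {
    int totalColumns = 2000;                        // hypothetical wide table
    long bytesPerColumnPerFile = 1L * 1024 * 1024;  // ~1 MiB of data per column per file
    long dataFiles = 10_000;

    long changedBytes = bytesPerColumnPerFile * dataFiles;                 // one column updated
    long cowRewrite   = bytesPerColumnPerFile * totalColumns * dataFiles;  // whole rows rewritten

    System.out.printf("logical change: %d GiB%n", changedBytes >> 30);
    System.out.printf("row-level COW rewrite: %d GiB (%.0fx amplification)%n",
        cowRewrite >> 30, (double) cowRewrite / changedBytes);
    // Writing only the new column files keeps the physical write close to the
    // logical change, which is the gap the proposal aims to close.
  }
}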
> > >>> Please review the proposal and share your feedback.
> > >>>
> > >>> Thanks,
> > >>> Anurag
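On Peter's note that SQL engines would need the source filename, position, and deletion status to produce one column file per data file, here is a minimal sketch of how that input could be assembled. It assumes Spark with the Iceberg runtime on the classpath, which exposes _file and _pos as metadata columns on reads; the table and column names are invented, and how deletion status is surfaced is left open here, as it is in the thread.

// Sketch of gathering per-row lineage (source file and position) alongside the
// columns being recomputed, so new column files can be written with the same
// row count and order as the data files they extend.
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ColumnBackfillInputSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("column-backfill-input-sketch")
        .getOrCreate();

    // Read the rows to recompute together with their source file and row position.
    Dataset<Row> input = spark.read()
        .table("db.wide_table")
        .select(col("_file"), col("_pos"), col("user_id"), col("embedding"));

    // A writer for the proposed column files would group by _file, order by _pos,
    // and emit one column file per source data file (not shown here).
    input.show(5, false);
  }
}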
