Hi Anurag and Peter,

It’s great to see how much interest the partial column update has drawn in the community. I have built an internal BackfillColumns action to backfill columns efficiently (it writes only the partial columns and copies the binary data of the other columns into a new DataFile). The speedup can be around 10x for wide tables, but the write amplification is still there. I would be happy to collaborate on this work and help eliminate the write amplification.
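To make this a bit more concrete, here is a rough sketch of the shape such an action could take. All names and signatures below are illustrative only, not the actual internal code, but the core idea is the same: recompute only the backfilled columns and byte-copy the untouched column chunks into the replacement file.

```java
import java.util.List;
import java.util.Map;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expression;

/**
 * Illustrative sketch only (hypothetical names and signatures): an action that
 * rewrites each matching data file by writing fresh chunks for the backfilled
 * columns and byte-copying the remaining column chunks into a new DataFile,
 * then swaps the old files for the new ones in a single commit.
 */
public interface BackfillColumnsSketch {

  /** Limit the rewrite to data files whose rows match the given filter. */
  BackfillColumnsSketch filter(Expression rowFilter);

  /** Register a column to backfill together with its per-file value source. */
  BackfillColumnsSketch backfillColumn(String columnName, ColumnValueSource source);

  /** Run the rewrite and return the original-to-rewritten file mapping. */
  Map<DataFile, DataFile> execute();

  /** Supplies new values for one column, positionally aligned with a source file. */
  interface ColumnValueSource {
    /** Must return exactly one value per row of {@code original}, in row order. */
    List<Object> valuesFor(DataFile original);
  }

  /** Hypothetical entry point; a real action would hang off a table-level API. */
  static BackfillColumnsSketch forTable(Table table) {
    throw new UnsupportedOperationException("illustrative sketch only");
  }
}
```

Even with the chunk-level copy, every byte of the untouched columns is still rewritten once per backfill, which is exactly the write amplification I would like us to eliminate.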
On 2026/01/27 10:12:54 Péter Váry wrote:
> Hi Anurag,
>
> It’s great to see how much interest there is in the community around this potential new feature. Gábor and I have actually submitted an Iceberg Summit talk proposal on this topic, and we would be very happy to collaborate on the work. I was mainly waiting for the File Format API to be finalized, as I believe this feature should build on top of it.
>
> For reference, our related work includes:
>
> - *Dev list thread:* https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
> - *Proposal document:* https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww (not shared widely yet)
> - *Performance testing PR for readers and writers:* https://github.com/apache/iceberg/pull/13306
>
> During earlier discussions about possible metadata changes, another option came up that hasn’t been documented yet: separating planner metadata from reader metadata. Since the planner does not need to know about the actual files, we could store the file composition in a separate file (potentially a Puffin file). This file could hold the column_files metadata, while the manifest would reference the Puffin file and blob position instead of the data filename.
> This approach has the advantage of keeping the existing metadata largely intact, and it could also give us a natural place later to add file-level indexes or Bloom filters for use during reads or secondary filtering. The downsides are the additional files and the increased complexity of identifying files that are no longer referenced by the table, so this may not be an ideal solution.
>
> I do have some concerns about the MoR metadata proposal described in the document. At first glance, it seems to complicate distributed planning, as all entries for a given file would need to be collected and merged to provide the information required by both the planner and the reader. Additionally, when a new column is added or updated, we would still need to add a new metadata entry for every existing data file. If we immediately write out the merged metadata, the total number of entries remains the same. The main benefit is avoiding rewriting statistics, which can be significant, but this comes at the cost of increased planning complexity. If we choose to store the merged statistics in the column_families entry, I don’t see much benefit in excluding the rest of the metadata, especially since including it would simplify the planning process.
>
> As Anton already pointed out, we should also discuss how this change would affect split handling, particularly how to avoid double reads when row groups are not aligned between the original data files and the new column files.
>
> Finally, I’d like to see some discussion around the Java API implications. In particular, what API changes are required, and how SQL engines would perform updates. Since the new column files must have the same number of rows as the original data files, with a strict one-to-one relationship, SQL engines would need access to the source filename, position, and deletion status in the DataFrame in order to generate the new files. This is more involved than a simple update and deserves some explicit consideration.
>
> Looking forward to your thoughts.
> Best regards,
> Peter
>
> On Tue, Jan 27, 2026, 03:58 Anurag Mantripragada <[email protected]> wrote:
>
> > Thanks Anton and others, for providing some initial feedback.
> > I will address all your comments soon.
> >
> > On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]> wrote:
> >
> >> I had a chance to see the proposal before it landed and I think it is a cool idea and both presented approaches would likely work. I am looking forward to discussing the tradeoffs and would encourage everyone to push/polish each approach to see what issues can be mitigated and what are fundamental.
> >>
> >> [1] Iceberg-native approach: better visibility into column files from the metadata, potentially better concurrency for non-overlapping column updates, no dep on Parquet.
> >> [2] Parquet-native approach: almost no changes to the table format metadata beyond tracking of base files.
> >>
> >> I think [1] sounds a bit better on paper but I am worried about the complexity in writers and readers (especially around keeping row groups aligned and split planning). It would be great to cover this in detail in the proposal.
> >>
> >> On Mon, Jan 26, 2026 at 09:00 Anurag Mantripragada <[email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> "Wide tables" with thousands of columns present significant challenges for AI/ML workloads, particularly when only a subset of columns needs to be added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR) operations in Iceberg apply at the row level, which leads to substantial write amplification in scenarios such as:
> >>>
> >>> - Feature Backfilling & Column Updates: Adding new feature columns (e.g., model embeddings) to petabyte-scale tables.
> >>> - Model Score Updates: Refresh prediction scores after retraining.
> >>> - Embedding Refresh: Updating vector embeddings, which currently triggers a rewrite of the entire row.
> >>> - Incremental Feature Computation: Daily updates to a small fraction of features in wide tables.
> >>>
> >>> With the Iceberg V4 proposal introducing single-file commits and column stats improvements, this is an ideal time to address column-level updates to better support these use cases.
> >>>
> >>> I have drafted a proposal that explores both table-format enhancements and file-format (Parquet) changes to enable more efficient updates.
> >>>
> >>> Proposal Details:
> >>> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
> >>> - Design Document: Efficient Column Updates in Iceberg <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
> >>>
> >>> Next Steps:
> >>> I plan to create POCs to benchmark the approaches described in the document.
> >>>
> >>> Please review the proposal and share your feedback.
> >>>
> >>> Thanks,
> >>> Anurag
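
Peter, coming back to your point about the Java API and how SQL engines would produce the new column files: below is a minimal sketch of how a Spark job could assemble the per-file, position-aligned input today, using Iceberg's _file and _pos metadata columns. The table name and the transform are made up, delete handling is left out, and the per-file column writer is exactly the API the proposal still needs to define.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ColumnBackfillInputSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("column-backfill-input-sketch")
        .getOrCreate();

    // Iceberg exposes the source data file path and the row position as the
    // _file and _pos metadata columns, so the engine can keep track of which
    // original file (and which row in it) every new value belongs to.
    Dataset<Row> source = spark.table("db.wide_table")
        .select(col("_file"), col("_pos"), col("features"));

    // Compute the new column (placeholder transform here), then keep the rows
    // grouped by source file and ordered by position so each group lines up
    // one-to-one with the rows of its original data file.
    Dataset<Row> backfillInput = source
        .withColumn("embedding_v2", col("features"))
        .repartition(col("_file"))
        .sortWithinPartitions(col("_pos"));

    // A real implementation would hand each _file group to whatever column-file
    // writer API the proposal ends up defining; since that does not exist yet,
    // this sketch stops at preparing the input.
    backfillInput.show(5, false);
  }
}
```

Whether that grouping and ordering should be the engine's responsibility or pushed down into the writer seems like one of the API questions worth settling early.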
