Hi Anurag and Peter,

It’s great to see how much interest the partial column update has drawn in the community. I have built an internal BackfillColumns action to backfill columns efficiently (it writes only the partial columns and copies the binary data of the other columns into a new DataFile). The speedup can be around 10x for wide tables, but the write amplification is still there. I would be happy to collaborate on this work and help eliminate the write amplification.
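To make this a bit more concrete, here is a rough sketch of the shape such an action could take. All names and signatures below are illustrative only, not the actual internal code, but the core idea is the same: recompute only the backfilled columns and byte-copy the untouched column chunks into the replacement file.

```java
import java.util.List;
import java.util.Map;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expression;

/**
 * Illustrative sketch only (hypothetical names and signatures): an action that
 * rewrites each matching data file by writing fresh chunks for the backfilled
 * columns and byte-copying the remaining column chunks into a new DataFile,
 * then swaps the old files for the new ones in a single commit.
 */
public interface BackfillColumnsSketch {

  /** Limit the rewrite to data files whose rows match the given filter. */
  BackfillColumnsSketch filter(Expression rowFilter);

  /** Register a column to backfill together with its per-file value source. */
  BackfillColumnsSketch backfillColumn(String columnName, ColumnValueSource source);

  /** Run the rewrite and return the original-to-rewritten file mapping. */
  Map<DataFile, DataFile> execute();

  /** Supplies new values for one column, positionally aligned with a source file. */
  interface ColumnValueSource {
    /** Must return exactly one value per row of {@code original}, in row order. */
    List<Object> valuesFor(DataFile original);
  }

  /** Hypothetical entry point; a real action would hang off a table-level API. */
  static BackfillColumnsSketch forTable(Table table) {
    throw new UnsupportedOperationException("illustrative sketch only");
  }
}
```

Even with the chunk-level copy, every byte of the untouched columns is still rewritten once per backfill, which is exactly the write amplification I would like us to eliminate.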
On 2026/01/27 10:12:54 Péter Váry wrote:
> Hi Anurag,
>
> It’s great to see how much interest there is in the community around this potential new feature. Gábor and I have actually submitted an Iceberg Summit talk proposal on this topic, and we would be very happy to collaborate on the work. I was mainly waiting for the File Format API to be finalized, as I believe this feature should build on top of it.
>
> For reference, our related work includes:
>
> - *Dev list thread:* https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
> - *Proposal document:* https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww (not shared widely yet)
> - *Performance testing PR for readers and writers:* https://github.com/apache/iceberg/pull/13306
>
> During earlier discussions about possible metadata changes, another option came up that hasn’t been documented yet: separating planner metadata from reader metadata. Since the planner does not need to know about the actual files, we could store the file composition in a separate file (potentially a Puffin file). This file could hold the column_files metadata, while the manifest would reference the Puffin file and blob position instead of the data filename.
> This approach has the advantage of keeping the existing metadata largely intact, and it could also give us a natural place later to add file-level indexes or Bloom filters for use during reads or secondary filtering. The downsides are the additional files and the increased complexity of identifying files that are no longer referenced by the table, so this may not be an ideal solution.
>
> I do have some concerns about the MoR metadata proposal described in the document. At first glance, it seems to complicate distributed planning, as all entries for a given file would need to be collected and merged to provide the information required by both the planner and the reader. Additionally, when a new column is added or updated, we would still need to add a new metadata entry for every existing data file. If we immediately write out the merged metadata, the total number of entries remains the same. The main benefit is avoiding rewriting statistics, which can be significant, but this comes at the cost of increased planning complexity. If we choose to store the merged statistics in the column_families entry, I don’t see much benefit in excluding the rest of the metadata, especially since including it would simplify the planning process.
>
> As Anton already pointed out, we should also discuss how this change would affect split handling, particularly how to avoid double reads when row groups are not aligned between the original data files and the new column files.
>
> Finally, I’d like to see some discussion around the Java API implications. In particular, what API changes are required, and how SQL engines would perform updates. Since the new column files must have the same number of rows as the original data files, with a strict one-to-one relationship, SQL engines would need access to the source filename, position, and deletion status in the DataFrame in order to generate the new files. This is more involved than a simple update and deserves some explicit consideration.
>
> Looking forward to your thoughts.
> Best regards,
> Peter
>
> On Tue, Jan 27, 2026, 03:58 Anurag Mantripragada <[email protected]> wrote:
>
> > Thanks Anton and others, for providing some initial feedback.
> > I will address all your comments soon.
> >
> > On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]> wrote:
> >
> >> I had a chance to see the proposal before it landed and I think it is a cool idea and both presented approaches would likely work. I am looking forward to discussing the tradeoffs and would encourage everyone to push/polish each approach to see what issues can be mitigated and what are fundamental.
> >>
> >> [1] Iceberg-native approach: better visibility into column files from the metadata, potentially better concurrency for non-overlapping column updates, no dep on Parquet.
> >> [2] Parquet-native approach: almost no changes to the table format metadata beyond tracking of base files.
> >>
> >> I think [1] sounds a bit better on paper but I am worried about the complexity in writers and readers (especially around keeping row groups aligned and split planning). It would be great to cover this in detail in the proposal.
> >>
> >> On Mon, Jan 26, 2026 at 09:00 Anurag Mantripragada <[email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> "Wide tables" with thousands of columns present significant challenges for AI/ML workloads, particularly when only a subset of columns needs to be added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR) operations in Iceberg apply at the row level, which leads to substantial write amplification in scenarios such as:
> >>>
> >>> - Feature Backfilling & Column Updates: Adding new feature columns (e.g., model embeddings) to petabyte-scale tables.
> >>> - Model Score Updates: Refresh prediction scores after retraining.
> >>> - Embedding Refresh: Updating vector embeddings, which currently triggers a rewrite of the entire row.
> >>> - Incremental Feature Computation: Daily updates to a small fraction of features in wide tables.
> >>>
> >>> With the Iceberg V4 proposal introducing single-file commits and column stats improvements, this is an ideal time to address column-level updates to better support these use cases.
> >>>
> >>> I have drafted a proposal that explores both table-format enhancements and file-format (Parquet) changes to enable more efficient updates.
> >>>
> >>> Proposal Details:
> >>> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
> >>> - Design Document: Efficient Column Updates in Iceberg <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
> >>>
> >>> Next Steps:
> >>> I plan to create POCs to benchmark the approaches described in the document.
> >>>
> >>> Please review the proposal and share your feedback.
> >>>
> >>> Thanks,
> >>> Anurag
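
Peter, coming back to your point about the Java API and how SQL engines would produce the new column files: below is a minimal sketch of how a Spark job could assemble the per-file, position-aligned input today, using Iceberg's _file and _pos metadata columns. The table name and the transform are made up, delete handling is left out, and the per-file column writer is exactly the API the proposal still needs to define.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ColumnBackfillInputSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("column-backfill-input-sketch")
        .getOrCreate();

    // Iceberg exposes the source data file path and the row position as the
    // _file and _pos metadata columns, so the engine can keep track of which
    // original file (and which row in it) every new value belongs to.
    Dataset<Row> source = spark.table("db.wide_table")
        .select(col("_file"), col("_pos"), col("features"));

    // Compute the new column (placeholder transform here), then keep the rows
    // grouped by source file and ordered by position so each group lines up
    // one-to-one with the rows of its original data file.
    Dataset<Row> backfillInput = source
        .withColumn("embedding_v2", col("features"))
        .repartition(col("_file"))
        .sortWithinPartitions(col("_pos"));

    // A real implementation would hand each _file group to whatever column-file
    // writer API the proposal ends up defining; since that does not exist yet,
    // this sketch stops at preparing the input.
    backfillInput.show(5, false);
  }
}
```

Whether that grouping and ordering should be the engine's responsibility or pushed down into the writer seems like one of the API questions worth settling early.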
