Hi Xianjin,

Happy to learn from your experience in supporting backfill use-cases. Please feel free to review the proposal and add your comments. I will wait a couple more days to make sure everyone has a chance to review it.
~ Anurag

On Tue, Jan 27, 2026 at 6:42 AM Xianjin Ye <[email protected]> wrote:

> Hi Anurag and Peter,
>
> It’s great to see that the partial column update has gained so much interest in the community. I internally built a BackfillColumns action to efficiently backfill columns (by writing only the partial columns and copying the binary data of the other columns into a new DataFile). The speedup can be 10x for wide tables, but the write amplification is still there. I would be happy to collaborate on the work and eliminate the write amplification.
>
> On 2026/01/27 10:12:54 Péter Váry wrote:
> > Hi Anurag,
> >
> > It’s great to see how much interest there is in the community around this potential new feature. Gábor and I have actually submitted an Iceberg Summit talk proposal on this topic, and we would be very happy to collaborate on the work. I was mainly waiting for the File Format API to be finalized, as I believe this feature should build on top of it.
> >
> > For reference, our related work includes:
> >
> > - *Dev list thread:* https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
> > - *Proposal document:* https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww (not shared widely yet)
> > - *Performance testing PR for readers and writers:* https://github.com/apache/iceberg/pull/13306
> >
> > During earlier discussions about possible metadata changes, another option came up that hasn’t been documented yet: separating planner metadata from reader metadata. Since the planner does not need to know about the actual files, we could store the file composition in a separate file (potentially a Puffin file). This file could hold the column_files metadata, while the manifest would reference the Puffin file and blob position instead of the data filename. This approach has the advantage of keeping the existing metadata largely intact, and it could also give us a natural place later to add file-level indexes or Bloom filters for use during reads or secondary filtering. The downsides are the additional files and the increased complexity of identifying files that are no longer referenced by the table, so this may not be an ideal solution.
> >
> > I do have some concerns about the MoR metadata proposal described in the document. At first glance, it seems to complicate distributed planning, as all entries for a given file would need to be collected and merged to provide the information required by both the planner and the reader. Additionally, when a new column is added or updated, we would still need to add a new metadata entry for every existing data file. If we immediately write out the merged metadata, the total number of entries remains the same. The main benefit is avoiding rewriting statistics, which can be significant, but this comes at the cost of increased planning complexity. If we choose to store the merged statistics in the column_families entry, I don’t see much benefit in excluding the rest of the metadata, especially since including it would simplify the planning process.
> >
> > As Anton already pointed out, we should also discuss how this change would affect split handling, particularly how to avoid double reads when row groups are not aligned between the original data files and the new column files.
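To make that alignment concern concrete, here is a minimal, self-contained sketch of the check a planner would need before it can pair the row groups of a base data file with those of a separately written column file. The class name and numbers are illustrative only, not something from the proposal.

// Sketch: given per-row-group row counts for a base data file and for a
// separately written column file, check whether their row-group boundaries
// land on the same row offsets. When they do, a planner can pair row groups
// one-to-one inside a split; when they do not, a split may cover only part of
// a row group in the column file, which is where double reads can creep in.
import java.util.Arrays;

public class RowGroupAlignmentSketch {

  static boolean aligned(long[] baseRowGroupRowCounts, long[] columnFileRowGroupRowCounts) {
    // Identical row-count sequences imply identical row-group boundaries.
    return Arrays.equals(baseRowGroupRowCounts, columnFileRowGroupRowCounts);
  }

  public static void main(String[] args) {
    long[] base = {100_000, 100_000, 50_000};
    long[] alignedColumnFile = {100_000, 100_000, 50_000};
    long[] misalignedColumnFile = {120_000, 80_000, 50_000};

    System.out.println(aligned(base, alignedColumnFile));    // true: row groups map 1:1
    System.out.println(aligned(base, misalignedColumnFile)); // false: some split would
                                                             // re-read part of a row group
  }
}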
> > Finally, I’d like to see some discussion around the Java API implications: in particular, what API changes are required and how SQL engines would perform updates. Since the new column files must have the same number of rows as the original data files, with a strict one-to-one relationship, SQL engines would need access to the source filename, position, and deletion status in the DataFrame in order to generate the new files. This is more involved than a simple update and deserves some explicit consideration.
> >
> > Looking forward to your thoughts.
> >
> > Best regards,
> > Peter
> >
> > On Tue, Jan 27, 2026, 03:58 Anurag Mantripragada <[email protected]> wrote:
> >
> > > Thanks Anton and others, for providing some initial feedback. I will address all your comments soon.
> > >
> > > On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]> wrote:
> > >
> > >> I had a chance to see the proposal before it landed, and I think it is a cool idea; both presented approaches would likely work. I am looking forward to discussing the tradeoffs and would encourage everyone to push/polish each approach to see which issues can be mitigated and which are fundamental.
> > >>
> > >> [1] Iceberg-native approach: better visibility into column files from the metadata, potentially better concurrency for non-overlapping column updates, no dependency on Parquet.
> > >> [2] Parquet-native approach: almost no changes to the table format metadata beyond tracking of base files.
> > >>
> > >> I think [1] sounds a bit better on paper, but I am worried about the complexity in writers and readers (especially around keeping row groups aligned and split planning). It would be great to cover this in detail in the proposal.
> > >>
> > >> On Mon, Jan 26, 2026 at 09:00 Anurag Mantripragada <[email protected]> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> "Wide tables" with thousands of columns present significant challenges for AI/ML workloads, particularly when only a subset of columns needs to be added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR) operations in Iceberg apply at the row level, which leads to substantial write amplification in scenarios such as:
> > >>>
> > >>> - Feature Backfilling & Column Updates: Adding new feature columns (e.g., model embeddings) to petabyte-scale tables.
> > >>> - Model Score Updates: Refreshing prediction scores after retraining.
> > >>> - Embedding Refresh: Updating vector embeddings, which currently triggers a rewrite of the entire row.
> > >>> - Incremental Feature Computation: Daily updates to a small fraction of features in wide tables.
> > >>>
> > >>> With the Iceberg V4 proposal introducing single-file commits and column stats improvements, this is an ideal time to address column-level updates to better support these use cases.
> > >>>
> > >>> I have drafted a proposal that explores both table-format enhancements and file-format (Parquet) changes to enable more efficient updates.
> > >>>
> > >>> Proposal Details:
> > >>> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
> > >>> - Design Document: Efficient Column Updates in Iceberg <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
> > >>>
> > >>> Next Steps:
> > >>> I plan to create POCs to benchmark the approaches described in the document.
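The write amplification that these benchmarks would measure is easy to put in rough numbers. A back-of-the-envelope sketch follows; every size below is invented for illustration and is not a benchmark result.

// Rough arithmetic only: compares the bytes rewritten by a row-level
// copy-on-write update of a single column against writing standalone column
// files, for a hypothetical wide table.
public class WriteAmplificationSketch {
  public static void main(String[] args) {
    int totalColumns = 2000;                        // hypothetical wide table
    long bytesPerColumnPerFile = 1L * 1024 * 1024;  // ~1 MiB of data per column per file
    long dataFiles = 10_000;

    long changedBytes = bytesPerColumnPerFile * dataFiles;                 // one column updated
    long cowRewrite   = bytesPerColumnPerFile * totalColumns * dataFiles;  // whole rows rewritten

    System.out.printf("logical change: %d GiB%n", changedBytes >> 30);
    System.out.printf("row-level COW rewrite: %d GiB (%.0fx amplification)%n",
        cowRewrite >> 30, (double) cowRewrite / changedBytes);
    // Writing only the new column files keeps the physical write close to the
    // logical change, which is the gap the proposal aims to close.
  }
}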
> > >>> Please review the proposal and share your feedback.
> > >>>
> > >>> Thanks,
> > >>> Anurag
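On Peter's note that SQL engines would need the source filename, position, and deletion status to produce one column file per data file, here is a minimal sketch of how that input could be assembled. It assumes Spark with the Iceberg runtime on the classpath, which exposes _file and _pos as metadata columns on reads; the table and column names are invented, and how deletion status is surfaced is left open here, as it is in the thread.

// Sketch of gathering per-row lineage (source file and position) alongside the
// columns being recomputed, so new column files can be written with the same
// row count and order as the data files they extend.
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ColumnBackfillInputSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("column-backfill-input-sketch")
        .getOrCreate();

    // Read the rows to recompute together with their source file and row position.
    Dataset<Row> input = spark.read()
        .table("db.wide_table")
        .select(col("_file"), col("_pos"), col("user_id"), col("embedding"));

    // A writer for the proposed column files would group by _file, order by _pos,
    // and emit one column file per source data file (not shown here).
    input.show(5, false);
  }
}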
