Hi Anurag,

It’s great to see how much interest there is in the community around this potential new feature. Gábor and I have actually submitted an Iceberg Summit talk proposal on this topic, and we would be very happy to collaborate on the work. I was mainly waiting for the File Format API to be finalized, as I believe this feature should build on top of it.
For reference, our related work includes:
- *Dev list thread:* https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
- *Proposal document:* https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww (not shared widely yet)
- *Performance testing PR for readers and writers:* https://github.com/apache/iceberg/pull/13306

During earlier discussions about possible metadata changes, another option came up that hasn’t been documented yet: separating planner metadata from reader metadata. Since the planner does not need to know about the actual files, we could store the file composition in a separate file (potentially a Puffin file). This file could hold the column_files metadata, while the manifest would reference the Puffin file and blob position instead of the data filename (a rough sketch of what such a reference might look like is included at the end of this message). This approach has the advantage of keeping the existing metadata largely intact, and it could also give us a natural place later to add file-level indexes or Bloom filters for use during reads or secondary filtering. The downsides are the additional files and the increased complexity of identifying files that are no longer referenced by the table, so this may not be an ideal solution.

I do have some concerns about the MoR metadata proposal described in the document. At first glance, it seems to complicate distributed planning, as all entries for a given file would need to be collected and merged to provide the information required by both the planner and the reader. Additionally, when a new column is added or updated, we would still need to add a new metadata entry for every existing data file, and if we immediately write out the merged metadata, the total number of entries remains the same. The main benefit is avoiding rewriting statistics, which can be significant, but it comes at the cost of increased planning complexity. If we choose to store the merged statistics in the column_families entry, I don’t see much benefit in excluding the rest of the metadata, especially since including it would simplify planning.

As Anton already pointed out, we should also discuss how this change would affect split handling, particularly how to avoid double reads when row groups are not aligned between the original data files and the new column files (see the row-group alignment sketch below).

Finally, I’d like to see some discussion around the Java API implications: what API changes are required, and how SQL engines would perform the updates. Since the new column files must have the same number of rows as the original data files, with a strict one-to-one relationship, SQL engines would need access to the source filename, position, and deletion status in the DataFrame in order to generate the new files. This is more involved than a simple update and deserves explicit consideration.
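To make some of the points above more concrete, here are a few rough sketches. First, the Puffin-based option: a minimal Java illustration of the two pieces involved. None of these classes exist in Iceberg today; the names and fields (ColumnFileGroupRef, ColumnFileComposition) are made up purely to show the planner/reader metadata split.

import java.util.Map;

// Hypothetical: what a manifest entry could carry instead of a data file path.
// The planner only needs the Puffin location and blob coordinates; it never
// has to look at the individual column files.
class ColumnFileGroupRef {
  String puffinLocation; // Puffin file that stores the composition blob
  long blobOffset;       // offset of the blob inside the Puffin file
  long blobLength;       // length of the blob, so the reader can fetch it directly
}

// Hypothetical: the blob content that the reader resolves lazily. It lists the
// base file plus the column files that together form one logical row set.
class ColumnFileComposition {
  String baseDataFile;              // original wide data file
  Map<Integer, String> columnFiles; // field id -> file holding the new values for that column
}

The same blob, or neighbouring blobs in the same Puffin file, would also be the natural home for the file-level indexes or Bloom filters mentioned above.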
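On the split handling point, the check that planning would need is roughly the following. This is a minimal sketch using the parquet-mr footer API, not Iceberg code, with error handling and footer caching omitted: if the row groups of a column file do not line up with the base file, the two files cannot simply share split offsets, and we either fall back to one task per file pair or have to stitch row ranges, which is where the double-read risk comes from.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowGroupAlignment {
  static boolean aligned(String baseFile, String columnFile, Configuration conf) throws IOException {
    List<BlockMetaData> base = rowGroups(baseFile, conf);
    List<BlockMetaData> cols = rowGroups(columnFile, conf);
    if (base.size() != cols.size()) {
      return false;
    }
    for (int i = 0; i < base.size(); i++) {
      // same number of rows per row group means reader tasks can pair up
      // row groups one-to-one without crossing split boundaries
      if (base.get(i).getRowCount() != cols.get(i).getRowCount()) {
        return false;
      }
    }
    return true;
  }

  private static List<BlockMetaData> rowGroups(String location, Configuration conf) throws IOException {
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(new Path(location), conf))) {
      return reader.getFooter().getBlocks();
    }
  }
}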
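And for the engine-side question: assuming Spark, and assuming the existing Iceberg metadata columns (_file, _pos, _deleted) can be surfaced to the update job, the read side of a column backfill could look roughly like the sketch below. Whether deleted rows can even reach such a job is part of the open question, since normal reads filter them out, and the write/commit side is exactly the missing API, so it is left as a comment.

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ColumnBackfillSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("column-backfill").getOrCreate();

    // Read the source rows together with the file they came from and their
    // position in that file, so the new column values can be produced in the
    // same order and with the same row count as each base data file.
    Dataset<Row> source = spark.read()
        .format("iceberg")
        .load("db.wide_table")
        .select(col("_file"), col("_pos"), col("_deleted"), col("id"), col("features"));

    // ... compute the new column per (_file, _pos), e.g. model scoring ...

    // Writing one column file per source _file and committing those files as
    // column file metadata is the part that still needs a public API.
  }
}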
Looking forward to your thoughts.

Best regards,
Peter

On Tue, Jan 27, 2026, 03:58 Anurag Mantripragada <[email protected]> wrote:

> Thanks Anton and others, for providing some initial feedback. I will address all your comments soon.
>
> On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]> wrote:
>
>> I had a chance to see the proposal before it landed and I think it is a cool idea and both presented approaches would likely work. I am looking forward to discussing the tradeoffs and would encourage everyone to push/polish each approach to see what issues can be mitigated and which are fundamental.
>>
>> [1] Iceberg-native approach: better visibility into column files from the metadata, potentially better concurrency for non-overlapping column updates, no dependency on Parquet.
>> [2] Parquet-native approach: almost no changes to the table format metadata beyond tracking of base files.
>>
>> I think [1] sounds a bit better on paper, but I am worried about the complexity in writers and readers (especially around keeping row groups aligned and split planning). It would be great to cover this in detail in the proposal.
>>
>> On Mon, Jan 26, 2026 at 09:00, Anurag Mantripragada <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> "Wide tables" with thousands of columns present significant challenges for AI/ML workloads, particularly when only a subset of columns needs to be added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR) operations in Iceberg apply at the row level, which leads to substantial write amplification in scenarios such as:
>>>
>>> - Feature Backfilling & Column Updates: Adding new feature columns (e.g., model embeddings) to petabyte-scale tables.
>>> - Model Score Updates: Refreshing prediction scores after retraining.
>>> - Embedding Refresh: Updating vector embeddings, which currently triggers a rewrite of the entire row.
>>> - Incremental Feature Computation: Daily updates to a small fraction of features in wide tables.
>>>
>>> With the Iceberg V4 proposal introducing single-file commits and column stats improvements, this is an ideal time to address column-level updates to better support these use cases.
>>>
>>> I have drafted a proposal that explores both table-format enhancements and file-format (Parquet) changes to enable more efficient updates.
>>>
>>> Proposal Details:
>>> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
>>> - Design Document: Efficient Column Updates in Iceberg <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
>>>
>>> Next Steps:
>>> I plan to create POCs to benchmark the approaches described in the document.
>>>
>>> Please review the proposal and share your feedback.
>>>
>>> Thanks,
>>> Anurag
>>>
>>
