Thanks Anton and others, for providing some initial feedback. I will address all your comments soon.
On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]> wrote: > I had a chance to see the proposal before it landed and I think it is a > cool idea and both presented approaches would likely work. I am looking > forward to discussing the tradeoffs and would encourage everyone to > push/polish each approach to see what issues can be mitigated and what are > fundamental. > > [1] Iceberg-native approach: better visibility into column files from the > metadata, potentially better concurrency for non-overlapping column > updates, no dep on Parquet. > [2] Parquet-native approach: almost no changes to the table format > metadata beyond tracking of base files. > > I think [1] sounds a bit better on paper but I am worried about the > complexity in writers and readers (especially around keeping row groups > aligned and split planning). It would be great to cover this in detail in > the proposal. > > пн, 26 січ. 2026 р. о 09:00 Anurag Mantripragada < > [email protected]> пише: > >> Hi all, >> >> "Wide tables" with thousands of columns present significant challenges >> for AI/ML workloads, particularly when only a subset of columns needs to be >> added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR) >> operations in Iceberg apply at the row level, which leads to substantial >> write amplification in scenarios such as: >> >> - Feature Backfilling & Column Updates: Adding new feature columns >> (e.g., model embeddings) to petabyte-scale tables. >> - Model Score Updates: Refresh prediction scores after retraining. >> - Embedding Refresh: Updating vector embeddings, which currently >> triggers a rewrite of the entire row. >> - Incremental Feature Computation: Daily updates to a small fraction >> of features in wide tables. >> >> With the Iceberg V4 proposal introducing single-file commits and column >> stats improvements, this is an ideal time to address column-level updates >> to better support these use cases. >> >> I have drafted a proposal that explores both table-format enhancements >> and file-format (Parquet) changes to enable more efficient updates. >> >> Proposal Details: >> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146> >> - Design Document: Efficient Column Updates in Iceberg >> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0> >> >> Next Steps: >> I plan to create POCs to benchmark the approaches described in the >> document. >> >> Please review the proposal and share your feedback. >> >> Thanks, >> Anurag >> >
