I had a chance to see the proposal before it landed. I think it is a cool idea, and both presented approaches would likely work. I am looking forward to discussing the tradeoffs and would encourage everyone to push and polish each approach to see which issues can be mitigated and which are fundamental.
[1] Iceberg-native approach: better visibility into column files from the metadata, potentially better concurrency for non-overlapping column updates, no dependency on Parquet.
[2] Parquet-native approach: almost no changes to the table format metadata beyond tracking of base files.

I think [1] sounds a bit better on paper, but I am worried about the complexity in writers and readers (especially around keeping row groups aligned and split planning). It would be great to cover this in detail in the proposal.

On Mon, Jan 26, 2026 at 09:00, Anurag Mantripragada <[email protected]> wrote:

> Hi all,
>
> "Wide tables" with thousands of columns present significant challenges for
> AI/ML workloads, particularly when only a subset of columns needs to be
> added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR)
> operations in Iceberg apply at the row level, which leads to substantial
> write amplification in scenarios such as:
>
> - Feature Backfilling & Column Updates: Adding new feature columns
>   (e.g., model embeddings) to petabyte-scale tables.
> - Model Score Updates: Refresh prediction scores after retraining.
> - Embedding Refresh: Updating vector embeddings, which currently
>   triggers a rewrite of the entire row.
> - Incremental Feature Computation: Daily updates to a small fraction
>   of features in wide tables.
>
> With the Iceberg V4 proposal introducing single-file commits and column
> stats improvements, this is an ideal time to address column-level updates
> to better support these use cases.
>
> I have drafted a proposal that explores both table-format enhancements and
> file-format (Parquet) changes to enable more efficient updates.
>
> Proposal Details:
> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
> - Design Document: Efficient Column Updates in Iceberg
>   <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
>
> Next Steps:
> I plan to create POCs to benchmark the approaches described in the
> document.
>
> Please review the proposal and share your feedback.
>
> Thanks,
> Anurag
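
To make the row-group alignment concern in [1] a bit more concrete, here is a minimal sketch in plain Python. It is not taken from the proposal: the function names are hypothetical, row groups are modeled only as lists of row counts, and it assumes a reader can start a split only at a row offset where both the base file and the separate column file end a row group.

```python
from itertools import accumulate


def row_group_boundaries(row_counts):
    """Return the cumulative row offsets at which each row group ends."""
    return list(accumulate(row_counts))


def is_aligned(base_row_counts, column_file_row_counts):
    """True when both files cut the same rows into identical row groups."""
    return (row_group_boundaries(base_row_counts)
            == row_group_boundaries(column_file_row_counts))


def shared_split_points(base_row_counts, column_file_row_counts):
    """Row offsets where both files end a row group.

    Under the assumption stated above, only these offsets are usable
    as split boundaries when the two files are read together.
    """
    base = set(row_group_boundaries(base_row_counts))
    column = set(row_group_boundaries(column_file_row_counts))
    return sorted(base & column)


if __name__ == "__main__":
    base = [100, 100, 50]      # hypothetical base file row groups
    column_file = [125, 125]   # hypothetical column file row groups
    print(is_aligned(base, column_file))
    print(shared_split_points(base, column_file))
```

With misaligned row groups the set of shared split points shrinks, which is one way the writer-side alignment requirement and reader-side split planning interact.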
