I remember that Peter started a related discussion in [1] and spent some time on the design and benchmarking of a similar approach (introducing column families).
Perhaps there is an opportunity to join the effort?

[1] https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9

On Tue, Jan 27, 2026 at 3:10 AM Anton Okolnychyi <[email protected]> wrote:
>
> I had a chance to see the proposal before it landed, and I think it is a cool
> idea; both presented approaches would likely work. I am looking forward to
> discussing the tradeoffs and would encourage everyone to push and polish each
> approach to see which issues can be mitigated and which are fundamental.
>
> [1] Iceberg-native approach: better visibility into column files from the
> metadata, potentially better concurrency for non-overlapping column updates,
> no dependency on Parquet.
> [2] Parquet-native approach: almost no changes to the table format metadata
> beyond tracking of base files.
>
> I think [1] sounds a bit better on paper, but I am worried about the
> complexity in writers and readers (especially around keeping row groups
> aligned and split planning). It would be great to cover this in detail in the
> proposal.
>
> On Mon, Jan 26, 2026 at 9:00 AM Anurag Mantripragada
> <[email protected]> wrote:
>>
>> Hi all,
>>
>> "Wide tables" with thousands of columns present significant challenges for
>> AI/ML workloads, particularly when only a subset of columns needs to be
>> added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR)
>> operations in Iceberg apply at the row level, which leads to substantial
>> write amplification in scenarios such as:
>>
>> Feature Backfilling & Column Updates: Adding new feature columns (e.g.,
>> model embeddings) to petabyte-scale tables.
>> Model Score Updates: Refreshing prediction scores after retraining.
>> Embedding Refresh: Updating vector embeddings, which currently triggers a
>> rewrite of the entire row.
>> Incremental Feature Computation: Daily updates to a small fraction of
>> features in wide tables.
>>
>> With the Iceberg V4 proposal introducing single-file commits and column
>> stats improvements, this is an ideal time to address column-level updates to
>> better support these use cases.
>>
>> I have drafted a proposal that explores both table-format enhancements and
>> file-format (Parquet) changes to enable more efficient updates.
>>
>> Proposal details:
>> - GitHub issue: #15146
>> - Design document: Efficient Column Updates in Iceberg
>>
>> Next steps:
>> I plan to create POCs to benchmark the approaches described in the document.
>>
>> Please review the proposal and share your feedback.
>>
>> Thanks,
>> Anurag
