Hi all,

"Wide tables" with thousands of columns present significant challenges for AI/ML workloads, particularly when only a subset of columns needs to be added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR) operations in Iceberg apply at the row level, which leads to substantial write amplification in scenarios such as:

- Feature Backfilling & Column Updates: Adding new feature columns (e.g., model embeddings) to petabyte-scale tables.
- Model Score Updates: Refreshing prediction scores after retraining.
- Embedding Refresh: Updating vector embeddings, which currently triggers a rewrite of the entire row.
- Incremental Feature Computation: Daily updates to a small fraction of features in wide tables.
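To make the write amplification concrete, here is a minimal sketch (Spark SQL issued from Java, assuming a Spark session already configured with an Iceberg catalog; the table name ml.features, the column names, and the predicate are hypothetical):

    import org.apache.spark.sql.SparkSession;

    public class SingleColumnUpdate {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("single-column-update-sketch")
                    .getOrCreate();

            // This updates one column, but under COW every data file that
            // contains a matching row is rewritten in full: for a table with
            // 2,000 columns, that is on the order of 2,000x the bytes of the
            // logical change. Under MOR, the replacement rows still carry all
            // columns, so the write path scales with row width rather than
            // with the single updated column.
            spark.sql("UPDATE ml.features "
                    + "SET model_score = model_score * 0.5 "
                    + "WHERE model_version = 'v2'");

            spark.stop();
        }
    }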
With the Iceberg V4 proposal introducing single-file commits and column stats improvements, this is an ideal time to address column-level updates to better support these use cases. I have drafted a proposal that explores both table-format enhancements and file-format (Parquet) changes to enable more efficient updates.

Proposal Details:
- GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
- Design Document: Efficient Column Updates in Iceberg <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>

Next Steps: I plan to create POCs to benchmark the approaches described in the document.

Please review the proposal and share your feedback.

Thanks,
Anurag
