Hi all,

"Wide tables" with thousands of columns present significant challenges for
AI/ML workloads, particularly when only a subset of columns needs to be
added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR)
operations in Iceberg apply at the row level, which leads to substantial
write amplification in scenarios such as:

   - Feature Backfilling & Column Updates: Adding new feature columns
   (e.g., model embeddings) to petabyte-scale tables.
   - Model Score Updates: Refreshing prediction scores after retraining.
   - Embedding Refresh: Updating vector embeddings, which currently
   triggers a rewrite of the entire row.
   - Incremental Feature Computation: Updating a small fraction of the
   features in a wide table each day.
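
To make the write amplification concrete, here is a minimal PySpark
sketch. It is illustrative only: the catalog, table, and column names are
hypothetical, and it assumes a Spark session configured with the Iceberg
runtime and SQL extensions.

    from pyspark.sql import SparkSession

    # Assumes the Iceberg Spark runtime jar is on the classpath and a
    # catalog named "catalog" is configured.
    spark = (
        SparkSession.builder
        .appName("single-column-update")
        .config(
            "spark.sql.extensions",
            "org.apache.iceberg.spark.extensions"
            ".IcebergSparkSessionExtensions",
        )
        .getOrCreate()
    )

    # Touches 1 of N columns, but under COW every data file containing
    # a matched row is rewritten in full, untouched columns included.
    spark.sql("""
        UPDATE catalog.ml.features
        SET embedding = array(0.1D, 0.2D, 0.3D)
        WHERE ds = '2024-06-01'
    """)

With column-level updates, only the bytes for the changed column would
need to be written, rather than every column of every affected row.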

With the Iceberg V4 proposal introducing single-file commits and
column-stats improvements, this is an ideal time to add support for
column-level updates and better serve these use cases.

I have drafted a proposal that explores both table-format enhancements and
file-format (Parquet) changes to enable more efficient updates.

Proposal Details:
- GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
- Design Document: Efficient Column Updates in Iceberg
<https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>

Next Steps:
I plan to build proof-of-concept implementations (POCs) to benchmark the
approaches described in the document.

Please review the proposal and share your feedback.

Thanks,
Anurag
