I had a chance to see the proposal before it landed and I think it is a
cool idea and both presented approaches would likely work. I am looking
forward to discussing the tradeoffs and would encourage everyone to
push/polish each approach to see what issues can be mitigated and what are
fundamental.

[1] Iceberg-native approach: better visibility into column files from the
metadata, potentially better concurrency for non-overlapping column
updates, no dep on Parquet.
[2] Parquet-native approach: almost no changes to the table format metadata
beyond tracking of base files.

I think [1] sounds a bit better on paper but I am worried about the
complexity in writers and readers (especially around keeping row groups
aligned and split planning). It would be great to cover this in detail in
the proposal.

пн, 26 січ. 2026 р. о 09:00 Anurag Mantripragada <
[email protected]> пише:

> Hi all,
>
> "Wide tables" with thousands of columns present significant challenges for
> AI/ML workloads, particularly when only a subset of columns needs to be
> added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR)
> operations in Iceberg apply at the row level, which leads to substantial
> write amplification in scenarios such as:
>
>    - Feature Backfilling & Column Updates: Adding new feature columns
>    (e.g., model embeddings) to petabyte-scale tables.
>    - Model Score Updates: Refresh prediction scores after retraining.
>    - Embedding Refresh: Updating vector embeddings, which currently
>    triggers a rewrite of the entire row.
>    - Incremental Feature Computation: Daily updates to a small fraction
>    of features in wide tables.
>
> With the Iceberg V4 proposal introducing single-file commits and column
> stats improvements, this is an ideal time to address column-level updates
> to better support these use cases.
>
> I have drafted a proposal that explores both table-format enhancements and
> file-format (Parquet) changes to enable more efficient updates.
>
> Proposal Details:
> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
> - Design Document: Efficient Column Updates in Iceberg
> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
>
> Next Steps:
> I plan to create POCs to benchmark the approaches described in the
> document.
>
> Please review the proposal and share your feedback.
>
> Thanks,
> Anurag
>

Reply via email to