Re: [Discuss] Efficient column updates in Iceberg

Anurag Mantripragada Mon, 26 Jan 2026 18:58:39 -0800

Thanks, Anton and others, for the initial feedback. I will address all
your comments soon.


On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]>
wrote:

> I had a chance to see the proposal before it landed. I think it is a
> cool idea, and both presented approaches would likely work. I am looking
> forward to discussing the tradeoffs and would encourage everyone to
> push/polish each approach to see which issues can be mitigated and which
> are fundamental.
>
> [1] Iceberg-native approach: better visibility into column files from the
> metadata, potentially better concurrency for non-overlapping column
> updates, no dependency on Parquet.
> [2] Parquet-native approach: almost no changes to the table format
> metadata beyond tracking of base files.
>
> I think [1] sounds a bit better on paper, but I am worried about the
> complexity in writers and readers (especially around keeping row groups
> aligned and split planning). It would be great to cover this in detail in
> the proposal.
>
> On Mon, Jan 26, 2026 at 09:00 Anurag Mantripragada <
> [email protected]> wrote:
>
>> Hi all,
>>
>> "Wide tables" with thousands of columns present significant challenges
>> for AI/ML workloads, particularly when only a subset of columns needs to be
>> added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR)
>> operations in Iceberg apply at the row level, which leads to substantial
>> write amplification in scenarios such as:
>>
>>    - Feature Backfilling & Column Updates: Adding new feature columns
>>    (e.g., model embeddings) to petabyte-scale tables.
>>    - Model Score Updates: Refreshing prediction scores after retraining.
>>    - Embedding Refresh: Updating vector embeddings, which currently
>>    triggers a rewrite of the entire row.
>>    - Incremental Feature Computation: Daily updates to a small fraction
>>    of features in wide tables.
>>
>> With the Iceberg V4 proposal introducing single-file commits and column
>> stats improvements, this is an ideal time to address column-level updates
>> to better support these use cases.
>>
>> I have drafted a proposal that explores both table-format enhancements
>> and file-format (Parquet) changes to enable more efficient updates.
>>
>> Proposal Details:
>> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
>> - Design Document: Efficient Column Updates in Iceberg
>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
>>
>> Next Steps:
>> I plan to create POCs to benchmark the approaches described in the
>> document.
>>
>> Please review the proposal and share your feedback.
>>
>> Thanks,
>> Anurag
>>
>
