I remember that Peter initiated a related discussion in [1] and
spent some time on the design and benchmarking of a similar approach
(introducing column families).

Perhaps there is an opportunity to join the effort?

[1] https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9

On Tue, Jan 27, 2026 at 3:10 AM Anton Okolnychyi <[email protected]> wrote:
>
> I had a chance to see the proposal before it landed. I think it is a cool
> idea, and both presented approaches would likely work. I am looking forward to
> discussing the tradeoffs and would encourage everyone to push/polish each
> approach to see which issues can be mitigated and which are fundamental.
>
> [1] Iceberg-native approach: better visibility into column files from the
> metadata, potentially better concurrency for non-overlapping column updates,
> no dependency on Parquet.
> [2] Parquet-native approach: almost no changes to the table format metadata 
> beyond tracking of base files.
>
> I think [1] sounds a bit better on paper but I am worried about the 
> complexity in writers and readers (especially around keeping row groups 
> aligned and split planning). It would be great to cover this in detail in the 
> proposal.
>
> On Mon, Jan 26, 2026 at 09:00 Anurag Mantripragada <[email protected]> wrote:
>>
>> Hi all,
>>
>> "Wide tables" with thousands of columns present significant challenges for 
>> AI/ML workloads, particularly when only a subset of columns needs to be 
>> added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR) 
>> operations in Iceberg apply at the row level, which leads to substantial 
>> write amplification in scenarios such as:
>>
>> - Feature Backfilling & Column Updates: Adding new feature columns (e.g.,
>>   model embeddings) to petabyte-scale tables.
>> - Model Score Updates: Refreshing prediction scores after retraining.
>> - Embedding Refresh: Updating vector embeddings, which currently triggers a
>>   rewrite of the entire row.
>> - Incremental Feature Computation: Daily updates to a small fraction of
>>   features in wide tables.
>>
>> With the Iceberg V4 proposal introducing single-file commits and column 
>> stats improvements, this is an ideal time to address column-level updates to 
>> better support these use cases.
>>
>> I have drafted a proposal that explores both table-format enhancements and 
>> file-format (Parquet) changes to enable more efficient updates.
>>
>> Proposal Details:
>> - GitHub Issue: #15146
>> - Design Document: Efficient Column Updates in Iceberg
>>
>> Next Steps:
>> I plan to create POCs to benchmark the approaches described in the document.
>>
>> Please review the proposal and share your feedback.
>>
>> Thanks,
>> Anurag
