Hi Gang,

Thanks for the pointers. I reviewed Peter's column family design and the
related dev list discussions while researching this proposal.

The current design differs in two ways:
- It is built on top of V4 metadata structures.
- It generalizes the column family approach, which otherwise requires
pre-planning how columns are assigned to specific families.

The reader implementations have significant overlap, particularly
regarding row alignment and positional stitching.
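To make the stitching concrete, here is a minimal sketch in PyArrow. The
file names, the single-column update file, and the 1:1 row-group
alignment are all assumptions for illustration; this is not either
proposal's actual reader logic:

```python
# Purely illustrative sketch: stitching a base Parquet file with a
# separately written column file by row position. Assumes both files
# agree on row count, row ordering, and row-group boundaries.
import pyarrow as pa
import pyarrow.parquet as pq

base = pq.ParquetFile("base.parquet")                   # hypothetical path
updates = pq.ParquetFile("embeddings_update.parquet")   # hypothetical path

# Row alignment: row i of the column file must correspond to row i of
# the base file, or positional stitching silently produces wrong rows.
assert base.metadata.num_rows == updates.metadata.num_rows

batches = []
for i in range(base.metadata.num_row_groups):
    b = base.read_row_group(i)
    u = updates.read_row_group(i)  # assumes row groups are aligned 1:1
    assert b.num_rows == u.num_rows
    # Positional stitch: replace or append each updated column.
    for name in u.column_names:
        if name in b.column_names:
            b = b.set_column(b.column_names.index(name), name, u.column(name))
        else:
            b = b.append_column(name, u.column(name))
    batches.append(b)

stitched = pa.concat_tables(batches)
```

The row-group alignment assumption is exactly where the reader/writer
complexity comes in: if the files' row groups drift out of alignment,
the stitch can no longer proceed streaming per row group.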
I have already reached out to Peter to collaborate on this proposal.

~ Anurag

On Mon, Jan 26, 2026 at 6:08 PM Gang Wu <[email protected]> wrote:

> I remember that Peter has initiated a relevant discussion in [1] and
> spent some time on the design and benchmarking of a similar approach
> (introducing column families).
>
> Perhaps there is an opportunity to join the effort?
>
> [1] https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
>
> On Tue, Jan 27, 2026 at 3:10 AM Anton Okolnychyi <[email protected]> wrote:
> >
> > I had a chance to see the proposal before it landed, and I think it
> > is a cool idea; both presented approaches would likely work. I am
> > looking forward to discussing the tradeoffs and would encourage
> > everyone to push/polish each approach to see which issues can be
> > mitigated and which are fundamental.
> >
> > [1] Iceberg-native approach: better visibility into column files
> > from the metadata, potentially better concurrency for
> > non-overlapping column updates, and no dependency on Parquet.
> > [2] Parquet-native approach: almost no changes to the table format
> > metadata beyond tracking of base files.
> >
> > I think [1] sounds a bit better on paper, but I am worried about the
> > complexity in writers and readers (especially around keeping row
> > groups aligned and split planning). It would be great to cover this
> > in detail in the proposal.
> >
> > On Mon, Jan 26, 2026 at 09:00 Anurag Mantripragada
> > <[email protected]> wrote:
> >>
> >> Hi all,
> >>
> >> "Wide tables" with thousands of columns present significant
> >> challenges for AI/ML workloads, particularly when only a subset of
> >> columns needs to be added or updated. Current Copy-on-Write (COW)
> >> and Merge-on-Read (MOR) operations in Iceberg apply at the row
> >> level, which leads to substantial write amplification in scenarios
> >> such as:
> >>
> >> Feature Backfilling & Column Updates: Adding new feature columns
> >> (e.g., model embeddings) to petabyte-scale tables.
> >> Model Score Updates: Refreshing prediction scores after retraining.
> >> Embedding Refresh: Updating vector embeddings, which currently
> >> triggers a rewrite of the entire row.
> >> Incremental Feature Computation: Daily updates to a small fraction
> >> of features in wide tables.
> >>
> >> With the Iceberg V4 proposal introducing single-file commits and
> >> column stats improvements, this is an ideal time to address
> >> column-level updates to better support these use cases.
> >>
> >> I have drafted a proposal that explores both table-format
> >> enhancements and file-format (Parquet) changes to enable more
> >> efficient updates.
> >>
> >> Proposal Details:
> >> - GitHub Issue: #15146
> >> - Design Document: Efficient Column Updates in Iceberg
> >>
> >> Next Steps:
> >> I plan to create POCs to benchmark the approaches described in the
> >> document.
> >>
> >> Please review the proposal and share your feedback.
> >>
> >> Thanks,
> >> Anurag
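P.S. To put rough numbers on the write amplification described in the
original note above: a back-of-the-envelope sketch. The table
dimensions and per-value sizes are illustrative assumptions, not
figures from the proposal or a benchmark.

```python
# Back-of-the-envelope write-amplification estimate; all numbers are
# assumed for illustration only.
rows = 1_000_000_000     # 1B-row table
num_columns = 2_000      # "wide table"
bytes_per_value = 8      # assumed average encoded size per value

# Row-level COW: updating one column rewrites every column of every row.
cow_rewrite_bytes = rows * num_columns * bytes_per_value

# Column-level update: write only the changed column, aligned by row
# position (small positional bookkeeping ignored here).
column_update_bytes = rows * bytes_per_value

print(f"COW rewrite:   {cow_rewrite_bytes / 1e12:.0f} TB")
print(f"Column update: {column_update_bytes / 1e9:.0f} GB")
print(f"Amplification: {cow_rewrite_bytes // column_update_bytes}x")
```

Under these assumptions, a single-column refresh rewrites ~16 TB under
row-level COW versus ~8 GB for a column-level update, i.e., roughly the
column count (2,000x) in amplification.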
