Hi Gang,

Thanks for the pointers. I reviewed Peter's column family design and the
related dev list discussions while researching this proposal.

The current design differs in two ways:

   - It is built on top of V4 metadata structures.
   - It generalizes the column family approach, which otherwise requires
   pre-planning how columns are assigned to specific families.

The reader implementations have significant overlap, particularly regarding
row alignment and positional stitching. I have already reached out to Peter
to collaborate on this proposal.
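As background on the stitching step mentioned above, here is a minimal
sketch of position-based alignment, assuming base rows and the new column's
values are materialized in the same order. All names are illustrative and
not from either proposal; real readers would operate on Parquet row groups
and column chunks, not Python dicts.

```python
def stitch_by_position(base_rows, column_values, new_column):
    """Attach a new column's values to base rows by ordinal position.

    No join keys are needed: row i of the column file is assumed to
    correspond to row i of the base file, which is why writers must keep
    the two files row-aligned.
    """
    if len(base_rows) != len(column_values):
        raise ValueError("column file must cover every base row")
    return [
        {**row, new_column: value}
        for row, value in zip(base_rows, column_values)
    ]


# Illustrative data: two base rows plus a separately written embedding column.
base = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
embeddings = [[0.1, 0.2], [0.3, 0.4]]
stitched = stitch_by_position(base, embeddings, "embedding")
```

The point of the sketch is that misalignment (e.g., a compaction that
reorders the base file) silently corrupts results, which is why keeping row
groups aligned is the hard part for both approaches.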

~ Anurag

On Mon, Jan 26, 2026 at 6:08 PM Gang Wu <[email protected]> wrote:

> I remember that Peter has initiated a relevant discussion in [1] and
> spent some time on the design and benchmark of a similar approach
> (introducing column families).
>
> Perhaps there is an opportunity to join the effort?
>
> [1] https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
>
> On Tue, Jan 27, 2026 at 3:10 AM Anton Okolnychyi <[email protected]>
> wrote:
> >
> > I had a chance to see the proposal before it landed and I think it is a
> cool idea and both presented approaches would likely work. I am looking
> forward to discussing the tradeoffs and would encourage everyone to
> push/polish each approach to see which issues can be mitigated and which
> are fundamental.
> >
> > [1] Iceberg-native approach: better visibility into column files from
> the metadata, potentially better concurrency for non-overlapping column
> updates, and no dependency on Parquet.
> > [2] Parquet-native approach: almost no changes to the table format
> metadata beyond tracking of base files.
> >
> > I think [1] sounds a bit better on paper but I am worried about the
> complexity in writers and readers (especially around keeping row groups
> aligned and split planning). It would be great to cover this in detail in
> the proposal.
> >
> > On Mon, Jan 26, 2026 at 9:00 AM Anurag Mantripragada <
> [email protected]> wrote:
> >>
> >> Hi all,
> >>
> >> "Wide tables" with thousands of columns present significant challenges
> for AI/ML workloads, particularly when only a subset of columns needs to be
> added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR)
> operations in Iceberg apply at the row level, which leads to substantial
> write amplification in scenarios such as:
> >>
> >> - Feature Backfilling & Column Updates: adding new feature columns
> (e.g., model embeddings) to petabyte-scale tables.
> >> - Model Score Updates: refreshing prediction scores after retraining.
> >> - Embedding Refresh: updating vector embeddings, which currently
> triggers a rewrite of the entire row.
> >> - Incremental Feature Computation: daily updates to a small fraction
> of features in wide tables.
> >>
> >> With the Iceberg V4 proposal introducing single-file commits and column
> stats improvements, this is an ideal time to address column-level updates
> to better support these use cases.
> >>
> >> I have drafted a proposal that explores both table-format enhancements
> and file-format (Parquet) changes to enable more efficient updates.
> >>
> >> Proposal Details:
> >> - GitHub Issue: #15146
> >> - Design Document: Efficient Column Updates in Iceberg
> >>
> >> Next Steps:
> >> I plan to create POCs to benchmark the approaches described in the
> document.
> >>
> >> Please review the proposal and share your feedback.
> >>
> >> Thanks,
> >> Anurag
>