Hi Anurag,

It’s great to see how much interest there is in the community around this
potential new feature. Gábor and I have actually submitted an Iceberg
Summit talk proposal on this topic, and we would be very happy to
collaborate on the work. I was mainly waiting for the File Format API to be
finalized, as I believe this feature should build on top of it.

For reference, our related work includes:

   - *Dev list thread:*
   https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
   - *Proposal document:*
   https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww
   (not shared widely yet)
   - *Performance testing PR for readers and writers:*
   https://github.com/apache/iceberg/pull/13306

During earlier discussions about possible metadata changes, another option
came up that hasn’t been documented yet: separating planner metadata from
reader metadata. Since the planner does not need to know which physical
files make up a logical data file, we could store the file composition in a
separate file (potentially a Puffin file). This file could hold the
column_files metadata, while the manifest would reference the Puffin file
and blob position instead of the data filename.
This approach has the advantage of keeping the existing metadata largely
intact, and it could also give us a natural place later to add file-level
indexes or Bloom filters for use during reads or secondary filtering. The
downsides are the additional files and the increased complexity of
identifying files that are no longer referenced by the table, so this may
not be an ideal solution.
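
To make this a bit more concrete, here is a rough sketch of the shapes
involved (the names are made up for illustration and are not part of any
existing Iceberg or Puffin API):

    import java.util.Map;

    // Hypothetical blob stored in a Puffin file; the planner never opens it.
    // It records which physical files make up one logical data file.
    record ColumnFilesBlob(
        String baseDataFile,              // original file with the untouched columns
        Map<Integer, String> columnFiles  // field id -> column file with updated values
    ) {}

    // Hypothetical manifest-side reference: instead of pointing at a data
    // filename, the entry points at the Puffin file and the blob inside it.
    record ColumnFilesRef(
        String puffinFile,
        long blobOffset,
        long blobLength
    ) {}

The planner would keep working off the manifest entry and its statistics,
and only the reader would open the blob to learn the actual file
composition.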

I do have some concerns about the MoR metadata proposal described in the
document. At first glance, it seems to complicate distributed planning, as
all entries for a given file would need to be collected and merged to
provide the information required by both the planner and the reader.
Additionally, when a new column is added or updated, we would still need to
add a new metadata entry for every existing data file. If we immediately
write out the merged metadata, the total number of entries remains the
same. The main benefit is that we avoid rewriting the statistics, which can
be a significant saving, but it comes at the cost of increased planning
complexity.
If we choose to store the merged statistics in the column_families entry, I
don’t see much benefit in excluding the rest of the metadata, especially
since including it would simplify the planning process.
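
To illustrate the planning concern, here is a minimal sketch (hypothetical
types, not a proposed API) of the merge step I would expect a distributed
planner to perform before it can hand complete information to the reader:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Hypothetical per-column MoR metadata entry.
    record ColumnUpdateEntry(String baseFile, int fieldId, String columnFile) {}

    class PlanningMerge {
      // Every entry touching the same base file has to be brought together
      // (a group-by/shuffle in a distributed planner) before the planner and
      // the reader have a complete view of that file.
      static Map<String, List<ColumnUpdateEntry>> mergeByBaseFile(
          List<ColumnUpdateEntry> entries) {
        return entries.stream()
            .collect(Collectors.groupingBy(ColumnUpdateEntry::baseFile));
      }
    }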

As Anton already pointed out, we should also discuss how this change would
affect split handling, particularly how to avoid double reads when row
groups are not aligned between the original data files and the new column
files.
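
To give a concrete (entirely made-up) example: if a base file has row
groups covering rows [0, 1M) and [1M, 2M) while its column file has row
groups covering [0, 1.5M) and [1.5M, 2M), a split built around the base
file's second row group overlaps both column-file row groups, so
neighbouring splits end up reading parts of the column file twice:

    // Toy illustration of the misalignment problem (all numbers invented).
    class SplitOverlap {
      // Row-group boundaries expressed as [start, end) row offsets.
      static boolean overlaps(long splitStart, long splitEnd, long rgStart, long rgEnd) {
        return splitStart < rgEnd && rgStart < splitEnd;
      }

      public static void main(String[] args) {
        long[][] columnFileRowGroups = {{0, 1_500_000}, {1_500_000, 2_000_000}};
        long splitStart = 1_000_000, splitEnd = 2_000_000; // base file's 2nd row group
        for (long[] rg : columnFileRowGroups) {
          if (overlaps(splitStart, splitEnd, rg[0], rg[1])) {
            System.out.printf("split [%d, %d) must read column row group [%d, %d)%n",
                splitStart, splitEnd, rg[0], rg[1]);
          }
        }
      }
    }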

Finally, I’d like to see some discussion around the Java API implications:
in particular, which API changes are required and how SQL engines would
perform updates. Since the new column files must have the same number of
rows as the original data files, with a strict one-to-one relationship, SQL
engines would need access to the source filename, position, and deletion
status in the DataFrame in order to generate the new files. This is more
involved than a simple update and deserves some explicit consideration.
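
As a rough sketch of what the read side could look like (using Spark's Java
API; the table and column names below are made up, and exactly how each
engine exposes the reserved _file, _pos and _deleted metadata columns is an
assumption on my part):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ColumnUpdateSource {
      // Sketch only: pull the reserved metadata columns alongside the data.
      // A column-update writer would then have to group rows by _file, keep
      // them in _pos order, and emit exactly one new column file per base
      // data file with the same row count.
      static Dataset<Row> sourceRows(SparkSession spark) {
        return spark.sql(
            "SELECT _file, _pos, _deleted, id, embedding "
                + "FROM catalog.db.wide_table");
      }
    }

Whether engines should surface this through metadata columns or through a
dedicated writer API is exactly the kind of question I think the proposal
should answer.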

Looking forward to your thoughts.
Best regards,
Peter

On Tue, Jan 27, 2026, 03:58 Anurag Mantripragada <[email protected]>
wrote:

> Thanks, Anton and others, for providing some initial feedback. I will
> address all your comments soon.
>
> On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]>
> wrote:
>
>> I had a chance to see the proposal before it landed and I think it is a
>> cool idea and both presented approaches would likely work. I am looking
>> forward to discussing the tradeoffs and would encourage everyone to
>> push/polish each approach to see which issues can be mitigated and which
>> are fundamental.
>>
>> [1] Iceberg-native approach: better visibility into column files from the
>> metadata, potentially better concurrency for non-overlapping column
>> updates, no dep on Parquet.
>> [2] Parquet-native approach: almost no changes to the table format
>> metadata beyond tracking of base files.
>>
>> I think [1] sounds a bit better on paper but I am worried about the
>> complexity in writers and readers (especially around keeping row groups
>> aligned and split planning). It would be great to cover this in detail in
>> the proposal.
>>
>> On Mon, Jan 26, 2026 at 09:00 Anurag Mantripragada <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> "Wide tables" with thousands of columns present significant challenges
>>> for AI/ML workloads, particularly when only a subset of columns needs to be
>>> added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR)
>>> operations in Iceberg apply at the row level, which leads to substantial
>>> write amplification in scenarios such as:
>>>
>>>    - Feature Backfilling & Column Updates: Adding new feature columns
>>>    (e.g., model embeddings) to petabyte-scale tables.
>>>    - Model Score Updates: Refreshing prediction scores after retraining.
>>>    - Embedding Refresh: Updating vector embeddings, which currently
>>>    triggers a rewrite of the entire row.
>>>    - Incremental Feature Computation: Daily updates to a small fraction
>>>    of features in wide tables.
>>>
>>> With the Iceberg V4 proposal introducing single-file commits and column
>>> stats improvements, this is an ideal time to address column-level updates
>>> to better support these use cases.
>>>
>>> I have drafted a proposal that explores both table-format enhancements
>>> and file-format (Parquet) changes to enable more efficient updates.
>>>
>>> Proposal Details:
>>> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
>>> - Design Document: Efficient Column Updates in Iceberg
>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
>>>
>>> Next Steps:
>>> I plan to create POCs to benchmark the approaches described in the
>>> document.
>>>
>>> Please review the proposal and share your feedback.
>>>
>>> Thanks,
>>> Anurag
>>>
>>
