> If this is correct, it aligns well with the current proposal and shouldn't
> introduce any additional complexity. I will add it to the discussion points
> for tomorrow's community sync.

Yes, this example aligns with what I was thinking (nit: "range" probably
wouldn't be a string, but I assume this was just for illustrative purposes).

> On the other hand, in the column family use case, splitting columns is a
> strict requirement for performance. I haven't considered how this would
> work, but perhaps we could introduce a table property for column families
> to make this explicit, and compaction jobs would have to respect it.

Yeah, I don't want to get into the exact mechanics for column families. I
was just calling out that compaction to the base file is not desirable in
all cases, so it shouldn't be assumed as a solution for small files.

Thanks,
Micah
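As an aside for readers skimming the quoted thread below: a minimal sketch
of how a reader could act on the row_range idea, assuming the packed update
file carries a base-row-position column whose per-row-group min/max
statistics are available. All names here are hypothetical, not part of the
proposal.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: choose the row groups of a packed update file that
    // can contain rows for one base file, using min/max statistics of a
    // row-position column. Not an Iceberg API.
    class RowRangePruning {
      record RowRange(long start, long end) {}  // inclusive base-row positions

      static List<Integer> rowGroupsToRead(long[] minPos, long[] maxPos, RowRange range) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < minPos.length; i++) {
          // Keep a row group only if its [minPos, maxPos] overlaps the range;
          // writers that pack one base file per row group make this exact.
          if (maxPos[i] >= range.start() && minPos[i] <= range.end()) {
            selected.add(i);
          }
        }
        return selected;
      }
    }

If a writer packs each base file's rows into dedicated row groups, the
overlap test selects exactly the needed groups and the "predicate"
degenerates to a positional slice.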
On Tue, Mar 3, 2026 at 3:11 PM Anurag Mantripragada <[email protected]> wrote:

> Hi Micah,
>
>> Could you expand on the complexity you think this introduces (or, more
>> specifically, the "significant" part)?
>
> I may have misunderstood your approach regarding packing row ranges. To
> clarify, is the following what you had in mind?
>
> Initially, we have base_file_1.parquet (rows 1-1000) and
> base_file_2.parquet (rows 1001-2000). If we update the "score" column
> across both files and pack those updates into a single larger file,
> packed_col_A.parquet, would the metadata structure look like this?
>
> {
>   "data_file_path": "base_file_1.parquet",
>   "column_updates": [
>     {
>       "field_id": 12,
>       "update_file_path": "packed_col_A.parquet",
>       "row_range": "0-1000"
>     }
>   ]
> },
> {
>   "data_file_path": "base_file_2.parquet",
>   "column_updates": [
>     {
>       "field_id": 12,
>       "update_file_path": "packed_col_A.parquet",
>       "row_range": "1001-2000"
>     }
>   ]
> }
>
> If this is correct, it aligns well with the current proposal and shouldn't
> introduce any additional complexity. I will add it to the discussion points
> for tomorrow's community sync.
>
>> This seems at odds with supporting column families in the future?
>
> In my opinion, there's a distinction between the use cases of column
> updates and column families. Column updates are designed for fast writes
> while maintaining reasonable read performance. Compaction is desirable to
> reduce the complexity of the read side, if any. On the other hand, in the
> column family use case, splitting columns is a strict requirement for
> performance. I haven't considered how this would work, but perhaps we could
> introduce a table property for column families to make this explicit, and
> compaction jobs would have to respect it.
>
> ~Anurag
>
> On Tue, Mar 3, 2026 at 12:02 PM Micah Kornfield <[email protected]> wrote:
>
>> Hi Anurag,
>>
>>> *Compaction and small files*: If I understand the row ranges idea
>>> correctly, packing multiple updates into larger column files would require
>>> matching ranges to base files based on predicates, which adds significant
>>> planning complexity. Regular compaction, which rewrites column files into
>>> the base file, seems more practical.
>>
>> Could you expand on the complexity you think this introduces (or, more
>> specifically, the "significant" part)? In this case the predicate should be
>> pretty simple (i.e. read rows between X and Y only) and can be done
>> efficiently via row group statistics. Smart writers could even partition
>> rows for a specific base file into their own row group/pages to make the
>> filter trivial.
>>
>>> Regular compaction, which rewrites column files into the base file,
>>> seems more practical.
>>
>> This seems at odds with supporting column families in the future?
>>
>> Thanks,
>> Micah
>>
>> On Tue, Mar 3, 2026 at 11:43 AM Anurag Mantripragada <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Sorry for the delayed response. I was on vacation and catching up.
>>> Thanks for the continued discussion on this topic.
>>>
>>> *Partial updates*: I agree that MoR-style row-level updates offer
>>> limited benefits beyond reducing the writing of irrelevant columns. For use
>>> cases like updating a subset of users, existing deletion vectors and the
>>> new V4 manifest delete vectors should perform well. Gabor's suggestion for
>>> file-level partial updates is a reasonable alternative, even with some
>>> write amplification.
>>>
>>> *Compaction and small files*: If I understand the row ranges idea
>>> correctly, packing multiple updates into larger column files would require
>>> matching ranges to base files based on predicates, which adds significant
>>> planning complexity. Regular compaction, which rewrites column files into
>>> the base file, seems more practical.
>>>
>>> *Column families*: While splitting columns into families is useful, the
>>> current design is more generic and already supports packing families into
>>> column files. Deciding how to group these columns (manually or via an
>>> engine) can be addressed in separate follow-up work.
>>>
>>> *Next steps:*
>>>
>>> - Gabor and I are developing a POC for metadata changes, focusing on
>>> reading and writing column files using Spark for integration. We will
>>> share more details soon.
>>> - I will update the doc in preparation for tomorrow's sync.
>>>
>>> As a reminder, we have an upcoming sync on column updates:
>>>
>>> Efficient column updates sync
>>> Wednesday, March 4 · 9:00 – 10:00am
>>> Time zone: America/Los_Angeles
>>> Google Meet joining info
>>> Video call link: https://meet.google.com/naf-tvvn-qup
>>>
>>> ~ Anurag
>>>
>>> On Wed, Feb 25, 2026 at 1:32 PM Gábor Kaszab <[email protected]> wrote:
>>>
>>>> Hey All,
>>>>
>>>> Nice to see the activity on this thread. Thanks to everyone who chimed in!
>>>>
>>>> Micah, I also feel that 1) (full column updates) and 2) (partial but
>>>> file-level column updates) could be a good middle ground between perf
>>>> improvement and keeping the code complexity low. In fact I had the chance
>>>> to experiment in this area, and the metadata + API part would be as simple
>>>> as in this PoC <https://github.com/apache/iceberg/pull/15445>. Just a
>>>> side note for 3): from the SQL aspect, I'm a bit hesitant about how
>>>> straightforward it is for users to write predicates that align with
>>>> file boundaries.
>>>> For deciding on partial column updates, we probably can't get away
>>>> without doing some measurements of how it compares to existing MoR. I have
>>>> it on my roadmap, so I'll share it once I have something.
>>>>
>>>> Wrapping multiple update files into one is an interesting idea. Let's
>>>> bring this up on the next sync! Additionally, full column updates could add
>>>> a huge overhead on the metadata files being created too (delete everything
>>>> + write everything with updates), unless we decide to do some manifest
>>>> rewrites/optimizations under the hood during the commit.
>>>>
>>>> Peter, column families as schema-like, table-metadata-level
>>>> information would definitely be useful.
>>>> It seems like a natural follow-up
>>>> of the column update work, but we have to keep in mind to choose a design
>>>> that won't prevent us from implementing a more general column families
>>>> concept (probably for inserts too).
>>>>
>>>> Best Regards,
>>>> Gabor
>>>>
>>>> On Sat, Feb 21, 2026 at 5:53 PM Micah Kornfield <[email protected]> wrote:
>>>>
>>>>> 1) and 3) are what I was thinking of as use-cases. I agree that unless
>>>>> there is a strong motivating use-case for MoR-style column updates, we
>>>>> should try to avoid this complexity and use the existing row-based MoR.
>>>>>
>>>>> One other idea I was trying to think through is the "small file
>>>>> problem" we would likely encounter for single-column additions/updates of
>>>>> fixed-width data. Would it make sense to add a record-range into the
>>>>> metadata for column families, so that we can pack column updates across
>>>>> files into reasonably sized files (similar to what we do for DVs today in
>>>>> puffin files)?
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On Mon, Feb 16, 2026 at 7:23 AM Gábor Kaszab <[email protected]> wrote:
>>>>>
>>>>>> Hey All,
>>>>>>
>>>>>> Thanks Anurag for the summary!
>>>>>>
>>>>>> I regret we don't have a recording for the sync, but I had the
>>>>>> impression that, even though there was a lengthy discussion about the
>>>>>> implementation requirements for partial updates, there wasn't a strong
>>>>>> consensus around the need, and there were no strong use cases to justify
>>>>>> partial updates either. Let me sum up where I see we are at now:
>>>>>>
>>>>>> *Scope of the updates*
>>>>>>
>>>>>> *1) Full column updates*
>>>>>> There is a consensus and common understanding that this use case
>>>>>> makes sense. If this was the only supported use-case, the implementation
>>>>>> would be relatively simple. We could guarantee there is no overlap in
>>>>>> column updates by deduplicating the field IDs in the column update
>>>>>> metadata. E.g. let's say we have a column update on columns {1,2} and we
>>>>>> write another column update for {2,3}: we can change the metadata for the
>>>>>> first one to only cover {1} and not {1,2}. With this, the write and the
>>>>>> read/stitching process is also straightforward (if we decide not to
>>>>>> support equality deletes together with column updates).
>>>>>>
>>>>>> Both row matching approaches could work here:
>>>>>> - row number matching update files, where we fill the deleted
>>>>>> rows with an arbitrary value (preferably null)
>>>>>> - sparse update files with some auxiliary column written into the
>>>>>> column update file, like row position in base file
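For illustration, the field-ID deduplication Gabor describes above could be
as small as this sketch (the map models "update file -> field ids it still
owns"; illustrative only, not an Iceberg API):

    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.Set;

    // Illustrative only: when a new column update arrives, strip its field
    // ids from all older updates, so each field id is owned by exactly one
    // (the newest) update file.
    class ColumnUpdateDedup {
      static void addUpdate(LinkedHashMap<String, Set<Integer>> updatesByFile,
                            String newFile, Set<Integer> newFieldIds) {
        for (Set<Integer> olderIds : updatesByFile.values()) {
          olderIds.removeAll(newFieldIds);  // e.g. {1,2} shrinks to {1} when {2,3} arrives
        }
        updatesByFile.values().removeIf(Set::isEmpty);  // fully shadowed files drop out
        updatesByFile.put(newFile, new HashSet<>(newFieldIds));
      }
    }

With this invariant, a reader resolves each field id to at most one update
file, which is what keeps the read/stitching path straightforward.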
>>>>>>
>>>>>> *2) Partial column updates (row-level)*
>>>>>> I see 2 use cases mentioned for this: bug-fixing a subset of rows, and
>>>>>> updating features for active users.
>>>>>> My initial impression here is that whether to use column updates or
>>>>>> not heavily depends on the selectivity of the partial update queries. I'm
>>>>>> sure there is a percentage of affected rows below which it's simply
>>>>>> better to use the traditional row-level updates (CoW/MoR). I'm not
>>>>>> entirely convinced that covering these scenarios is worth the extra
>>>>>> complexity here:
>>>>>> - We can't deduplicate the column updates by field IDs on the
>>>>>> metadata side.
>>>>>> - We have two options for writers:
>>>>>>   - Merge the existing column update files themselves when writing a
>>>>>> new one with an overlap of field IDs. No need to sort out the
>>>>>> different column update files and merge them on the read side, but there
>>>>>> is overhead on the write side.
>>>>>>   - Don't bother merging existing column updates when writing a
>>>>>> new one. This adds overhead on the read side.
>>>>>>
>>>>>> Handling of sparse update files is a must here, with the chance for
>>>>>> optimisation if all the rows are covered by the update file, as Micah
>>>>>> suggested.
>>>>>>
>>>>>> To sum up, I think to justify this approach we need strong
>>>>>> use-cases and measurements to verify that the extra complexity yields
>>>>>> convincingly better results compared to existing CoW/MoR approaches.
>>>>>>
>>>>>> *3) Partial column updates (file-level)*
>>>>>> This option wasn't brought up during our conversation but might be
>>>>>> worth considering. This is basically a middle ground between the above two
>>>>>> approaches. Partial updates are allowed as long as they affect entire data
>>>>>> files, and it's allowed to only cover a subset of the files. One use-case
>>>>>> would be to do column updates per partition, for instance.
>>>>>>
>>>>>> With this approach the metadata representation could be as simple as
>>>>>> in 1), where we can deduplicate the update files by field IDs. Also, there
>>>>>> is no write and read overhead on top of 1), apart from the verification
>>>>>> step to ensure that the WHERE filter on the update is doing the split on
>>>>>> file boundaries.
>>>>>> Also, similarly to 1), sparse update files aren't a must here; we
>>>>>> could consider row-matching update files too.
>>>>>>
>>>>>> *Row alignment*
>>>>>> Sparse update files are required for row-level partial updates, but
>>>>>> if we decide to go with any of the other options we could also evaluate
>>>>>> the "row count matching" approach too. Even though it requires filling the
>>>>>> missing rows with arbitrary values (null seems a good candidate), it would
>>>>>> result in less write overhead (no need to write row positions) and less
>>>>>> read overhead (no need to join rows by row position), which could be worth
>>>>>> the inconvenience of having 'invalid' but inaccessible values in the files.
>>>>>> The num-nulls stats being off is a good argument against this, but I think
>>>>>> we could have a way of fixing this too by keeping track of how many rows
>>>>>> were deleted (and subtracting this value from the num-nulls counter
>>>>>> returned by the writer).
>>>>>>
>>>>>> *Next steps*
>>>>>> I'm actively working on a very basic PoC implementation where we
>>>>>> would be able to test the different approaches, comparing pros and cons, so
>>>>>> that we can make a decision on the above questions. I'll sync with Anurag
>>>>>> on this and will let you know once we have something.
>>>>>>
>>>>>> Best Regards,
>>>>>> Gabor
>>>>>>
>>>>>> On Sat, Feb 14, 2026 at 2:20 AM Micah Kornfield <[email protected]> wrote:
>>>>>>
>>>>>>>> Given that, the sparse representation with alignment at read time
>>>>>>>> (using dummy/null values) seems to provide the benefits of both efficient
>>>>>>>> vectorized reads and stitching as well as support for partial column
>>>>>>>> updates. Would you agree?
>>>>>>>
>>>>>>> Thinking more about it, I think the sparse approach is actually a
>>>>>>> superset approach, so it is not a concern. If writers want, they can
>>>>>>> write out the fully populated columns with position indexes from 1 to N,
>>>>>>> and readers can take an optimized path if they detect the number of rows
>>>>>>> in the update is equal to the number of base rows.
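A rough sketch of that read-side choice, assuming sparse update files carry
a base-row-position column; illustrative only, not an Iceberg API:

    // Illustrative only: stitch one updated column onto a base file's rows.
    class ColumnStitcher {
      static Object[] stitchColumn(int baseRowCount, long[] updatePositions,
                                   Object[] updateValues) {
        Object[] column = new Object[baseRowCount];
        if (updatePositions.length == baseRowCount) {
          // Dense fast path: the update covers every row, so positions are
          // assumed to be 0..N-1 in order and no join is needed.
          System.arraycopy(updateValues, 0, column, 0, baseRowCount);
        } else {
          // Sparse path: scatter by base row position; untouched slots stay null.
          for (int i = 0; i < updatePositions.length; i++) {
            column[(int) updatePositions[i]] = updateValues[i];
          }
        }
        return column;
      }
    }

The dense branch also covers the "row count matching" alternative discussed
above: a writer that fills deleted slots with nulls always produces an
update whose row count matches the base file.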
>>>>>>>
>>>>>>> I still think there is a question of what writers should do (i.e.,
>>>>>>> when do they decide to duplicate data instead of trying to give sparse
>>>>>>> updates), but that is an implementation question and not necessarily
>>>>>>> something that needs to block spec work.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Micah
>>>>>>>
>>>>>>> On Fri, Feb 13, 2026 at 11:29 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Micah,
>>>>>>>>
>>>>>>>>> This seems like a classic MoR vs CoW trade-off. But it seems like
>>>>>>>>> maybe both sparse and full should be available (I understand this adds
>>>>>>>>> complexity). For adding a new column or completely updating an existing
>>>>>>>>> column, the performance would be better to prefill the data
>>>>>>>>
>>>>>>>> Our internal use cases are very similar to what you describe. We
>>>>>>>> primarily deal with full column updates. However, the feedback on the
>>>>>>>> proposal from the wider community indicated that partial updates (e.g.,
>>>>>>>> bug-fixing a subset of rows, updating features for active users) are
>>>>>>>> also a very common and critical use case.
>>>>>>>>
>>>>>>>>> Is there evidence to say that partial column updates are more
>>>>>>>>> common in practice than full rewrites?
>>>>>>>>
>>>>>>>> Personally, I don't have hard data on which use case is more common
>>>>>>>> in the wild, only that both appear to be important. I also agree that a
>>>>>>>> good long-term solution should support both strategies. Given that, the
>>>>>>>> sparse representation with alignment at read time (using dummy/null
>>>>>>>> values) seems to provide the benefits of both efficient vectorized reads
>>>>>>>> and stitching as well as support for partial column updates. Would you
>>>>>>>> agree?
>>>>>>>>
>>>>>>>> ~ Anurag
>>>>>>>>
>>>>>>>> On Fri, Feb 13, 2026 at 9:33 AM Micah Kornfield <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Anurag,
>>>>>>>>>
>>>>>>>>>> Data Representation: Sparse column files are preferred for
>>>>>>>>>> compact representation and are better suited for partial column
>>>>>>>>>> updates. We can optimize sparse representation for vectorized reads by
>>>>>>>>>> filling in null or default values at read time for missing positions
>>>>>>>>>> from the base file, which avoids joins during reads.
>>>>>>>>>
>>>>>>>>> This seems like a classic MoR vs CoW trade-off. But it seems like
>>>>>>>>> maybe both sparse and full should be available (I understand this adds
>>>>>>>>> complexity). For adding a new column or completely updating an existing
>>>>>>>>> column, the performance would be better to prefill the data (otherwise
>>>>>>>>> one ends up duplicating the work that is already happening under the
>>>>>>>>> hood in Parquet).
>>>>>>>>>
>>>>>>>>> Is there evidence to say that partial column updates are more
>>>>>>>>> common in practice than full rewrites?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>> On Thu, Feb 12, 2026 at 3:32 AM Eduard Tudenhöfner <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Anurag,
>>>>>>>>>>
>>>>>>>>>> I wasn't able to make it to the sync but was hoping to watch the
>>>>>>>>>> recording afterwards.
>>>>>>>>>> I'm curious what the reasons were for discarding the
>>>>>>>>>> Parquet-native approach. Could you please share a summary of what
>>>>>>>>>> was discussed in the sync on that topic?
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 10, 2026 at 8:20 PM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Thank you for attending today's sync. Please find the meeting
>>>>>>>>>>> notes below. I apologize that we were unable to record the session
>>>>>>>>>>> due to attendees not having record access.
>>>>>>>>>>>
>>>>>>>>>>> Key updates and discussion points:
>>>>>>>>>>>
>>>>>>>>>>> *Decisions:*
>>>>>>>>>>>
>>>>>>>>>>> - Table Format vs. Parquet: There is a general consensus
>>>>>>>>>>> that column update support should reside in the table format.
>>>>>>>>>>> Consequently, we have discarded the Parquet-native approach.
>>>>>>>>>>> - Metadata Representation: To maintain clean metadata and
>>>>>>>>>>> avoid complex resolution logic for readers, the goal is to keep only one
>>>>>>>>>>> metadata file per column. However, achieving this is challenging if we
>>>>>>>>>>> support partial updates, as multiple column files may exist for the same
>>>>>>>>>>> column (see open questions).
>>>>>>>>>>> - Data Representation: Sparse column files are preferred for
>>>>>>>>>>> compact representation and are better suited for partial column updates.
>>>>>>>>>>> We can optimize sparse representation for vectorized reads by filling in
>>>>>>>>>>> null or default values at read time for missing positions from the base
>>>>>>>>>>> file, which avoids joins during reads.
>>>>>>>>>>>
>>>>>>>>>>> *Open Questions:*
>>>>>>>>>>>
>>>>>>>>>>> - We are still determining what restrictions are necessary
>>>>>>>>>>> when supporting partial updates. For instance, we need to decide whether
>>>>>>>>>>> to allow adding a new column and subsequently applying partial updates to
>>>>>>>>>>> it. This would involve managing both a base column file and subsequent
>>>>>>>>>>> update files.
>>>>>>>>>>> - We need a better understanding of the use cases for partial updates.
>>>>>>>>>>> - We need to further discuss the handling of equality deletes.
>>>>>>>>>>>
>>>>>>>>>>> If I missed anything, or if others took notes, please share them
>>>>>>>>>>> here. Thanks!
>>>>>>>>>>>
>>>>>>>>>>> I will go ahead and update the doc with what we have discussed
>>>>>>>>>>> so we can continue next time from where we left off.
>>>>>>>>>>>
>>>>>>>>>>> ~ Anurag
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 9, 2026 at 11:55 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> This design
>>>>>>>>>>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
>>>>>>>>>>>> will be discussed tomorrow in a dedicated sync.
>>>>>>>>>>>>
>>>>>>>>>>>> Efficient column updates sync
>>>>>>>>>>>> Tuesday, February 10 · 9:00 – 10:00am
>>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>>> Video call link: https://meet.google.com/xsd-exug-tcd
>>>>>>>>>>>>
>>>>>>>>>>>> ~ Anurag
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Feb 6, 2026 at 8:30 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Gabor,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the detailed example.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with Steven that Option 2 seems reasonable. I will add
>>>>>>>>>>>>> a section to the design doc regarding equality delete handling, and we can
>>>>>>>>>>>>> discuss this further during our meeting on Tuesday.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ~Anurag
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Feb 6, 2026 at 7:08 AM Steven Wu <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> > 1) When deleting with eq-deletes: If there is a column
>>>>>>>>>>>>>> update on the equality-field ID we use for the delete, reject deletion
>>>>>>>>>>>>>> > 2) When adding a column update on a column that is part of
>>>>>>>>>>>>>> the equality field IDs in some delete, we reject the column update
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Gabor, this is a good scenario. The 2nd option makes sense to
>>>>>>>>>>>>>> me, since equality ids are like primary key fields. If we have the 2nd
>>>>>>>>>>>>>> rule enforced, the first option is not applicable anymore.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 6, 2026 at 3:13 AM Gábor Kaszab <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for the proposal, Anurag! I made a pass recently
>>>>>>>>>>>>>>> and I think there is some interference between column updates and
>>>>>>>>>>>>>>> equality deletes. Let me describe it below.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Steps:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> CREATE TABLE tbl (int a, int b);
>>>>>>>>>>>>>>> INSERT INTO tbl VALUES (1, 11), (2, 22);  -- creates the base data file
>>>>>>>>>>>>>>> DELETE FROM tbl WHERE b=11;               -- creates an equality delete file
>>>>>>>>>>>>>>> UPDATE tbl SET b=11;                      -- writes column update
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> SELECT * FROM tbl;
>>>>>>>>>>>>>>> Expected result: (2, 11)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Data and metadata created after the above steps:
>>>>>>>>>>>>>>> - Base file: (1, 11), (2, 22), seqnum=1
>>>>>>>>>>>>>>> - EQ-delete: b=11, seqnum=2
>>>>>>>>>>>>>>> - Column update: field ids: [field_id_for_col_b], seqnum=3,
>>>>>>>>>>>>>>> data file content: (dummy_value), (11)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Read steps:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Stitch base file with column updates in the reader:
>>>>>>>>>>>>>>> rows: (1, dummy_value), (2, 11) (note: the dummy value can be
>>>>>>>>>>>>>>> either null or 11, see the proposal for more details);
>>>>>>>>>>>>>>> seqnum for base file = 1; seqnum for column update = 3
>>>>>>>>>>>>>>> 2. Apply eq-delete b=11, seqnum=2, on the stitched result
>>>>>>>>>>>>>>> 3. The query result depends on which seqnum we carry forward
>>>>>>>>>>>>>>> to compare with the eq-delete's seqnum, but it's not correct in any
>>>>>>>>>>>>>>> of the cases:
>>>>>>>>>>>>>>> 1. Use seqnum from the base file: we get either an empty
>>>>>>>>>>>>>>> result if 'dummy_value' is 11, or we get (1, null) otherwise
>>>>>>>>>>>>>>> 2. Use seqnum from the last update file: we don't delete
>>>>>>>>>>>>>>> any rows, and the result set is (1, dummy_value), (2, 11)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Problem:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The eq-delete should be applied midway through applying the
>>>>>>>>>>>>>>> column updates to the base file, based on sequence number, during the
>>>>>>>>>>>>>>> stitching process. If I'm not mistaken, this is not feasible with the
>>>>>>>>>>>>>>> way readers work.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Proposal:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Don't allow equality deletes together with column updates.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) When deleting with eq-deletes: If there is a column
>>>>>>>>>>>>>>> update on the equality-field ID we use for the delete, reject deletion
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) When adding a column update on a column that is part of
>>>>>>>>>>>>>>> the equality field IDs in some delete, we reject the column update
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Alternatively, column updates could be controlled by an
>>>>>>>>>>>>>>> (immutable) table property, and eq-deletes would be rejected if the
>>>>>>>>>>>>>>> property indicates column updates are turned on for the table.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let me know what you think!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Gabor
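For illustration, Gabor's two rejection rules come down to a disjointness
check at commit time. This sketch assumes the commit path can enumerate live
equality-delete field ids and live column-update field ids; the names are
invented, not Iceberg APIs:

    import java.util.Collections;
    import java.util.Set;

    // Illustrative only: commit-time checks for the two rules above.
    class ColumnUpdateConflicts {
      // Rule 1: reject an equality delete that touches an updated column.
      static void validateEqualityDelete(Set<Integer> deleteEqualityIds,
                                         Set<Integer> liveColumnUpdateIds) {
        if (!Collections.disjoint(deleteEqualityIds, liveColumnUpdateIds)) {
          throw new IllegalStateException(
              "equality delete uses a field with live column updates");
        }
      }

      // Rule 2: reject a column update on a field used by live eq-deletes.
      static void validateColumnUpdate(Set<Integer> updateFieldIds,
                                       Set<Integer> liveEqualityDeleteIds) {
        if (!Collections.disjoint(updateFieldIds, liveEqualityDeleteIds)) {
          throw new IllegalStateException(
              "column update touches a field used by live equality deletes");
        }
      }
    }

As Steven notes above, once the second check is enforced, the first one
becomes redundant in practice.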
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 28, 2026 at 3:31 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you everyone for the initial review comments. It is
>>>>>>>>>>>>>>>> exciting to see so much interest in this proposal.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am currently reviewing and responding to each comment.
>>>>>>>>>>>>>>>> The general themes of the feedback so far include:
>>>>>>>>>>>>>>>> - Including partial updates (column updates on a subset of rows in a table).
>>>>>>>>>>>>>>>> - Adding details on how SQL engines will write the update files.
>>>>>>>>>>>>>>>> - Adding details on split planning and row alignment for update files.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I will think through these points and update the design accordingly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Anurag
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jan 27, 2026 at 6:25 PM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Xianjin,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Happy to learn from your experience in supporting
>>>>>>>>>>>>>>>>> backfill use-cases. Please feel free to review the proposal and add your
>>>>>>>>>>>>>>>>> comments. I will wait for a couple more days to ensure everyone has a
>>>>>>>>>>>>>>>>> chance to review the proposal.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ~ Anurag
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Jan 27, 2026 at 6:42 AM Xianjin Ye <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Anurag and Peter,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It's great to see the partial column update has gained
>>>>>>>>>>>>>>>>>> great interest in the community. I internally built a BackfillColumns
>>>>>>>>>>>>>>>>>> action to efficiently backfill columns (by writing the partial columns
>>>>>>>>>>>>>>>>>> only and copying the binary data of the other columns into a new
>>>>>>>>>>>>>>>>>> DataFile). The speedup could be 10x for wide tables, but the write
>>>>>>>>>>>>>>>>>> amplification is still there. I would be happy to collaborate on the work
>>>>>>>>>>>>>>>>>> and eliminate the write amplification.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 2026/01/27 10:12:54 Péter Váry wrote:
>>>>>>>>>>>>>>>>>> > Hi Anurag,
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > It's great to see how much interest there is in the community around this
>>>>>>>>>>>>>>>>>> > potential new feature. Gábor and I have actually submitted an Iceberg
>>>>>>>>>>>>>>>>>> > Summit talk proposal on this topic, and we would be very happy to
>>>>>>>>>>>>>>>>>> > collaborate on the work. I was mainly waiting for the File Format API to be
>>>>>>>>>>>>>>>>>> > finalized, as I believe this feature should build on top of it.
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > For reference, our related work includes:
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > - *Dev list thread:*
>>>>>>>>>>>>>>>>>> > https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9
>>>>>>>>>>>>>>>>>> > - *Proposal document:*
>>>>>>>>>>>>>>>>>> > https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww
>>>>>>>>>>>>>>>>>> > (not shared widely yet)
>>>>>>>>>>>>>>>>>> > - *Performance testing PR for readers and writers:*
>>>>>>>>>>>>>>>>>> > https://github.com/apache/iceberg/pull/13306
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > During earlier discussions about possible metadata changes, another option
>>>>>>>>>>>>>>>>>> > came up that hasn't been documented yet: separating planner metadata from
>>>>>>>>>>>>>>>>>> > reader metadata. Since the planner does not need to know about the actual
>>>>>>>>>>>>>>>>>> > files, we could store the file composition in a separate file (potentially
>>>>>>>>>>>>>>>>>> > a Puffin file). This file could hold the column_files metadata, while the
>>>>>>>>>>>>>>>>>> > manifest would reference the Puffin file and blob position instead of the
>>>>>>>>>>>>>>>>>> > data filename.
>>>>>>>>>>>>>>>>>> > This approach has the advantage of keeping the existing metadata largely
>>>>>>>>>>>>>>>>>> > intact, and it could also give us a natural place later to add file-level
>>>>>>>>>>>>>>>>>> > indexes or Bloom filters for use during reads or secondary filtering. The
>>>>>>>>>>>>>>>>>> > downsides are the additional files and the increased complexity of
>>>>>>>>>>>>>>>>>> > identifying files that are no longer referenced by the table, so this may
>>>>>>>>>>>>>>>>>> > not be an ideal solution.
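Purely to visualize the planner/reader split Peter describes, one
hypothetical shape for the two metadata levels (invented for illustration;
nothing like this exists in the spec):

    import java.util.List;

    // Hypothetical shapes only, to illustrate the idea above.
    class PlannerReaderSplit {
      // What the manifest would carry: a pointer, not the composition itself.
      record ColumnUpdateRef(String puffinPath, long blobOffset, long blobLength) {}
      // What the Puffin blob would carry: the per-column file composition.
      record ColumnFileBlob(List<ColumnFileEntry> columnFiles) {}
      record ColumnFileEntry(int fieldId, String updateFilePath, long rowCount) {}
    }

The planner never opens the Puffin file; only readers resolve the blob.
That is what would keep the existing metadata "largely intact", at the cost
Peter notes of extra files and harder orphan-file detection.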
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > I do have some concerns about the MoR metadata proposal described in the
>>>>>>>>>>>>>>>>>> > document. At first glance, it seems to complicate distributed planning, as
>>>>>>>>>>>>>>>>>> > all entries for a given file would need to be collected and merged to
>>>>>>>>>>>>>>>>>> > provide the information required by both the planner and the reader.
>>>>>>>>>>>>>>>>>> > Additionally, when a new column is added or updated, we would still need to
>>>>>>>>>>>>>>>>>> > add a new metadata entry for every existing data file. If we immediately
>>>>>>>>>>>>>>>>>> > write out the merged metadata, the total number of entries remains the
>>>>>>>>>>>>>>>>>> > same. The main benefit is avoiding rewriting statistics, which can be
>>>>>>>>>>>>>>>>>> > significant, but this comes at the cost of increased planning complexity.
>>>>>>>>>>>>>>>>>> > If we choose to store the merged statistics in the column_families entry, I
>>>>>>>>>>>>>>>>>> > don't see much benefit in excluding the rest of the metadata, especially
>>>>>>>>>>>>>>>>>> > since including it would simplify the planning process.
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > As Anton already pointed out, we should also discuss how this change would
>>>>>>>>>>>>>>>>>> > affect split handling, particularly how to avoid double reads when row
>>>>>>>>>>>>>>>>>> > groups are not aligned between the original data files and the new column
>>>>>>>>>>>>>>>>>> > files.
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > Finally, I'd like to see some discussion around the Java API implications.
>>>>>>>>>>>>>>>>>> > In particular, what API changes are required, and how SQL engines would
>>>>>>>>>>>>>>>>>> > perform updates. Since the new column files must have the same number of
>>>>>>>>>>>>>>>>>> > rows as the original data files, with a strict one-to-one relationship, SQL
>>>>>>>>>>>>>>>>>> > engines would need access to the source filename, position, and deletion
>>>>>>>>>>>>>>>>>> > status in the DataFrame in order to generate the new files. This is more
>>>>>>>>>>>>>>>>>> > involved than a simple update and deserves some explicit consideration.
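On that last point: Iceberg's Spark integration already exposes reserved
metadata columns such as _file and _pos, which is roughly the information
Peter says engines would need. A hedged sketch of the read side of such an
update; the table name and the recompute_score UDF are made up, and the
actual update-file writing is omitted:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.expr;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    class UpdateFileInputs {
      // Read base rows with their source file and position, compute the new
      // column value, and cluster the result so one update file could be
      // written per base file, in base-file row order.
      static Dataset<Row> planUpdates(SparkSession spark) {
        return spark.read()
            .format("iceberg")
            .load("db.wide_table")              // hypothetical table name
            .selectExpr("_file", "_pos", "id")  // Iceberg reserved metadata columns
            .withColumn("score", expr("recompute_score(id)"))  // hypothetical UDF
            .repartition(col("_file"))
            .sortWithinPartitions("_pos");
      }
    }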
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > Looking forward to your thoughts.
>>>>>>>>>>>>>>>>>> > Best regards,
>>>>>>>>>>>>>>>>>> > Peter
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > On Tue, Jan 27, 2026, 03:58 Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > > Thanks Anton and others, for providing some initial feedback. I will
>>>>>>>>>>>>>>>>>> > > address all your comments soon.
>>>>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>>>>> > > On Mon, Jan 26, 2026 at 11:10 AM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>>>>> > >> I had a chance to see the proposal before it landed, and I think it is a
>>>>>>>>>>>>>>>>>> > >> cool idea; both presented approaches would likely work. I am looking
>>>>>>>>>>>>>>>>>> > >> forward to discussing the tradeoffs and would encourage everyone to
>>>>>>>>>>>>>>>>>> > >> push/polish each approach to see what issues can be mitigated and what
>>>>>>>>>>>>>>>>>> > >> are fundamental.
>>>>>>>>>>>>>>>>>> > >>
>>>>>>>>>>>>>>>>>> > >> [1] Iceberg-native approach: better visibility into column files from the
>>>>>>>>>>>>>>>>>> > >> metadata, potentially better concurrency for non-overlapping column
>>>>>>>>>>>>>>>>>> > >> updates, no dep on Parquet.
>>>>>>>>>>>>>>>>>> > >> [2] Parquet-native approach: almost no changes to the table format
>>>>>>>>>>>>>>>>>> > >> metadata beyond tracking of base files.
>>>>>>>>>>>>>>>>>> > >>
>>>>>>>>>>>>>>>>>> > >> I think [1] sounds a bit better on paper, but I am worried about the
>>>>>>>>>>>>>>>>>> > >> complexity in writers and readers (especially around keeping row groups
>>>>>>>>>>>>>>>>>> > >> aligned and split planning). It would be great to cover this in detail in
>>>>>>>>>>>>>>>>>> > >> the proposal.
>>>>>>>>>>>>>>>>>> > >>
>>>>>>>>>>>>>>>>>> > >> On Mon, Jan 26, 2026 at 9:00 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> > >>
>>>>>>>>>>>>>>>>>> > >>> Hi all,
>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>> > >>> "Wide tables" with thousands of columns present significant challenges
>>>>>>>>>>>>>>>>>> > >>> for AI/ML workloads, particularly when only a subset of columns needs to
>>>>>>>>>>>>>>>>>> > >>> be added or updated. Current Copy-on-Write (COW) and Merge-on-Read (MOR)
>>>>>>>>>>>>>>>>>> > >>> operations in Iceberg apply at the row level, which leads to substantial
>>>>>>>>>>>>>>>>>> > >>> write amplification in scenarios such as:
>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>> > >>> - Feature Backfilling & Column Updates: Adding new feature columns
>>>>>>>>>>>>>>>>>> > >>> (e.g., model embeddings) to petabyte-scale tables.
>>>>>>>>>>>>>>>>>> > >>> - Model Score Updates: Refreshing prediction scores after retraining.
>>>>>>>>>>>>>>>>>> > >>> - Embedding Refresh: Updating vector embeddings, which currently
>>>>>>>>>>>>>>>>>> > >>> triggers a rewrite of the entire row.
>>>>>>>>>>>>>>>>>> > >>> - Incremental Feature Computation: Daily updates to a small fraction
>>>>>>>>>>>>>>>>>> > >>> of features in wide tables.
>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>> > >>> With the Iceberg V4 proposal introducing single-file commits and column
>>>>>>>>>>>>>>>>>> > >>> stats improvements, this is an ideal time to address column-level updates
>>>>>>>>>>>>>>>>>> > >>> to better support these use cases.
>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>> > >>> I have drafted a proposal that explores both table-format enhancements
>>>>>>>>>>>>>>>>>> > >>> and file-format (Parquet) changes to enable more efficient updates.
>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>> > >>> Proposal Details:
>>>>>>>>>>>>>>>>>> > >>> - GitHub Issue: #15146 <https://github.com/apache/iceberg/issues/15146>
>>>>>>>>>>>>>>>>>> > >>> - Design Document: Efficient Column Updates in Iceberg
>>>>>>>>>>>>>>>>>> > >>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>> > >>> Next Steps:
>>>>>>>>>>>>>>>>>> > >>> I plan to create POCs to benchmark the approaches described in the
>>>>>>>>>>>>>>>>>> > >>> document.
>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>> > >>> Please review the proposal and share your feedback.
>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>> > >>> Thanks,
>>>>>>>>>>>>>>>>>> > >>> Anurag
