Hey Iceberg Community,

I discussed this with Amogh today, and it turns out I was trying to solve
change detection at a lower level (per column file) than necessary. Based
on our conversation, here is the updated proposal for the metadata structure
for column updates:
Tracking
...
optional latest_column_file_sequence_number long
optional latest_column_file_snapshot_id long

ColumnFile
required field_ids list<int>
required location string
required file_size_in_bytes long
Note that there is no need to track status, sequence numbers, or snapshot
IDs per column file, as we see it currently.

How change detection would work:
Whenever a column file is added to an entry (TrackedFile), the old entry
would be part of the new snapshot with REPLACED status, and there would be a
new entry with MODIFIED status that contains the added column file.
Matching the REPLACED and MODIFIED entries by location then shows the
differences in the column list. In addition,
Tracking.latest_column_file_snapshot_id == current_snapshot_id() signals
that we should take a look at the column files to see the changes.
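
The change detection described above could be sketched roughly as follows.
This is only an illustration, not Iceberg code: the Python classes mirror
the proposed ColumnFile and TrackedFile structures, the helper name
column_file_changes and the flat list of entries are hypothetical, and it
assumes that "column list" means the entry's list of column files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class ColumnFile:
    # Mirrors the proposed ColumnFile struct; a tuple keeps it hashable.
    field_ids: Tuple[int, ...]
    location: str
    file_size_in_bytes: int


@dataclass
class TrackedFile:
    # Simplified stand-in for a TrackedFile entry plus its Tracking info.
    location: str
    status: str  # e.g. "EXISTING", "REPLACED", "MODIFIED"
    latest_column_file_snapshot_id: Optional[int] = None
    column_files: List[ColumnFile] = field(default_factory=list)


Changes = Tuple[Set[ColumnFile], Set[ColumnFile]]


def column_file_changes(
    entries: List[TrackedFile], current_snapshot_id: int
) -> Dict[str, Changes]:
    """Return {base file location: (added, removed)} column files."""
    replaced = {e.location: e for e in entries if e.status == "REPLACED"}
    changes: Dict[str, Changes] = {}
    for entry in entries:
        if entry.status != "MODIFIED":
            continue
        # Only look at column files if they changed in this snapshot.
        if entry.latest_column_file_snapshot_id != current_snapshot_id:
            continue
        old = replaced.get(entry.location)
        if old is None:
            continue
        before, after = set(old.column_files), set(entry.column_files)
        changes[entry.location] = (after - before, before - after)
    return changes


# Hypothetical snapshot 7: a column file for field IDs [1, 2] is added.
update1 = ColumnFile((1, 2), "update1.parquet", 1024)
entries = [
    TrackedFile("file.parquet", "REPLACED"),
    TrackedFile("file.parquet", "MODIFIED", 7, [update1]),
    TrackedFile("other.parquet", "EXISTING"),
]
changes = column_file_changes(entries, current_snapshot_id=7)
```

In this sketch a column file whose field_ids list was reduced shows up as
one removed and one added item with the same location, so the removed
field IDs can be derived by comparing the two.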

@amogh, let me know if I missed something!

Any feedback is appreciated!
Gabor


Gábor Kaszab <[email protected]> wrote (on Fri, May 22, 2026,
16:48):

> Thank you for taking a look and getting back with your thoughts, Steven!
>
> You're right, we shouldn't repurpose the semantics of existing fields. The
> idea to keep track of *latest_column_file_sequence_number* as you
> described in the doc makes sense to me. Going with this design, we won't
> need to keep sequence numbers on a column file level (unless we want to
> know the order they were added, but I don't see a use case for that).
>
> I see the following metadata structure changes:
> Tracking
> ...
> optional latest_column_file_sequence_number long
> ColumnFile
> required field_ids list<int>
> required location string
> required file_size_in_bytes long
> required column_file_tracking ColumnFileTracking
> ColumnFileTracking
> required status int
> optional snapshot_id long
> optional removed_field_ids list<int>
> ColumnFileTracking.status could have the values {ADDED, EXISTING, DELETED,
> REPLACED}, similarly to TrackedFile. With this, we could tell exactly
> what changed in the column files simply by looking at the column file
> metadata. See details in my first mail in this thread.
>
> Would be nice to hear further feedback on this!
> Best Regards,
> Gabor Kaszab
>
> Steven Wu <[email protected]> wrote (on Thu, May 21, 2026,
> 23:55):
>
>> Gabor, thanks for starting this discussion.
>>
>> I have been thinking about this problem independently since the column
>> update sync. Here is the detailed design document
>> <https://docs.google.com/document/d/160-FizR6zOASMb86NycfgCm7cZbh6HK7FLcUW8_Xp-0/edit?usp=sharing>
>> .
>>
>> Gabor, I read your section of How to support
>> _last_updated_sequence_number
>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.xvm52pv4m7lq#heading=h.k2neg79ocgu>.
>> If I understand correctly, the proposal is to repurpose
>> file_sequence_number to capture the snapshot sequence number of the latest
>> column file. I suggest we don't change the semantics of the existing
>> file_sequence_number. Instead, we can introduce a new
>> latest_column_file_sequence_number field in the tracking struct. My doc
>> described the reasoning.
>>
>> That is the only real difference as far as I can tell. Otherwise, I think
>> we had the same idea/design.
>>
>> On Wed, May 20, 2026 at 11:56 AM Gábor Kaszab <[email protected]>
>> wrote:
>>
>>> Hey Iceberg Community,
>>>
>>> Anurag started a separate, focused discussion
>>> <https://lists.apache.org/thread/jbh1gbrso5h6l4by9rh9poy2cjjtb8j0> on
>>> the column update file representation. Similarly, let me start another one
>>> for the metadata representation. Hopefully, we can make some iterations on
>>> this before the next sync.
>>>
>>> We covered this topic in the sync yesterday and agreed on some of the
>>> fields, but we left the "tracking" information part open. The *required*
>>> fields we agreed on so far:
>>>
>>> ColumnFile
>>> field_ids list<int>
>>> location string
>>> file_size_in_bytes long
>>>
>>> *Tracking information*
>>> In addition to the above, we discussed the need for tracking
>>> information. These are the potential fields:
>>>
>>> *1) Sequence number*
>>>
>>>    - Usage for _last_updated_sequence_number
>>>
>>> I did think about how to produce _last_updated_sequence_number and I
>>> think technically we don't need to store the sequence number on the update
>>> file level for that. I wrote up the steps here
>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?pli=1&tab=t.xvm52pv4m7lq>,
>>> but in a nutshell: we could either fill that from the
>>> _last_updated_sequence_number written into the latest column file, or if
>>> null we can use the base file's file_sequence_number.
>>>
>>>    - Usage for equality deletes
>>>
>>> As we agreed previously, we don't want to support update files together
>>> with equality deletes, so we won't need to store column file level sequence
>>> numbers for this either.
>>>
>>>    - Usage for CDC, observability, etc.
>>>
>>> I'm wondering if there is any use case where we need to know the order
>>> in which the column updates were created. If this matters for CDC,
>>> reproducibility, or anything else, then let's have a column file
>>> level sequence number too; if not, we can omit this.
>>>
>>> *2) Status*
>>> I think, similarly to TrackedFile, we need the following statuses here:
>>> EXISTING, ADDED, DELETED, REPLACED
>>> With these, when the base file's status is REPLACED, taking a look at
>>> the column_files we can know exactly what has changed wrt the column
>>> updates. Some examples to demonstrate:
>>>
>>> Step 1: Start with an existing base file:
>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: 1,
>>> status: EXISTING, column_files:[]}
>>>
>>> Step 2: Adding a column update for field IDs [1, 2]:
>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *2*,
>>> status: *REPLACED*,
>>>                 column_files: [ *{field_ids: [1, 2], location:
>>> "update1.parquet", status: ADDED}* ]}
>>>
>>> Step 3: Adding an overlapping column update with field IDs [2, 3]
>>> ("de-duplicate" field IDs):
>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*,
>>> status: REPLACED,
>>>                 column_files: [ {field_ids: *[1],* location:
>>> "update1.parquet", status: *REPLACED}, **{field_ids: [2, 3], location:
>>> "update2.parquet", status: ADDED}* ]}
>>>
>>> Step 4: Add another column update for field ID [1] to completely
>>> eliminate one previous update file from metadata
>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *4*,
>>> status: REPLACED,
>>>                 column_files: [ {field_ids: [1]*,* location:
>>> "update1.parquet", status: *DELETED},*  {field_ids: [2, 3], location:
>>> "update2.parquet", status: *EXISTING*}, *{field_ids: [1], location:
>>> "update3.parquet", status: ADDED}* ]}
>>>
>>> *Thoughts on REPLACED*
>>> In step 3, we marked the existing column file as REPLACED while reducing
>>> the field_ids list to de-duplicate them with the incoming update
>>> file's field_ids. With this, REPLACED indicates that field_ids content was
>>> reduced, however, we won't know exactly what field IDs were removed.
>>>
>>>   - Alternative approach 1:
>>> We could use DELETED status leaving the field ID list intact, and then
>>> create a new ColumnFile with the reduced list. Step 3 would look like this:
>>>
>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*,
>>> status: REPLACED,
>>>                 column_files: [ {field_ids: [1, 2]*,* location:
>>> "update1.parquet", status: *DELETED}, **{field_ids: [1], location:
>>> "update1.parquet", status: ADDED}, **{field_ids: [2, 3], location:
>>> "update2.parquet", status: ADDED}* ]}
>>>
>>>   - Alternative approach 2:
>>> We can use REPLACED as originally, and also have a field in the tracking
>>> data to *keep track of the removed field IDs* (similarly to
>>> Tracking.DELETED_POSITIONS). Step 3 would look like this:
>>>
>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*,
>>> status: REPLACED,
>>>                 column_files: [ {field_ids: *[1],* location:
>>> "update1.parquet", status: *REPLACED, removed_field_ids: [2]}, **{field_ids:
>>> [2, 3], location: "update2.parquet", status: ADDED}* ]}
>>>
>>>   - Preference:
>>> I think the REPLACED approach is cleaner, I'd prefer that. In case we
>>> want to track what IDs were removed, we could follow "alternative approach
>>> 2".
>>>
>>>   - Additional note:
>>> Re-writing the column file as REPLACED shouldn't alter the sequence
>>> number of the column file (if we decide to have one).
>>>
>>> *3) Snapshot ID*
>>> 'Tracking' has this, I think it could make sense for column files too.
>>>
>>> *4) First row ID*
>>> Row IDs should come from the base file's metadata IMO, we shouldn't
>>> store this for the update files.
>>>
>>> *Summary of all the potential tracking fields:*
>>>
>>> ColumnFileTracking
>>> required status int
>>> optional snapshot_id long
>>> optional sequence_number long
>>> optional removed_field_ids list<int>
>>>
>>> *Field IDs*
>>> The first free field ID within TrackedFile is 157. The last used one is
>>> DeletionVector.CARDINALITY
>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeletionVector.java#L42>
>>> with field ID 156.
>>> I'm working with Amogh to coordinate assigning the required field IDs
>>> here.
>>>
>>> Let me know if I missed anything here! Any feedback is appreciated!
>>>
>>> Best Regards,
>>> Gabor
>>>
>>>
