Re: [Discuss] Column Update metadata representation

Gábor Kaszab Tue, 02 Jun 2026 04:58:09 -0700

Hey Iceberg Community,

Yesterday in the V4 AMT sync there were some discussions that are also
relevant here. Let me summarize!


1) No new sequence number(s)
One decision we made is to not introduce a new field for the sequence
number of the latest column file, but to *bump the data sequence number of
TrackedFile* whenever we add a new column file. With this, we achieve the
following:

   - Use TrackedFile.dataSequenceNumber() for last_updated_sequence_number
   for the null LUSN values in the latest update file
   - We can get rid of the equality deletes associated with the base file
   when adding a column update (current direction is to rewrite eq-deletes to
   DVs when adding column updates)
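
To illustrate the first point, here is a minimal Python sketch of the
fallback. The function name and parameters are hypothetical, not Iceberg API;
data_sequence_number stands in for TrackedFile.dataSequenceNumber(), and None
models a null LUSN value read from the latest update file.

```python
from typing import Optional


def resolve_lusn(stored_lusn: Optional[int], data_sequence_number: int) -> int:
    """Return the effective _last_updated_sequence_number for one row.

    stored_lusn: value read from the latest update file (None when null).
    data_sequence_number: hypothetical stand-in for
        TrackedFile.dataSequenceNumber(), bumped when a column file is added.
    """
    # Compare against None explicitly so a stored value of 0 is not overridden.
    return data_sequence_number if stored_lusn is None else stored_lusn


# Illustrative values only: the tracked file was bumped to sequence number 7.
stored_values = [None, 5, None]
effective = [resolve_lusn(v, 7) for v in stored_values]
print(effective)
```

Rows with a null LUSN inherit the bumped data sequence number, while rows
that carry their own value keep it.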


With this, the metadata structures would look like the following:

Tracking
...
optional latest_column_file_snapshot_id long

ColumnFile
required field_ids list<int>
required location string
required file_size_in_bytes long

2) Treat base file as a regular column file
This was a more exploratory idea: what if we include (some of) the base
file details in TrackedFile.column_files? This would have the following
advantages:

   - We won't need the "duplication" of some fields in the schema, like
   TrackedFile.location vs ColumnFile.location, and the same for
   file_size_in_bytes (maybe also for split_offsets)
   - This design would be a more natural fit for column families (not being
   designed now, but there seems to be some interest for the future)

While I think this approach makes sense, there are some areas that *we have
to think through*:

   - *Making base file a first class citizen of column_files would require
   the following:*
      - Move location, file_size_in_bytes and potentially split_offsets
      from TrackedFile into ColumnFile
      - Populate fieldIDs list for the base file entry too
         - Include row_id in the field IDs too. Should we disallow serving
         this field from regular column files?
         - We have to handle wide tables here: 1k cols => 4KB, 10k cols =>
         40KB in-memory size. On storage, Parquet V2 probably compresses
         this efficiently.
         - Original projection algorithm: Find whatever projected fields
         are covered by column files, leave the rest for the base file
         - We could leave field ID list empty for base file, but then not
         all column files are treated equal
   - *Do we have to know which column file is the base file?*
      - Projections: If we keep a full column ID list, we don't have to
      make a distinction of base file from column updates for this
      - _file metadata column: We still want to populate this with base
      file's location
         - We can keep TrackedFile.location() that could delegate to the
         relevant column file (or keep an in-memory, not persisted field in
         TrackedFile for this)
      - Positional deletes: DVs no longer use file path to find the
      relevant data file. No need to find base file's path
      - Do we have other use cases where this could cause a mess?
   - *Mental model change: Physical file => logical file*
      - Instead of saying "TrackedFile is a physical file with additional
      physical column files" we switch to "TrackedFile is a logical file
      that is represented by physical column files"
      - In the current mental model, the ID of a file is its location
      - In a mental model of logical files, do we still want to use
      location as an ID?
         - Is it the base file's location? Somewhat hurts the mental model
         IMO
         - Introduce a general "file ID"? Is this change too invasive?
   - *Split offsets:*
      - Currently, planning for split reading uses the split offsets of the
      base file
      - Technically, we can move split_offsets to the ColumnFile level and
      let the planner decide which column file to use for splits
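
The projection point above can be sketched in Python as follows. This is only
an illustration under stated assumptions: the base file is stored as a regular
ColumnFile entry with a full field ID list, field ID lists are already
de-duplicated, and the class and function names are hypothetical rather than
Iceberg API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ColumnFile:
    # Mirrors the three fields of the proposed ColumnFile struct.
    field_ids: List[int]
    location: str
    file_size_in_bytes: int


def plan_projection(column_files: List[ColumnFile],
                    projected: List[int]) -> Dict[str, List[int]]:
    """Map each file location to the projected field IDs it serves.

    Assumes each field ID is listed by at most one entry, so no entry needs
    to be treated specially as "the base file".
    """
    owner = {fid: cf.location for cf in column_files for fid in cf.field_ids}
    plan: Dict[str, List[int]] = {}
    for fid in projected:
        if fid not in owner:
            raise KeyError(f"field ID {fid} is not served by any column file")
        plan.setdefault(owner[fid], []).append(fid)
    return plan


# Illustrative entries; sizes and field IDs are made up.
files = [
    ColumnFile([3, 4, 5], "file.parquet", 4096),   # base file as a regular entry
    ColumnFile([1], "update3.parquet", 512),
    ColumnFile([2], "update2.parquet", 512),
]
plan = plan_projection(files, [1, 3, 4])
print(plan)
```

With a full field ID list on every entry, the planner only needs a lookup
from field ID to location. With an empty list for the base file, it would
instead fall back to "leave the rest for the base file", which requires
knowing which entry is the base file.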

Let me know if I missed anything! Looking forward to your opinions!
Gabor


Gábor Kaszab <[email protected]> wrote (on Thu, May 28, 2026, 16:45):

> Hey Iceberg Community,
>
> I discussed this with Amogh today, and apparently I was trying to solve
> change detection on a lower level (per-column file) than optimal. According
> to our conversation, here is an update of the proposed metadata structure
> for column updates:
>
> Tracking
> ...
> optional latest_column_file_sequence_number long
> optional latest_column_file_snapshot_id long
>
> ColumnFile
> required field_ids list<int>
> required location string
> required file_size_in_bytes long
>
> Note: as we currently see it, there is no need to track per-column-file
> status, sequence numbers, or snapshot IDs.
>
> How change detection would work:
> Whenever a column file is added to an entry (TrackedFile), the old entry
> would be part of the new snapshot with REPLACED status, and there would be a
> new entry with MODIFIED status that contains the new column file. By
> matching the REPLACED and MODIFIED entries by location, we can see the
> differences in the column file list. Additionally,
> Tracking.latest_column_file_snapshot_id == current_snapshot_id() signals
> that we should look at the column files to see the changes.
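
A minimal Python sketch of the matching step described above. Entries are
modeled as plain dicts whose keys are illustrative only, and the function
name is hypothetical; this is not Iceberg API, just one way the REPLACED and
MODIFIED entries could be paired by location.

```python
from typing import Any, Dict, List, Set

Entry = Dict[str, Any]


def added_column_files(entries: List[Entry]) -> Dict[str, Set[str]]:
    """For each base-file location, return the column file locations that
    appear in the MODIFIED entry but not in the matching REPLACED entry."""
    replaced = {e["location"]: e for e in entries if e["status"] == "REPLACED"}
    changes: Dict[str, Set[str]] = {}
    for e in entries:
        if e["status"] != "MODIFIED" or e["location"] not in replaced:
            continue
        old = {cf["location"] for cf in replaced[e["location"]]["column_files"]}
        new = {cf["location"] for cf in e["column_files"]}
        changes[e["location"]] = new - old
    return changes


# Illustrative snapshot content: one tracked file gained a second column file.
entries = [
    {"location": "file.parquet", "status": "REPLACED",
     "column_files": [{"field_ids": [1, 2], "location": "update1.parquet"}]},
    {"location": "file.parquet", "status": "MODIFIED",
     "column_files": [{"field_ids": [1], "location": "update1.parquet"},
                      {"field_ids": [2, 3], "location": "update2.parquet"}]},
]
changes = added_column_files(entries)
print(changes)
```

The same pairing could also compare field_ids per column file to detect
de-duplicated lists; the sketch only reports newly added column files.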
>
> @amogh, let me know if I missed something!
>
> Any feedback is appreciated!
> Gabor
>
>
> Gábor Kaszab <[email protected]> wrote (on Fri, May 22, 2026, 16:48):
>
>> Thank you for taking a look and getting back with your thoughts, Steven!
>>
>> You're right, we shouldn't repurpose the semantics of existing fields.
>> The idea to keep track of *latest_column_file_sequence_number* as you
>> described in the doc makes sense to me. Going with this design, we won't
>> need to keep sequence numbers on a column file level (unless we want to
>> know the order they were added, but I don't see a use case for that).
>>
>> I see the following metadata structure changes:
>> Tracking
>> ...
>> optional latest_column_file_sequence_number long
>> ColumnFile
>> required field_ids list<int>
>> required location string
>> required file_size_in_bytes long
>> required column_file_tracking ColumnFileTracking
>> ColumnFileTracking
>> required status int
>> optional snapshot_id long
>> optional removed_field_ids list<int>
>>
>> ColumnFileTracking.status could have values {ADDED, EXISTING, DELETED,
>> REPLACED}, similarly to TrackedFile. With this, we could see exactly what
>> changed wrt the column files simply by looking at the column file
>> metadata. See details in my first mail in this thread.
>>
>> Would be nice to hear further feedback on this!
>> Best Regards,
>> Gabor Kaszab
>>
>> Steven Wu <[email protected]> wrote (on Thu, May 21, 2026, 23:55):
>>
>>> Gabor, thanks for starting this discussion.
>>>
>>> I have been thinking about this problem independently since the column
>>> update sync. Here is the detailed design document
>>> <https://docs.google.com/document/d/160-FizR6zOASMb86NycfgCm7cZbh6HK7FLcUW8_Xp-0/edit?usp=sharing>
>>> .
>>>
>>> Gabor, I read your section of How to support
>>> _last_updated_sequence_number
>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.xvm52pv4m7lq#heading=h.k2neg79ocgu>.
>>> If I understand correctly, the proposal is to repurpose
>>> file_sequence_number to capture the snapshot sequence number of the latest
>>> column file. I suggest we don't change the semantics of the existing
>>> file_sequence_number. Instead, we can introduce a new
>>> latest_column_file_sequence_number field in the tracking struct. My doc
>>> described the reasoning.
>>>
>>> That is the only real difference as far as I can tell. Otherwise, I
>>> think we had the same idea/design.
>>>
>>> On Wed, May 20, 2026 at 11:56 AM Gábor Kaszab <[email protected]>
>>> wrote:
>>>
>>>> Hey Iceberg Community,
>>>>
>>>> Anurag started a separate, focused discussion
>>>> <https://lists.apache.org/thread/jbh1gbrso5h6l4by9rh9poy2cjjtb8j0> on
>>>> the column update file representation. Similarly, let me start another
>>>> one for the metadata representation. Hopefully, we can make some
>>>> iterations on this before the next sync.
>>>>
>>>> We covered this topic in the sync yesterday and agreed on some of the
>>>> fields, but we left the "tracking" information part open. The
>>>> *required* fields we agreed on so far:
>>>>
>>>> ColumnFile
>>>> field_ids list<int>
>>>> location string
>>>> file_size_in_bytes long
>>>>
>>>> *Tracking information*
>>>> In addition to the above, we discussed the need for tracking
>>>> information. These are the potential fields:
>>>>
>>>> *1) Sequence number*
>>>>
>>>>    - Usage for _last_updated_sequence_number
>>>>
>>>> I did think about how to produce _last_updated_sequence_number and I
>>>> think technically we don't need to store the sequence number on the update
>>>> file level for that. I wrote up the steps here
>>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?pli=1&tab=t.xvm52pv4m7lq>,
>>>> but in a nutshell: we could either fill it from the
>>>> _last_updated_sequence_number written into the latest column file or, if
>>>> that is null, use the base file's file_sequence_number.
>>>>
>>>>    - Usage for equality deletes
>>>>
>>>> As we agreed previously, we don't want to support update files together
>>>> with equality deletes, so we won't need to store column file level sequence
>>>> numbers for this either.
>>>>
>>>>    - Usage for CDC, observability, etc.
>>>>
>>>> I'm wondering if there is any use case where we want to know the order
>>>> in which the column updates were created. If this matters for CDC,
>>>> reproducibility, or anything else, then let's have a column file level
>>>> sequence number too; if not, we can omit this.
>>>>
>>>> *2) Status*
>>>> I think, similarly to TrackedFile, we need the following statuses here:
>>>> EXISTING, ADDED, DELETED, REPLACED
>>>> With these, when the base file's status is REPLACED, we can tell exactly
>>>> what has changed wrt the column updates by looking at column_files.
>>>> Some examples to demonstrate:
>>>>
>>>> Step 1: Start with an existing base file:
>>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: 1,
>>>> status: EXISTING, column_files:[]}
>>>>
>>>> Step 2: Adding a column update for field IDs [1, 2]:
>>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *2*,
>>>> status: *REPLACED*,
>>>>                 column_files: [ *{field_ids: [1, 2], location:
>>>> "update1.parquet", status: ADDED}* ]}
>>>>
>>>> Step 3: Adding an overlapping column update with field IDs [2, 3]
>>>> ("de-duplicate" field IDs):
>>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*,
>>>> status: REPLACED,
>>>>                 column_files: [ {field_ids: *[1],* location:
>>>> "update1.parquet", status: *REPLACED}, **{field_ids: [2, 3], location:
>>>> "update2.parquet", status: ADDED}* ]}
>>>>
>>>> Step 4: Add another column update for field ID [1] to completely
>>>> eliminate one previous update file from metadata
>>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *4*,
>>>> status: REPLACED,
>>>>                 column_files: [ {field_ids: [1]*,* location:
>>>> "update1.parquet", status: *DELETED},*  {field_ids: [2, 3], location:
>>>> "update2.parquet", status: *EXISTING*}, *{field_ids: [1], location:
>>>> "update3.parquet", status: ADDED}* ]}
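
Steps 2 to 4 above can be reproduced with a small Python sketch. The function
name and dict keys are hypothetical, and it assumes one thing the steps do not
state explicitly: entries already marked DELETED by an earlier commit are not
carried forward.

```python
from typing import Any, Dict, List

ColumnFile = Dict[str, Any]


def add_column_update(column_files: List[ColumnFile],
                      field_ids: List[int],
                      location: str) -> List[ColumnFile]:
    """Return the new column_files list after adding one column update."""
    new_ids = set(field_ids)
    result: List[ColumnFile] = []
    for cf in column_files:
        if cf["status"] == "DELETED":
            continue  # assumption: previously deleted entries are dropped
        remaining = [f for f in cf["field_ids"] if f not in new_ids]
        if not remaining:
            # Fully superseded: keep the field ID list intact, mark DELETED.
            result.append({**cf, "status": "DELETED"})
        elif len(remaining) < len(cf["field_ids"]):
            # Partially superseded: de-duplicate field IDs, mark REPLACED.
            result.append({**cf, "field_ids": remaining, "status": "REPLACED"})
        else:
            result.append({**cf, "status": "EXISTING"})
    result.append({"field_ids": list(field_ids), "location": location,
                   "status": "ADDED"})
    return result


step2 = add_column_update([], [1, 2], "update1.parquet")
step3 = add_column_update(step2, [2, 3], "update2.parquet")
step4 = add_column_update(step3, [1], "update3.parquet")
for cf in step4:
    print(cf)
```

Running the three calls yields the same column_files content as the step 2,
step 3, and step 4 examples, which makes it easy to try other overlap
patterns.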
>>>>
>>>> *Thoughts on REPLACED*
>>>> In step 3, we marked the existing column file as REPLACED while
>>>> reducing the field_ids list to de-duplicate them with the incoming update
>>>> file's field_ids. With this, REPLACED indicates that field_ids content was
>>>> reduced, however, we won't know exactly what field IDs were removed.
>>>>
>>>>   - Alternative approach 1:
>>>> We could use DELETED status leaving the field ID list intact, and then
>>>> create a new ColumnFile with the reduced list. Step 3 would look like this:
>>>>
>>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*,
>>>> status: REPLACED,
>>>>                 column_files: [ {field_ids: [1, 2]*,* location:
>>>> "update1.parquet", status: *DELETED}, **{field_ids: [1], location:
>>>> "update1.parquet", status: ADDED}, **{field_ids: [2, 3], location:
>>>> "update2.parquet", status: ADDED}* ]}
>>>>
>>>>   - Alternative approach 2:
>>>> We can use REPLACED as originally proposed, and also have a field in
>>>> the tracking data to *keep track of the removed field IDs* (similarly to
>>>> Tracking.DELETED_POSITIONS). Step 3 would look like this:
>>>>
>>>> base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*,
>>>> status: REPLACED,
>>>>                 column_files: [ {field_ids: *[1],* location:
>>>> "update1.parquet", status: *REPLACED, removed_field_ids: [2]}, **{field_ids:
>>>> [2, 3], location: "update2.parquet", status: ADDED}* ]}
>>>>
>>>>   - Preference:
>>>> I think the REPLACED approach is cleaner, so I'd prefer that. If we
>>>> want to track which IDs were removed, we could follow "alternative
>>>> approach 2".
>>>>
>>>>   - Additional note:
>>>> Re-writing the column file as REPLACED shouldn't alter the sequence
>>>> number of the column file (if we decide to have one).
>>>>
>>>> *3) Snapshot ID*
>>>> 'Tracking' has this; I think it could make sense for column files too.
>>>>
>>>> *4) First row ID*
>>>> Row IDs should come from the base file's metadata IMO, we shouldn't
>>>> store this for the update files.
>>>>
>>>> *Summary of all the potential tracking fields:*
>>>>
>>>> ColumnFileTracking
>>>> required status int
>>>> optional snapshot_id long
>>>> optional sequence_number long
>>>> optional removed_field_ids list<int>
>>>>
>>>> *Field IDs*
>>>> The first free field ID within TrackedFile is 157. The last used one is
>>>> DeletionVector.CARDINALITY
>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeletionVector.java#L42>
>>>> with field ID 156.
>>>> I'm working with Amogh to coordinate assigning the required field IDs
>>>> here.
>>>>
>>>> Let me know if I missed anything here! Any feedback is appreciated!
>>>>
>>>> Best Regards,
>>>> Gabor
>>>>
>>>>
