Hey Iceberg Community,

Anurag started a separate, focused discussion
<https://lists.apache.org/thread/jbh1gbrso5h6l4by9rh9poy2cjjtb8j0> on the
column update file representation, similarly, let me start another one for
the metadata representation. Hopefully, we can make some iterations on this
before the next sync.

We covered this topic in the sync yesterday and agreed on some of the
fields, but we left the "tracking" information part open. The *required*
fields we agreed on so far:

ColumnFile
field_ids list<int>
location string
file_size_in_bytes long

*Tracking information*
Additionally to the above, we discussed the need of tracking information.
These are the potential ones:

*1) Sequence number*

   - Usage for _last_updated_sequence_number

I did think about how to produce _last_updated_sequence_number and I think
technically we don't need to store the sequence number on the update file
level for that. I wrote up the steps here
<https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?pli=1&tab=t.xvm52pv4m7lq>,
but in a nutshell: we could either fille that from the
_last_updated_sequence_number written into the latest column file, or if
null we can use the base file's file_sequence_number.

   - Usage for equality deletes

As we agreed previously, we don't want to support update files together
with equality deletes, so we won't need to store column file level sequence
numbers for this either.

   - Usage for CDC, observability, etc.

I'm wondering if there is any use case where we want to see the order of
the column updates to see the sequence they were created. If this matters
for CDC or reproducibility or anything else, then let's have a column file
level sequence number too, if not, we can omit this.

*2) Status*
I think, similarly to TrackedFile, we need the following statuses here:
EXISTING, ADDED, DELETED, REPLACED
With these, when the base file's status is REPLACED, taking a look at the
column_files we can know exactly what has changed wrt the column updates.
Some examples to demonstrate:

Step 1: Start with an existing base file:
base file: {location: "file.parquet", seq_num: 1, file_seq_num: 1, status:
EXISTING, column_files:[]}

Step 2: Adding a column update for field IDs [1, 2]:
base file: {location: "file.parquet", seq_num: 1, file_seq_num: *2*,
status: *REPLACED*,
                column_files: [ *{field_ids: [1, 2], location:
"update1.parquet", status: ADDED}* ]}

Step 3: Adding an overlapping column update with field IDs [2, 3]
("de-duplicate" field IDs):
base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*, status:
 REPLACED,
                column_files: [ {field_ids: *[1],* location:
"update1.parquet", status: *REPLACED}, **{field_ids: [2, 3], location:
"update2.parquet", status: ADDED}* ]}

Step 4: Add another column update for field ID [1] to completely eliminate
one previous update file from metadata
base file: {location: "file.parquet", seq_num: 1, file_seq_num: *4*, status:
 REPLACED,
                column_files: [ {field_ids: [1]*,* location:
"update1.parquet", status: *DELETED},*  {field_ids: [2, 3], location:
"update2.parquet", status: *EXISTING*}, *{field_ids: [1], location:
"update3.parquet", status: ADDED}* ]}

*Thoughts on REPLACED*
In step 3, we marked the existing column file as REPLACED while reducing
the field_ids list to de-duplicate them with the incoming update
file's field_ids. With this, REPLACED indicates that field_ids content was
reduced, however, we won't know exactly what field IDs were removed.

  - Alternative approach 1:
We could use DELETED status leaving the field ID list intact, and then
create a new ColumnFile with the reduced list. Step 3 would look like this:

base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*, status:
 REPLACED,
                column_files: [ {field_ids: [1, 2]*,* location:
"update1.parquet", status: *DELETED}, **{field_ids: [1], location:
"update1.parquet", status: ADDED}, **{field_ids: [2, 3], location:
"update2.parquet", status: ADDED}* ]}

  - Alternative approach 2:
We can use REPLACED as originally, and also have a field in the tracking
data to *keep track of the removed field IDs* (similarly to
Tracking.DELETED_POSITIONS). Step 3 would look like this:

base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*, status:
 REPLACED,
                column_files: [ {field_ids: *[1],* location:
"update1.parquet", status: *REPLACED, removed_field_ids: [2]}, **{field_ids:
[2, 3], location: "update2.parquet", status: ADDED}* ]}

  - Preference:
I think the REPLACED approach is cleaner, I'd prefer that. In case we want
to track what IDs were removed, we could follow "alternative approach 2".

  - Additional, note:
Re-writing the column file as REPLACED shouldn't alter the sequence number
of the column file (if we decide to have one).

*3) Snapshot ID*
'Tracking' has this, I think it could make sense for column files too.

*4) First row ID*
Row IDs should come from the base file's metadata IMO, we shouldn't store
this for the update files.

*Summary of all the potential tracking fields:*

ColumnFileTracking
required status int
optional snapshot_id long
optional sequence_number long
optional removed_field_ids list<int>

*Field IDs*
The first free field ID within TrackedFile is 157. The last used one is
DeletionVector.CARDINALITY
<https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeletionVector.java#L42>
with field ID 156.
I'm working with Amogh to coordinate assigning the required field IDs here.

Let me know if I miss anything here! Any feedback is appreciated!

Best Regards,
Gabor

Reply via email to