[I] [Spec] The difference between source-id, field-id and column-id is unclear [iceberg]

via GitHub Wed, 02 Jul 2025 03:58:34 -0700


Tishj opened a new issue, #13446:
URL: https://github.com/apache/iceberg/issues/13446


   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   The spec (https://iceberg.apache.org/spec/) does not explain the difference 
between a column id, a field-id and a source-id.
   (I believe they are all synonymous with each other.)
   
   ### Source Id
   
   There are 16 mentions of `source-id`, none of which explains how it differs 
from field-id or column-id.
   
   The only way to piece this information together is this line:
   > A source column id or a list of source column ids from the table’s schema
   
   That is the **only** place where `source-id` is referenced in an 
explanation, and it isn't even searchable, because it uses **source column id**, 
a term that is never used in conjunction with `source-id`.
   
   ### Column Id and Field Id
   
   The Table Metadata has a field `last-column-id`, which refers to a column 
ID.
   I can infer from context that this means the `id` field of the entries in 
the Schema, but the spec never states this connection explicitly.
   
   `column id` is synonymous with `field-id`, which is also only really 
explained by this:
   > Column IDs are required to be stored as [field 
IDs](http://github.com/apache/parquet-format/blob/40699d05bd24181de6b1457babbee2c16dce3803/src/main/thrift/parquet.thrift#L459)
 on the parquet schema.
   
   For context, I have been working with this spec for months, and for the 
longest time I was convinced there was a difference between the two, expecting 
one of them (column id or source id) to refer to the **order** of appearance in 
the schema.
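
   A minimal sketch of the reading described above, assuming that `source-id` in a partition field simply names the `id` of a schema field and that `last-column-id` is the highest `id` assigned so far. The field names, values and the helper `resolve_source` are made up for illustration, and the metadata fragments are simplified:

```python
# Simplified, illustrative metadata fragments (not taken from a real table).
schema = {
    "type": "struct",
    "fields": [
        {"id": 1, "name": "event_id", "required": True, "type": "long"},
        {"id": 2, "name": "event_ts", "required": True, "type": "timestamp"},
    ],
}
last_column_id = 2  # "last-column-id" in the table metadata

# Simplified partition field: only the keys relevant to this issue.
partition_field = {"source-id": 2, "transform": "day", "name": "event_day"}


def resolve_source(schema, source_id):
    """Hypothetical helper: find the schema field whose `id` equals source_id."""
    by_id = {field["id"]: field for field in schema["fields"]}
    return by_id[source_id]


# The lookup goes through the `id` value, not the position in the field list.
print(resolve_source(schema, partition_field["source-id"])["name"])  # event_ts

# Reordering the fields does not change what source-id 2 refers to.
reordered = {"type": "struct", "fields": list(reversed(schema["fields"]))}
print(resolve_source(reordered, partition_field["source-id"])["name"])  # event_ts
```

   Under this assumption the ID is a stable identifier rather than an ordinal, which is the distinction I was missing.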
   
   ### Inconsistent use of `-` and `_`
   
   I'm not talking about the tables where the fields are listed; those are 
clear as day. But when the fields are referred to elsewhere, the correct name 
isn't always used.
   
   One example with `next-row-id`:
   > - When a table is upgraded to v3, `next_row_id` should be initialized to 0
   > - When committing a new snapshot `next-row-id` must be incremented by at 
least the number of newly assigned row ids in the snapshot
   
   This means you have to search for both variants to find every mention of 
the field.
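
   As a workaround sketch, a search that treats `-` and `_` as interchangeable finds both spellings in one pass. The helper name `field_pattern` is hypothetical, and the sample text is abbreviated from the two quoted lines:

```python
import re


def field_pattern(name):
    """Build a regex that matches `name` with '-' and '_' interchangeable."""
    parts = re.split(r"[-_]", name)
    return re.compile("[-_]".join(re.escape(part) for part in parts))


text = (
    "When a table is upgraded to v3, `next_row_id` should be initialized to 0\n"
    "When committing a new snapshot `next-row-id` must be incremented\n"
)

matches = field_pattern("next-row-id").findall(text)
print(matches)  # ['next_row_id', 'next-row-id']
```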
   
   It would be great if some time could be spent on making these connections 
clearer, as the answers are currently **well** hidden.
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time


