openinx commented on pull request #2354: URL: https://github.com/apache/iceberg/pull/2354#issuecomment-809033279
I think we should keep track of multiple versions of the row identifier in the Apache Iceberg schema. Let's discuss the case you described: adding `profile_id` to a table previously identified only by `account_id`.

- t1: The user defines `account_id` as the row identifier.
- t2: Write a few records into the table.
- t3: Write a few equality deletions (by `account_id`) into the table.
- t4: Add `profile_id` to the row identifier; the identifier is now `account_id` & `profile_id`.
- t5: Write a few equality deletions (by `account_id` & `profile_id`) into the table.

In my opinion, the row identifier specification was introduced in the Iceberg table format because we expect standard SQL's `PRIMARY KEY` to be mapped to those row identifier columns automatically. Without the row identifier spec, we would not know how to track those keys when creating a table like:

```sql
CREATE TABLE sample(id INT, data STRING, PRIMARY KEY (id) NOT ENFORCED);
```

Back to the case above: at timestamps `t4` and `t5`, the table's row identifier is `account_id` & `profile_id`. If people want to read the snapshot at timestamp `t3`, we should use the row identifier `account_id`. So if we don't track multiple versions of the identifier, how could we read the row identifier for old snapshots? Using the latest `account_id` & `profile_id` would confuse people a lot, because those rows were actually deleted by the `account_id` field only.
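
To make the versioning argument concrete, here is a minimal Python sketch of looking up the identifier that was in effect at a given point on the t1..t5 timeline. The class `IdentifierHistory` and its methods are hypothetical names chosen for illustration. They are not part of Iceberg's API, and the sketch says nothing about how Iceberg would actually store this in table metadata.

```python
from bisect import bisect_right


class IdentifierHistory:
    """Hypothetical helper: remembers which columns formed the row
    identifier from a given logical time onward."""

    def __init__(self):
        self._times = []     # start time of each identifier version
        self._versions = []  # identifier columns for each version

    def set_identifier(self, at, columns):
        # Assumption: changes are recorded in increasing time order.
        if self._times and at <= self._times[-1]:
            raise ValueError("identifier changes must be recorded in time order")
        self._times.append(at)
        self._versions.append(tuple(columns))

    def identifier_at(self, at):
        # Pick the latest version whose start time is <= the requested time.
        i = bisect_right(self._times, at)
        if i == 0:
            raise LookupError("no row identifier defined at time %r" % (at,))
        return self._versions[i - 1]


# Timeline from the comment: identifier set at t1, changed at t4.
history = IdentifierHistory()
history.set_identifier(1, ["account_id"])
history.set_identifier(4, ["account_id", "profile_id"])

ident_t3 = history.identifier_at(3)  # snapshot written at t3
ident_t5 = history.identifier_at(5)  # snapshot written at t5
```

With only the latest identifier stored, a reader of the `t3` snapshot would see `account_id` & `profile_id`, even though the deletions in that snapshot were keyed on `account_id` alone.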
