snazy commented on pull request #3257: URL: https://github.com/apache/iceberg/pull/3257#issuecomment-943176314
@rdblue Sorry for the late reply.

Yes, this one changes the relationship between Nessie and Iceberg metadata. TL;DR: the changes ensure that changes made to the same table on different branches can later be merged without ending up with duplicate column-IDs, partition-IDs, or the like.

You're right: initially (in the "early Nessie days"), every Nessie commit held a pointer to the table-metadata. This works fine until you reference the same table on different branches and perform schema changes (think: ALTER TABLE ADD COLUMN) on both branches. Each branch then assigns its next column-ID independently, so the same column-ID ends up being used for different columns, which can lead to data corruption when the branches get merged.

So the initial approach in this PR was to maintain the table-metadata across all branches and "just" reference the snapshot-ID from Nessie commits. That led to other issues (I won't explain them further here, but schema changes became an issue again).

The current approach is closer to the original one: Nessie commits hold the pointer to the table-metadata, but state that matters across all Nessie branches (e.g. last-column-ID) is tracked globally. It's currently implemented via additional functionality in TableMetadata to retrieve the "global state" (last-column-ID, last-used-partition-ID, last-assigned-sequence-ID) as an object that's opaque to Nessie, plus functionality to update a TableMetadata using that "global state".
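
To make the column-ID collision concrete, here is a minimal Python sketch of the idea. It is not Iceberg or Nessie code: `BranchMetadata`, `GlobalState`, `add_column_global`, and `conflicting_ids` are hypothetical names made up for illustration, and the real change tracks more than just last-column-ID. It only contrasts a per-branch counter with a counter shared by all branches.

```python
from dataclasses import dataclass, field


@dataclass
class BranchMetadata:
    """Hypothetical stand-in for the table-metadata one branch points to."""
    columns: dict = field(default_factory=dict)  # column id -> column name
    last_column_id: int = 0

    def fork(self):
        # A new branch starts from a copy of the current metadata.
        return BranchMetadata(dict(self.columns), self.last_column_id)

    def add_column_local(self, name):
        # "Early days" behaviour: each branch advances its own counter.
        self.last_column_id += 1
        self.columns[self.last_column_id] = name
        return self.last_column_id


class GlobalState:
    """Hypothetical table-wide state shared by every branch."""

    def __init__(self, last_column_id=0):
        self.last_column_id = last_column_id

    def next_column_id(self):
        self.last_column_id += 1
        return self.last_column_id


def add_column_global(meta, state, name):
    # The column id comes from the shared state, not from the branch.
    col_id = state.next_column_id()
    meta.columns[col_id] = name
    meta.last_column_id = col_id
    return col_id


def conflicting_ids(a, b):
    """Column ids that the two branches use for different columns."""
    shared = a.columns.keys() & b.columns.keys()
    return sorted(i for i in shared if a.columns[i] != b.columns[i])


base = BranchMetadata({1: "id", 2: "name"}, last_column_id=2)

# Per-branch counters: both branches hand out id 3 for different columns.
main_local, dev_local = base.fork(), base.fork()
main_local.add_column_local("email")
dev_local.add_column_local("phone")
local_conflicts = conflicting_ids(main_local, dev_local)

# Shared counter: the ids stay distinct, so the branches can be merged.
state = GlobalState(last_column_id=2)
main_global, dev_global = base.fork(), base.fork()
email_id = add_column_global(main_global, state, "email")
phone_id = add_column_global(dev_global, state, "phone")
global_conflicts = conflicting_ids(main_global, dev_global)
```

With per-branch counters, `local_conflicts` contains the reused ID; with the shared counter, `global_conflicts` is empty because "email" and "phone" receive different IDs.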
