rdblue commented on pull request #2354: URL: https://github.com/apache/iceberg/pull/2354#issuecomment-811486171
> My question is: after reverting the table to t3, should people still see the incorrect row identifier (account_id, profile_id) by default or people should see the correct row identifier (account_id) by default?

I think I see the miscommunication. I don't think there is a way to roll back to t3. There is a snapshot created at t2, t3, and t5. Those snapshots are accessible via time travel and rollback. The rest of the table metadata is independent so rolling back doesn't change it. To revert both the bad write and the configuration change, the user should roll back and then set the row identifier fields to just `account_id`.

Keeping table metadata and data separate (and only versioning data) is the right behavior, I think. Data is constantly evolving and we don't want to accidentally revert metadata changes, like updating table properties, when the data snapshot is rolled back.

Consider a slightly different scenario where the rollback to t3 was needed because the source was producing bad data. Why should the `profile_id` be removed from the row identifier in that case? If Iceberg did that implicitly, then after the corrected data is turned back on, Iceberg would start deleting rows incorrectly using the wrong key.

I think the right approach is to keep data a separate dimension. Since we want Iceberg to be a coordination layer between multiple services that don't know about one another, I think it would be bad for actions that fix data to also make possibly unknown changes to metadata.
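To make the separation concrete, here is a toy sketch of the behavior described above. It is not Iceberg code: `ToyTable`, `commit`, `set_row_identifier`, and `rollback_to` are hypothetical names, the row values are made up, and the mistaken identifier change is assumed to happen between the t3 and t5 snapshots. It only models the idea that data snapshots are versioned while the rest of the metadata is not.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyTable:
    """Toy model (not Iceberg): only data snapshots are versioned."""

    snapshots: Dict[str, List[dict]] = field(default_factory=dict)
    current_snapshot: Optional[str] = None
    row_identifier: Tuple[str, ...] = ()

    def commit(self, snapshot_id: str, rows: List[dict]) -> None:
        # A data write creates a new snapshot and makes it current.
        self.snapshots[snapshot_id] = list(rows)
        self.current_snapshot = snapshot_id

    def set_row_identifier(self, *fields: str) -> None:
        # A metadata change: applied in place, no snapshot is created.
        self.row_identifier = tuple(fields)

    def rollback_to(self, snapshot_id: str) -> None:
        # Rollback only moves the current-snapshot pointer; the row
        # identifier (and any other metadata) is left untouched.
        if snapshot_id not in self.snapshots:
            raise ValueError(f"no snapshot {snapshot_id!r}")
        self.current_snapshot = snapshot_id


# Hypothetical timeline: snapshots at t2, t3, and t5, with the mistaken
# identifier change assumed to land between t3 and t5.
table = ToyTable(row_identifier=("account_id",))
table.commit("t2", [{"account_id": 1, "profile_id": 10}])
table.commit("t3", [{"account_id": 1, "profile_id": 11}])
table.set_row_identifier("account_id", "profile_id")  # the mistaken change
table.commit("t5", [{"account_id": 1, "profile_id": 12}])  # the bad write

# Step 1: roll back the data. The identifier is still the mistaken one.
table.rollback_to("t3")
identifier_after_rollback = table.row_identifier

# Step 2: fix the configuration explicitly, as a separate action.
table.set_row_identifier("account_id")
```

In this sketch the rollback alone leaves `(account_id, profile_id)` in place, so reverting both the bad write and the configuration change takes two explicit steps, which is the behavior I'm arguing for.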
