rdblue commented on pull request #1849: URL: https://github.com/apache/iceberg/pull/1849#issuecomment-736916300
My main concern with this API is that it doesn't really help with some of the major challenges of resolving conflicts. It may be that we have been talking about different parts of the problem. To me, there are two main approaches: * **Cherry-pick commits**: A transaction would keep track of different table metadata files without updating each table's metadata location. When it is time to commit, swap the locations in one atomic operation. If there is a conflict, then refresh that table, cherry-pick the commits since the base metadata file, and retry with the new metadata location. * **Retry pending updates**: A transaction would keep track of all `PendingUpdate` instances that were created to modify the tables. These can already refresh and retry the individual operation's changes. The initial writes would configure and commit each change to a transaction. The transaction would perform the atomic swap of multiple locations. If there is a conflict, then refresh each table and re-commit the changes to it (just like single-table transactions). Then retry the atomic swap. I was expecting an API that used the second approach. That's what I see as the major problem to solve: how to coordinate operations across multiple tables in a way that allows the existing `PendingUpdate` retries, like an extension of the single-table transaction API. I think that keeping each configured `PendingUpdate` around is going to be an easier way to retry because the logic for retries is already written. Trying to cherry-pick changes means needing to know much more information than is tracked currently, and would require reproducing all of the changes from metadata and applying them. I think a normal operation example is helpful: ```sql DELETE FROM db.table WHERE id = 5; ``` Let's say that only replaces one file: delete `file_a.parquet`, add `file_b.parquet`. But it will also validate that no data has been added that could match the `WHERE` clause since the operation started, because it needs to scan the table to find where the records are located. For the cherry-pick approach, we can recover the deleted file, the added file, and the parent snapshot ID. But we can't recover the filter from the `WHERE` clause, nor do we know whether the validation was enabled (due to the isolation level chosen for the change). To make the cherry-pick approach work, we would need to write that additional information into table metadata and build operations that can use it. We did this for dynamic partition overwrite because it was just a flag, but I'm concerned that it would be a lot of work to do this for all of the operations. For the `PendingUpdate` approach, each operation is fully configured and kept around, so we can simply call `commit` again to re-apply the changes in the transaction. On the other hand, maybe the way I'm thinking about this is wrong because I'm assuming that we would want to apply the same validations (i.e., no files that might match `id = 5` have been added). Maybe for multi-table transactions we only want to validate that since the starting snapshot, no data changes have been made to a table for any table that is read. That would ensure correctness, and then we could simply check that the file changes can still be applied. This would make sense because we need to add the extra validations about other table states. As long as those are coarse validations based on only snapshot ID, then maybe it is okay that the commit changes are as well. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
