szehon-ho edited a comment on pull request #2925: URL: https://github.com/apache/iceberg/pull/2925#issuecomment-1049253233
@aokolnychyi about the open question of what to do for "snapshot isolation", I gave it some thought.

From https://www.postgresql.org/docs/13/transaction-iso.html#XACT-REPEATABLE-READ (which, according to the docs, is almost the same as "Snapshot Isolation"):

> UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the repeatable read transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the repeatable read transaction will be rolled back with the message
>
> ERROR: could not serialize access due to concurrent update
>
> because a repeatable read transaction cannot modify or lock rows changed by other transactions after the repeatable read transaction began.

I think InsertOverwrite / ReplacePartitions is more like a "DELETE *" followed by an "INSERT *". So I think we may follow the steps they describe for DELETE and throw an exception if any row we are about to delete has been modified by the time we commit. Concretely: if at commit time we see either 1. a new delete file or 2. deleted data files in the partitions being replaced, we throw the exception as a potential validation constraint.

It also makes sense to me in a scenario where ReplacePartitions is used to do an update, i.e.

```sql
INSERT OVERWRITE employees PARTITIONS (date = '$date') SELECT name, salary + 1000, id FROM employees
```

Once I read a snapshot of a row for modification, I should not be updating it if it has been deleted by another concurrent transaction, as that would undo the delete. New data files, I think, are fine, as they won't affect the set of rows we mark to be deleted once we start (a rough sketch of this proposed check is below).

What do you think?
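To make the proposal concrete, here is a minimal usage sketch in Java. The validation method names (`validateFromSnapshot`, `validateNoConflictingDeletes`) follow the naming pattern of Iceberg's existing `RowDelta` validations and are assumptions for illustration, not necessarily the final API of this PR; `newReplacePartitions()`, `addFile`, and `ValidationException` are the existing Iceberg API.

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ReplacePartitions;
import org.apache.iceberg.Table;
import org.apache.iceberg.exceptions.ValidationException;

public class ReplacePartitionsValidationSketch {

  /**
   * Commits an INSERT OVERWRITE as a ReplacePartitions operation, validating
   * the snapshot-isolation rule discussed above: fail if, since the snapshot
   * we read from, a concurrent commit added delete files to (or removed data
   * files from) any partition this overwrite replaces. Concurrent appends of
   * new data files alone are allowed, since they do not change the set of
   * rows this overwrite marks for deletion.
   */
  public static void commitOverwrite(Table table, long readSnapshotId,
                                     Iterable<DataFile> newFiles) {
    ReplacePartitions replace = table.newReplacePartitions();
    newFiles.forEach(replace::addFile);

    replace
        .validateFromSnapshot(readSnapshotId)  // assumed: start of the validation window
        .validateNoConflictingDeletes();       // assumed: rejects delete files / deleted data files

    try {
      replace.commit();
    } catch (ValidationException e) {
      // Analogous to PostgreSQL's
      // "ERROR: could not serialize access due to concurrent update"
      throw e;
    }
  }
}
```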
