szehon-ho edited a comment on pull request #2925: URL: https://github.com/apache/iceberg/pull/2925#issuecomment-1049253233
@aokolnychyi about the open question of what to do for "snapshot isolation", I gave it some thought.

From https://www.postgresql.org/docs/13/transaction-iso.html#XACT-REPEATABLE-READ (which, according to the docs, is almost the same as "Snapshot Isolation"):

> UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the repeatable read transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the repeatable read transaction will be rolled back with the message
>
> ERROR: could not serialize access due to concurrent update
>
> because a repeatable read transaction cannot modify or lock rows changed by other transactions after the repeatable read transaction began.

I think InsertOverwrite / ReplacePartitions is more like a "DELETE *" followed by an "INSERT *". So I think we may follow the steps they describe for DELETE and throw an exception if any row we are about to delete has been modified by the time we commit. Concretely: if at commit time we see either 1. a new delete file or 2. deleted data files in the partitions being replaced, we throw the exception as a potential validation constraint.

It also makes sense to me in a scenario where ReplacePartitions is used to do an update, i.e.

```sql
INSERT OVERWRITE employees PARTITIONS (date = '$date') SELECT name, salary + 1000, id FROM employees
```

Once I read a snapshot of a row for modification, I should not be updating it if it has been deleted by another concurrent transaction, as that would undo the delete. New data files, I think, are fine, as they won't affect the set of rows we mark to be deleted once we start (a rough sketch of this proposed check is below).

What do you think?
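To make the proposal concrete, here is a minimal usage sketch in Java. The validation method names (`validateFromSnapshot`, `validateNoConflictingDeletes`) follow the naming pattern of Iceberg's existing `RowDelta` validations and are assumptions for illustration, not necessarily the final API of this PR; `newReplacePartitions()`, `addFile`, and `ValidationException` are the existing Iceberg API.

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ReplacePartitions;
import org.apache.iceberg.Table;
import org.apache.iceberg.exceptions.ValidationException;

public class ReplacePartitionsValidationSketch {

  /**
   * Commits an INSERT OVERWRITE as a ReplacePartitions operation, validating
   * the snapshot-isolation rule discussed above: fail if, since the snapshot
   * we read from, a concurrent commit added delete files to (or removed data
   * files from) any partition this overwrite replaces. Concurrent appends of
   * new data files alone are allowed, since they do not change the set of
   * rows this overwrite marks for deletion.
   */
  public static void commitOverwrite(Table table, long readSnapshotId,
                                     Iterable<DataFile> newFiles) {
    ReplacePartitions replace = table.newReplacePartitions();
    newFiles.forEach(replace::addFile);

    replace
        .validateFromSnapshot(readSnapshotId)  // assumed: start of the validation window
        .validateNoConflictingDeletes();       // assumed: rejects delete files / deleted data files

    try {
      replace.commit();
    } catch (ValidationException e) {
      // Analogous to PostgreSQL's
      // "ERROR: could not serialize access due to concurrent update"
      throw e;
    }
  }
}
```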
