szehon-ho edited a comment on pull request #2925:
URL: https://github.com/apache/iceberg/pull/2925#issuecomment-1049253233


   @aokolnychyi regarding the open question of how to handle 'snapshot isolation', 
I gave it some thought:
   
   From the PostgreSQL documentation on Repeatable Read 
(https://www.postgresql.org/docs/13/transaction-iso.html#XACT-REPEATABLE-READ), 
which according to that page is almost the same as "Snapshot Isolation":
   
   > UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave 
the same as SELECT in terms of searching for target rows: they will only find 
target rows that were committed as of the transaction start time. However, such 
a target row might have already been updated (or deleted or locked) by another 
concurrent transaction by the time it is found. In this case, the repeatable 
read transaction will wait for the first updating transaction to commit or roll 
back (if it is still in progress). If the first updater rolls back, then its 
effects are negated and the repeatable read transaction can proceed with 
updating the originally found row. But if the first updater commits (and 
actually updated or deleted the row, not just locked it) then the repeatable 
read transaction will be rolled back with the message
   
   > ERROR:  could not serialize access due to concurrent update
   
   > because a repeatable read transaction cannot modify or lock rows changed 
by other transactions after the repeatable read transaction began.
   
   I think "insert overwrite" / ReplacePartitions is more like "Delete *" 
followed by "Insert *".  So I think we can follow the steps they describe 
here for DELETE and throw an exception if any row we are about to delete has 
been modified.  So here, if we see either (1) a delete file or (2) a deleted 
data file in the partition, we throw the exception; this could serve as the 
validation constraint.
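   
   As a rough illustration of that rule, here is a minimal sketch in plain 
Python. It does not use the Iceberg API; `FileChange`, `ChangeKind`, 
`ConcurrentModificationError` and `validate_replace_partitions` are all 
hypothetical names, and the list of changes committed since the starting 
snapshot is assumed to be given.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeKind(Enum):
    ADDED_DATA_FILE = "added_data_file"
    REMOVED_DATA_FILE = "removed_data_file"
    ADDED_DELETE_FILE = "added_delete_file"


@dataclass(frozen=True)
class FileChange:
    """One file-level change committed after our starting snapshot (hypothetical model)."""
    kind: ChangeKind
    partition: str


class ConcurrentModificationError(Exception):
    """Raised when a concurrent commit touched rows we are about to replace."""


# Kinds that mean "a row we read may have been modified or deleted".
CONFLICTING_KINDS = {ChangeKind.REMOVED_DATA_FILE, ChangeKind.ADDED_DELETE_FILE}


def validate_replace_partitions(replaced_partitions, changes_since_start):
    """Fail if any replaced partition saw a delete file or a removed data file."""
    for change in changes_since_start:
        if change.partition in replaced_partitions and change.kind in CONFLICTING_KINDS:
            raise ConcurrentModificationError(
                f"{change.kind.value} in partition {change.partition}"
            )
```

   In this sketch, newly added data files and changes in partitions that are 
not being replaced pass the check; only delete files and removed data files in 
a replaced partition raise.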
   
   It also makes sense to me from the definition of "Snapshot Isolation": 
   
   > snapshot isolation is a guarantee that all reads made in a transaction 
will see a consistent snapshot of the database
   
   Take a scenario where I use ReplacePartitions to do an update, i.e. 
   ```
   "INSERT OVERWRITE table PARTITION (date = '$date') 
SELECT table.salary + 1000 FROM table WHERE date = '$date'"
   ```
   Once I have read a snapshot of the row for modification, I should not update 
it if it has been modified since (i.e., if we see a possibly related delete file 
or deleted data file when we commit).
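   
   To make the risk concrete, here is a toy simulation of a blind overwrite 
silently undoing a concurrent delete. It uses plain dictionaries rather than 
any Iceberg API, and the row names and salaries are made up for illustration.

```python
# Toy model: a partition is a dict of row id -> salary.
snapshot_at_read = {"alice": 5000, "bob": 6000}

# The overwrite job computes its replacement rows from the snapshot it read.
replacement = {name: salary + 1000 for name, salary in snapshot_at_read.items()}

# Meanwhile, a concurrent transaction deletes bob's row and commits first.
current = dict(snapshot_at_read)
del current["bob"]

# Committing the overwrite without validation replaces the whole partition
# with rows derived from the stale snapshot.
current = replacement

# bob's row is back, so the concurrent delete was silently lost.
lost_delete = "bob" in current
```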
   
   I think new data files are fine: once we have chosen the set of rows to 
delete, newly added files do not affect those rows.
   
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


