rdblue commented on pull request #1469: URL: https://github.com/apache/iceberg/pull/1469#issuecomment-696912501
@jacques-n, you may be interested in this discussion.

For a DELETE using position delete files, I think that this isn't quite correct: "data files referenced by new deletes must be still present". The logic for "no validation for delete files" applies to this case: if a data file was deleted, then it's okay to delete the row twice. The validation should be "data files referenced by new deletes must still be present or must be deleted; i.e., cannot be rewritten or overwritten."

For a DELETE using equality delete files, I'm not sure that snapshot isolation is distinct. If a data file that contains a matching row is added concurrently, then either that commit is first and the row _is_ deleted, or that commit is later and the row is appended. Either way, the operations are independent. There is no need to validate "no new potentially matching data files since we read" because there is not necessarily a read, and the delete applies to the data automatically.

UPDATE with position delete files looks correct to me.

UPDATE with equality delete files also looks correct, but I think it helps to think of that as UPSERT and not as UPDATE. A row that is concurrently written will have values from the last UPSERT operation. Those values almost certainly come from an external data source, because it makes little sense to read a row, update its values, and write it back using an equality delete that will delete all copies, including those written since the row was read.
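As a rough sketch of the position-delete rule proposed above: the check only needs to reject commits whose referenced data files were rewritten or overwritten, and it can let deleted files through. This is a toy model and not Iceberg's API; `FileChange`, `PositionDelete`, and `validate_position_deletes` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FileChange(Enum):
    """What a concurrent commit did to a data file (hypothetical labels)."""
    UNCHANGED = "unchanged"      # file is still present
    DELETED = "deleted"          # file was removed outright
    REWRITTEN = "rewritten"      # rows moved into new files, e.g. by compaction
    OVERWRITTEN = "overwritten"  # file replaced by an overwrite


@dataclass(frozen=True)
class PositionDelete:
    """A (data file, row position) pair, as stored in a position delete file."""
    data_file: str
    position: int


def validate_position_deletes(new_deletes, concurrent_changes):
    """Return the conflicting data files; an empty list means the commit may proceed.

    concurrent_changes maps data file path -> FileChange for files touched
    since the starting snapshot; files missing from the map are unchanged.
    """
    conflicts = []
    for path in sorted({d.data_file for d in new_deletes}):
        change = concurrent_changes.get(path, FileChange.UNCHANGED)
        # Present or deleted is fine: deleting a row twice is harmless.
        # Rewritten or overwritten is not: the rows may live on in a new
        # file that the position deletes no longer point at.
        if change in (FileChange.REWRITTEN, FileChange.OVERWRITTEN):
            conflicts.append(path)
    return conflicts
```

With this shape, a concurrent delete of a referenced file yields no conflicts, while a concurrent compaction of the same file does.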

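The equality-delete argument can be illustrated the same way. The sketch below replays a concurrent append and an equality delete in both commit orders, assuming a simplified model in which an equality delete removes every matching row committed before it and leaves later appends alone. `live_rows` and the commit tuples are hypothetical and stand in for real snapshots.

```python
def live_rows(commits):
    """Replay commits in order and return the rows visible afterward.

    Each commit is ("append", [rows]) or ("eq_delete", key), where a row is
    a (key, value) tuple. An equality delete removes every row with that key
    committed before it and leaves later appends untouched.
    """
    rows = []
    for kind, payload in commits:
        if kind == "append":
            rows.extend(payload)
        elif kind == "eq_delete":
            rows = [row for row in rows if row[0] != payload]
        else:
            raise ValueError(f"unknown commit kind: {kind}")
    return rows


base = ("append", [(1, "a"), (2, "b")])
concurrent_append = ("append", [(1, "c")])
delete_key_1 = ("eq_delete", 1)

# Append commits first: the delete removes the new row as well.
append_first = live_rows([base, concurrent_append, delete_key_1])

# Delete commits first: the new row is appended afterward and survives.
delete_first = live_rows([base, delete_key_1, concurrent_append])
```

Both outcomes match some serial order of the two operations, which is why no "no new matching data files" check seems necessary for this case.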