RussellSpitzer opened a new issue #2317:
URL: https://github.com/apache/iceberg/issues/2317


   At a high level, the current commit logic is as follows:
   
   1. Write all files, including a new-metadata.json for the operation
   2. Acquire a lock
   3. Swap the pointer from old-metadata.json to new-metadata.json
   4. Release the lock
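   The sequence above can be sketched as follows. This is only an illustration of the four steps, not the actual HiveTableOperations code; the class and field names here are hypothetical.
   
   ```java
   import java.util.concurrent.atomic.AtomicReference;
   
   class CommitSketch {
       // Stand-in for the catalog's pointer to the current metadata file.
       static final AtomicReference<String> catalogPointer =
           new AtomicReference<>("old-metadata.json");
       static final Object lock = new Object();
   
       static void commit(String newMetadataLocation) {
           // 1. Write all files, including the new metadata.json (elided here).
           // 2. Acquire the lock.
           synchronized (lock) {
               // 3. Swap the pointer from the old metadata file to the new one.
               catalogPointer.set(newMetadataLocation);
           }
           // 4. Release the lock (on exiting the synchronized block).
       }
   
       public static void main(String[] args) {
           commit("new-metadata.json");
           System.out.println(catalogPointer.get());
       }
   }
   ```
   
   The failure mode described below arises when step 3 throws on the client side even though the swap already happened on the server.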
   
   If we fail during step 3, we always attempt to clean up the files we created in step 1. This is a problem when step 3 (the swap) has succeeded server side but the client has not received an acknowledgment. This leads to a state where
   
   1. Catalog is pointing to new-metadata.json
   2. Our client is actively removing old-metadata.json and all files which 
were added for the operation
   
   Future clients which are able to contact the HMS will see the 
new-metadata.json location, attempt to read it, and fail.
   
   Worse, if multiple clients are working with this table, there is a 
window of time in which a second client can read new-metadata.json before it 
is removed and build a new-new-metadata.json based on it. That 
new-new-metadata.json will then reference files which are in the process of 
being removed by the first client.
   
   
   To avoid this we need to handle failures in stage 3 of table commits 
slightly differently. Basically, we need to group failures into two categories:
   
   1. Failures reported by the Catalog, indicating that it could not perform 
the operation for some reason
   2. Failures where the client has lost contact for some reason (basically 
everything else)
   
   Type 1 failures can be cleaned up; we know the commit did not succeed, and 
the files we have generated are more or less useless.
   
   Type 2 failures must still be reported as failures, but they cannot be 
retried and cannot be cleaned up. We do not know whether the catalog's pointer 
now references the new metadata.json or not, and we cannot resolve the 
situation until communication with the HMS is restored. Since we are usually 
talking about a client here, I believe the right thing to do is to fail and 
let the user know they need to manually check and clean up files if necessary.
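   One way to sketch the two-category handling proposed above is with distinct exception types for the two failure modes. The exception and method names below are illustrative assumptions, not Iceberg's actual API:
   
   ```java
   class CommitHandlingSketch {
       // Type 1: the catalog definitively rejected the commit.
       static class CommitFailedException extends RuntimeException {}
       // Type 2: contact was lost; the commit may or may not have succeeded.
       static class CommitStateUnknownException extends RuntimeException {}
   
       static String attemptCommit(boolean catalogRejected, boolean lostContact) {
           try {
               doSwap(catalogRejected, lostContact);
               return "committed";
           } catch (CommitFailedException e) {
               // Type 1: the commit did not happen, so the files written for
               // this operation are useless and safe to delete.
               cleanupNewFiles();
               return "cleaned up";
           } catch (CommitStateUnknownException e) {
               // Type 2: do NOT clean up and do NOT retry. The new
               // metadata.json may already be the catalog's current pointer,
               // so surface the failure and leave the files in place.
               return "left in place";
           }
       }
   
       // Simulates step 3 (the pointer swap) failing in one of the two ways.
       static void doSwap(boolean catalogRejected, boolean lostContact) {
           if (catalogRejected) throw new CommitFailedException();
           if (lostContact) throw new CommitStateUnknownException();
       }
   
       static void cleanupNewFiles() {
           // Delete new-metadata.json and the data files written in step 1 (elided).
       }
   
       public static void main(String[] args) {
           System.out.println(attemptCommit(false, false)); // committed
           System.out.println(attemptCommit(true, false));  // cleaned up
           System.out.println(attemptCommit(false, true));  // left in place
       }
   }
   ```
   
   The key point is that only the first catch block triggers cleanup; the second reports failure while leaving both old and new metadata untouched.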
   
   I haven't checked other catalog implementations to see if they have similar 
vulnerabilities, so we should probably check those as well. The HMS code 
in question is 
[here](https://github.com/apache/iceberg/blob/c7ff6d51377eaae7f476e0c0730dea3b5a84fabb/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L214-L229)
   
   CC: @aokolnychyi , @karuppayya , @raptond 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
