RussellSpitzer opened a new issue #2317: URL: https://github.com/apache/iceberg/issues/2317
The current logic for doing a commit is, at a high level, as follows:

1. Write all files, including a new-metadata.json, for the operation.
2. Acquire a lock.
3. Swap the pointer from old-metadata.json to new-metadata.json.
4. Release the lock.

If we fail during step 3 we always attempt to clean up the files we created in step 1. This is a problem when step 3 (the swap) has succeeded server side but the client has not received an acknowledgment. It leads to a state where:

1. The catalog is pointing to new-metadata.json.
2. Our client is actively removing old-metadata.json and all files which were added for the operation.

Future clients which are able to contact the HMS will see the new-metadata.json location, attempt to read it, and fail. What's worse, if multiple clients are working with this table, there is a window of time in which a second client can read new-metadata.json before it is removed and can build a new-new-metadata.json based on it. The new-new-metadata.json will then reference files which are in the process of being removed by the first client.

To avoid this we need to handle failures in step 3 of table commits slightly differently. Basically, we need to group failures into two categories:

1. Failures that the catalog reports because it could not perform the operation for some reason.
2. Failures where the client has lost contact for some reason (basically everything else).

Type 1 failures can be cleaned up: we know the commit did not succeed, and the files we have generated are more or less useless. Type 2 failures must still be reported as failures, but cannot be retried and cannot be cleaned up (see the sketch below). We do not know whether the catalog pointer was swapped to the new metadata.json or not, and thus we cannot resolve the situation until communication with the HMS is restored. Since we are usually talking about a client perspective here, I believe the right thing to do is to fail and let the user know they need to manually check and clean up files if necessary.

I haven't checked other catalog implementations to see if they have similar vulnerabilities, so we should probably check those as well. The HMS code in question is [here](https://github.com/apache/iceberg/blob/c7ff6d51377eaae7f476e0c0730dea3b5a84fabb/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L214-L229).

CC: @aokolnychyi, @karuppayya, @raptond
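To illustrate the proposed split, here is a minimal sketch in the spirit of `HiveTableOperations.commit()`, not the actual code. The helpers `swapMetadataPointer` and `cleanupNewMetadata` and the `UnknownCommitStateException` type are hypothetical stand-ins; only `CommitFailedException` is an existing Iceberg exception type.

```java
import org.apache.iceberg.exceptions.CommitFailedException;

class CommitSketch {

  /** Hypothetical: thrown when we cannot tell whether the catalog applied the swap. */
  static class UnknownCommitStateException extends RuntimeException {
    UnknownCommitStateException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  void commit(String oldMetadataLocation, String newMetadataLocation) {
    try {
      // Step 3: ask the catalog (HMS here) to swap the metadata pointer.
      swapMetadataPointer(oldMetadataLocation, newMetadataLocation);
    } catch (CommitFailedException e) {
      // Type 1: the catalog definitively rejected the swap, so the new
      // metadata files are unreachable and safe to delete.
      cleanupNewMetadata(newMetadataLocation);
      throw e;
    } catch (RuntimeException e) {
      // Type 2: we lost contact mid-request; the swap may or may not have
      // happened server side. Do NOT retry and do NOT clean up; surface
      // the ambiguity to the caller instead.
      throw new UnknownCommitStateException(
          "Cannot determine whether the commit of " + newMetadataLocation
              + " succeeded; manual inspection may be required", e);
    }
  }

  // Hypothetical stand-in for the real HMS alter-table call.
  private void swapMetadataPointer(String from, String to) {}

  // Hypothetical stand-in for deleting the files written in step 1.
  private void cleanupNewMetadata(String location) {}
}
```

Note that the unknown-state branch has to come last and catch broadly, since any transport-level failure (for example a Thrift `TException` wrapped in a runtime exception) leaves the commit outcome ambiguous, and only failures the catalog itself reports as rejections can safely trigger cleanup.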
