errose28 commented on PR #3741: URL: https://github.com/apache/ozone/pull/3741#issuecomment-1258673684
I've looked through the existing code, and if the RocksDB update fails for schema V3 I think we will be ok with a minor improvement to the current design. The volume may eventually be marked unhealthy, but even if it is not, I think this error is recoverable.

## Schema < V3

For schema < V3 we have the following operations:

1. Container directory renamed to tmp directory.
2. Container is removed from in memory container set.
3. Container is deleted from tmp directory.

We had initially designed to handle a failure in steps 1 or 3. Step 2 is in memory so it should not fail. For schema V3, however, we have an extra operation since the volume's RocksDB needs to be updated.

## Schema V3

For schema V3 we have the following operations:

1. Container directory renamed to tmp directory.
2. Container is removed from in memory container set.
3. Container's entries are removed from RocksDB.
4. Container is deleted from tmp directory.

The original design still handles failures in steps 1 and 4 above as intended, but what happens if step 3 fails? The container has been removed from the in memory container set in step 2, so there should not be any correctness issues while the datanode is running. On restart, the container set is rebuilt by scanning the available datanode volumes, and container metadata is populated based on the values in RocksDB. Since the container is already in the tmp directory, it will not be seen in this step and will not be loaded into the container set, so we are still behaving correctly.

This does leave some orphaned entries in the volume's RocksDB, but I think these should be relatively easy to clean up. On the startup/shutdown hooks where we are deleting the containers from the tmp directory, we can check if the container is schema V3. If it is, we should first delete any remaining RocksDB entries it may have (if they exist) and then we can delete the container from the tmp directory.
If this RocksDB delete fails the container should be skipped and its delete will happen again on the next startup/shutdown. Since the container move to the tmp directory is atomic, we can check RocksDB for each schema V3 .container file we find in the tmp directory. If the contents of the tmp directory are partially deleted and there is no .container file (maybe a failed delete from the tmp directory in the past) then the RocksDB entries must have already been removed so we can proceed to delete those contents.
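
To make the proposed startup/shutdown cleanup concrete, here is a rough sketch of the logic. It is written in Python for brevity rather than against Ozone's actual Java classes, and everything in it is an assumption for illustration: `FakeVolumeDB` stands in for the volume's RocksDB, `cleanup_tmp_dir` and `read_schema_version` are hypothetical names, and the `.container` file is assumed to hold only the schema version.

```python
import shutil
import tempfile
from pathlib import Path


class FakeVolumeDB:
    """Hypothetical stand-in for the per-volume RocksDB used by schema V3."""

    def __init__(self, fail_on=()):
        self.entries = {}            # container id -> metadata
        self.fail_on = set(fail_on)  # container ids whose delete should fail

    def delete_container_entries(self, container_id):
        if container_id in self.fail_on:
            raise IOError(f"simulated RocksDB failure for {container_id}")
        # Removing entries that are already gone is a no-op.
        self.entries.pop(container_id, None)


def read_schema_version(container_file):
    """Assumed format: the .container file contains just the schema version."""
    return int(container_file.read_text().strip())


def cleanup_tmp_dir(tmp_dir, db):
    """Delete leftover containers from tmp_dir; return the ids that were skipped."""
    skipped = []
    for container_dir in sorted(Path(tmp_dir).iterdir()):
        container_file = container_dir / f"{container_dir.name}.container"
        if container_file.exists() and read_schema_version(container_file) == 3:
            try:
                # Schema V3: clear any remaining RocksDB entries first.
                db.delete_container_entries(container_dir.name)
            except IOError:
                # Leave the directory in place so the next hook retries it.
                skipped.append(container_dir.name)
                continue
        # Older schema, or no .container file left (partially deleted):
        # nothing to clear in RocksDB, so remove the directory contents.
        shutil.rmtree(container_dir)
    return skipped


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        container = Path(tmp) / "7"
        container.mkdir()
        (container / "7.container").write_text("3")
        db = FakeVolumeDB(fail_on={"7"})
        db.entries["7"] = "metadata"
        print("first pass skipped:", cleanup_tmp_dir(tmp, db))
        db.fail_on.clear()
        print("second pass skipped:", cleanup_tmp_dir(tmp, db))
```

The ordering is the important part: the directory is only removed after the RocksDB delete succeeds, which is what lets a missing .container file be read as "entries already removed".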
