errose28 commented on PR #3741:
URL: https://github.com/apache/ozone/pull/3741#issuecomment-1258673684

   I've looked through the existing code, and if the RocksDB update fails for 
schema V3 I think we will be ok with a minor improvement to the current design. 
The volume may eventually be marked unhealthy, but even if it is not, I think 
this error is recoverable.
   
   ## Schema < V3
   
   For schema < V3 we have the following operations:
   1. Container directory renamed to tmp directory.
   2. Container is removed from in memory container set.
   3. Container is deleted from tmp directory.
   
   We had initially designed this to handle a failure in steps 1 or 3. Step 2 
is in memory, so it should not fail. For schema V3, however, we have an extra 
operation since the volume's RocksDB needs to be updated.
   
   ## Schema V3
   
   For schema V3 we have the following operations:
   1. Container directory renamed to tmp directory.
   2. Container is removed from in memory container set.
   3. Container's entries are removed from RocksDB.
   4. Container is deleted from tmp directory.
   
   The original design still handles failures in steps 1 and 4 above as 
intended, but what happens if step 3 fails? The container has been removed from 
the in memory container set in step 2, so there should not be any correctness 
issues while the datanode is running. On restart, the container set is rebuilt 
by scanning the available datanode volumes, and container metadata is populated 
based on the values in RocksDB. Since the container is already in the tmp 
directory, it will not be seen in this scan and will not be loaded into the 
container set, so we are still behaving correctly.
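
   To make the ordering concrete, here is a rough sketch of the four schema V3 
steps. It is not the actual datanode code: `SchemaV3DeleteSketch`, `VolumeDb`, 
and `deleteContainerEntries` are hypothetical names, and an in memory set 
stands in for the container set. It only illustrates that a failure in step 3 
leaves the container out of the set and parked in the tmp directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the schema V3 delete order; not Ozone's real classes. */
public class SchemaV3DeleteSketch {

  /** Stand-in for the volume's RocksDB (assumed interface, for illustration). */
  public interface VolumeDb {
    void deleteContainerEntries(long containerId) throws IOException;
  }

  private final Set<Long> containerSet = ConcurrentHashMap.newKeySet();
  private final VolumeDb db;
  private final Path tmpDir;

  public SchemaV3DeleteSketch(VolumeDb db, Path tmpDir) {
    this.db = db;
    this.tmpDir = tmpDir;
  }

  public void addContainer(long id) {
    containerSet.add(id);
  }

  public boolean isLoaded(long id) {
    return containerSet.contains(id);
  }

  public void delete(long id, Path containerDir) throws IOException {
    // Step 1: atomically rename the container directory into the tmp directory.
    Path moved = Files.move(containerDir,
        tmpDir.resolve(containerDir.getFileName()),
        StandardCopyOption.ATOMIC_MOVE);
    // Step 2: remove the container from the in memory container set.
    containerSet.remove(id);
    // Step 3: remove the container's RocksDB entries. If this throws, the
    // container is already out of the set and already in the tmp directory.
    db.deleteContainerEntries(id);
    // Step 4: delete the container from the tmp directory.
    deleteRecursively(moved);
  }

  static void deleteRecursively(Path root) throws IOException {
    List<Path> ordered;
    try (Stream<Path> paths = Files.walk(root)) {
      // Reverse order so that children are deleted before their parents.
      ordered = paths.sorted(Comparator.reverseOrder())
          .collect(Collectors.toList());
    }
    for (Path p : ordered) {
      Files.delete(p);
    }
  }
}
```

   If `deleteContainerEntries` throws, `delete` exits after step 2, so the only 
leftovers are the tmp directory copy and the stale RocksDB entries.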
   
   This does leave some orphaned entries in the volume's RocksDB, but I think 
these should be relatively easy to clean up. In the startup/shutdown hooks 
where we delete the containers from the tmp directory, we can check whether 
each container is schema V3. If it is, we should first delete any RocksDB 
entries it still has (if they exist) and then delete the container from the 
tmp directory. If this RocksDB delete fails, the container should be skipped 
and its delete will be retried on the next startup/shutdown. Since the 
container move to the tmp directory is atomic, we can check RocksDB for each 
schema V3 .container file we find in the tmp directory. If the contents of the 
tmp directory are partially deleted and there is no .container file (maybe 
from a failed delete from the tmp directory in the past), then the RocksDB 
entries must have already been removed, so we can proceed to delete those 
contents.
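
   A minimal sketch of that cleanup hook, under a few assumptions: 
`TmpDirCleanupSketch`, `VolumeDb`, and `cleanTmpDir` are hypothetical names, 
the schema check is passed in as a predicate rather than parsed from the 
.container file, and the container ID is assumed to be the .container file 
name without its extension.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the tmp directory cleanup; not Ozone's real classes. */
public class TmpDirCleanupSketch {

  /** Stand-in for the volume's RocksDB (assumed interface, for illustration). */
  public interface VolumeDb {
    void deleteContainerEntries(long containerId) throws IOException;
  }

  /** Returns the number of leftover container directories that were removed. */
  public static int cleanTmpDir(Path tmpDir, VolumeDb db,
      Predicate<Path> isSchemaV3) throws IOException {
    List<Path> leftovers;
    try (Stream<Path> entries = Files.list(tmpDir)) {
      leftovers = entries.collect(Collectors.toList());
    }
    int removed = 0;
    for (Path dir : leftovers) {
      Optional<Path> containerFile = findContainerFile(dir);
      if (containerFile.isPresent() && isSchemaV3.test(containerFile.get())) {
        try {
          // Remove any remaining RocksDB entries before touching the files.
          db.deleteContainerEntries(containerId(containerFile.get()));
        } catch (IOException e) {
          // Skip this container; it is retried on the next startup/shutdown.
          continue;
        }
      }
      // No .container file means an earlier tmp delete got partway through,
      // which only happens after the RocksDB entries were already removed.
      deleteRecursively(dir);
      removed++;
    }
    return removed;
  }

  static Optional<Path> findContainerFile(Path dir) throws IOException {
    try (Stream<Path> paths = Files.walk(dir)) {
      return paths
          .filter(p -> p.getFileName().toString().endsWith(".container"))
          .findFirst();
    }
  }

  // Assumption: the file is named "<containerId>.container".
  static long containerId(Path containerFile) {
    String name = containerFile.getFileName().toString();
    return Long.parseLong(
        name.substring(0, name.length() - ".container".length()));
  }

  static void deleteRecursively(Path root) throws IOException {
    List<Path> ordered;
    try (Stream<Path> paths = Files.walk(root)) {
      // Reverse order so that children are deleted before their parents.
      ordered = paths.sorted(Comparator.reverseOrder())
          .collect(Collectors.toList());
    }
    for (Path p : ordered) {
      Files.delete(p);
    }
  }
}
```

   Calling `cleanTmpDir` again after a failed RocksDB delete simply retries the 
skipped containers, which matches the retry-on-next-startup/shutdown behavior 
described above.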


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
