sodonnel opened a new pull request, #7726:
URL: https://github.com/apache/ozone/pull/7726

   ## What changes were proposed in this pull request?
   
   We have seen some instances where delete container commands are picked from 
the command queue within the SCM-defined deadline but then run for a very 
long time in the handler. This causes SCM to think the delete has been dropped 
or failed, when it is actually still running.
   
   The causes of the slow-running command could be:
   
   1. Something else has a lock on the container for a long time, blocking the 
delete operation
   2. Slow disk causing the removal of the container files to take a very long 
time.
   
   To compound this problem, the ICR (incremental container report) confirming 
the delete is not sent until the very last stage of the delete process.
   
   To combat this, two changes are included in this PR:
   
   1. Introduce a lock timeout of 60 seconds. If it takes longer than this for 
the lock and pre-checks to complete, the container delete is skipped.
   2. Move the ICR to immediately after the point where the container is 
removed from the container set. At this stage, there is no way to recover the 
container without a DN restart and it makes sense to inform SCM that the 
container is logically removed ASAP.
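
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: the class name, `containerSet`, `sentIcrs`, and `removeFilesFromDisk` are hypothetical stand-ins, and a plain `ReentrantLock` is assumed in place of the real container lock. The pre-checks are omitted, but the deadline check is placed where they would have finished.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class DeleteContainerSketch {
  // Hypothetical stand-in for the container set: container ID -> its lock.
  public final Map<Long, ReentrantLock> containerSet = new ConcurrentHashMap<>();
  // Hypothetical stand-in for ICRs sent to SCM: records the container IDs reported.
  public final List<Long> sentIcrs = new CopyOnWriteArrayList<>();

  /** Returns true if the delete ran, false if it was skipped. */
  public boolean deleteContainer(long id, long timeout, TimeUnit unit) {
    ReentrantLock lock = containerSet.get(id);
    if (lock == null) {
      return false; // unknown container, nothing to delete
    }
    long start = System.nanoTime();
    try {
      // Change 1: give up instead of waiting indefinitely for the lock.
      if (!lock.tryLock(timeout, unit)) {
        return false;
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    try {
      // Pre-checks would run here; skip if lock plus pre-checks overran the timeout.
      if (System.nanoTime() - start > unit.toNanos(timeout)) {
        return false;
      }
      containerSet.remove(id);
      // Change 2: report as soon as the container is logically gone,
      // before the potentially slow on-disk cleanup.
      sentIcrs.add(id);
      removeFilesFromDisk(id);
      return true;
    } finally {
      lock.unlock();
    }
  }

  // Placeholder for the file removal, which may be slow on a bad disk.
  void removeFilesFromDisk(long id) {
  }
}
```

With this ordering, a slow disk delays only the cleanup step, not the report to SCM, and a lock held elsewhere causes the command to be skipped rather than to run past the deadline unnoticed.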
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-12114
   
   ## How was this patch tested?
   
   New unit test added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

