Stephen O'Donnell created HDDS-12114:
----------------------------------------

             Summary: Prevent delete commands running after a long lock wait 
and send ICR earlier
                 Key: HDDS-12114
                 URL: https://issues.apache.org/jira/browse/HDDS-12114
             Project: Apache Ozone
          Issue Type: Bug
          Components: Ozone Datanode
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


We have seen some instances where delete container commands are picked from the 
command queue within the SCM-defined deadline, but then run for a very long 
time in the handler. This causes SCM to think the delete has been dropped or 
failed, when it is actually still running.

The causes of the slow-running command could be:

1. Something else has a lock on the container for a long time, blocking the 
delete operation
2. Slow disk causing the removal of the container files to take a very long 
time.

To compound this problem, the ICR (incremental container report) confirming the 
delete is not sent until the very last stage of the delete process.

To combat this, two changes are included in this Jira:

1. Introduce a lock timeout of 60 seconds. If acquiring the lock and completing 
the pre-checks takes longer than this, the container delete is skipped.
2. Move the ICR to immediately after the point where the container is removed 
from the container set. At this stage, there is no way to recover the container 
without a DN restart and it makes sense to inform SCM that the container is 
logically removed ASAP.
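
As a rough illustration of the two changes above, here is a minimal Java 
sketch. It is not the Ozone implementation: the class, method and field names 
are hypothetical, the container set is reduced to a map of per-container locks, 
and the ICR is modelled as a string appended to a list. Only the 60 second 
timeout and the ordering (remove from the container set, send the ICR, then 
remove files) come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; none of these names are taken from the Ozone code base.
class DeleteContainerSketch {
  enum Result { DELETED, SKIPPED_TIMEOUT, NOT_FOUND }

  // Timeout proposed in the issue: 60 seconds for lock plus pre-checks.
  static final long LOCK_TIMEOUT_MILLIS = 60_000;

  // Stand-in for the datanode container set: container ID -> container lock.
  final Map<Long, ReentrantLock> containerSet = new ConcurrentHashMap<>();
  // Stand-in for what would be reported or done, recorded in order.
  final List<String> events = new ArrayList<>();

  Result delete(long containerId, long timeoutMillis) {
    ReentrantLock lock = containerSet.get(containerId);
    if (lock == null) {
      return Result.NOT_FOUND;
    }
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    try {
      // Change 1: wait for the lock only up to the timeout, then skip.
      if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
        return Result.SKIPPED_TIMEOUT;
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Result.SKIPPED_TIMEOUT;
    }
    try {
      // Pre-checks (state, replica validity, ...) would run here. If lock
      // wait plus pre-checks exceeded the deadline, skip the delete.
      if (System.nanoTime() - deadline > 0) {
        return Result.SKIPPED_TIMEOUT;
      }
      // Change 2: report as soon as the container is logically gone,
      // before the potentially slow on-disk removal.
      containerSet.remove(containerId);
      events.add("ICR:deleted:" + containerId);
      events.add("files-removed:" + containerId);
      return Result.DELETED;
    } finally {
      lock.unlock();
    }
  }
}
```

A caller would invoke delete(id, DeleteContainerSketch.LOCK_TIMEOUT_MILLIS) and 
treat SKIPPED_TIMEOUT as "not attempted", so that SCM can safely reissue the 
command later.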



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
