Stephen O'Donnell created HDDS-12114:
----------------------------------------
Summary: Prevent delete commands running after a long lock wait
and send ICR earlier
Key: HDDS-12114
URL: https://issues.apache.org/jira/browse/HDDS-12114
Project: Apache Ozone
Issue Type: Bug
Components: Ozone Datanode
Reporter: Stephen O'Donnell
Assignee: Stephen O'Donnell
We have seen some instances where delete container commands are picked from the
command queue within the SCM-defined deadline. However, they then run for a very
long time in the handler. This causes SCM to think the delete has been dropped or
failed, when it is actually still running.
Possible causes of the slow-running command are:
1. Something else has a lock on the container for a long time, blocking the
delete operation
2. Slow disk causing the removal of the container files to take a very long
time.
To compound this problem, an ICR confirming the delete is not sent until the
very last stage of the delete process.
To combat this, two changes are included in this Jira:
1. Introduce a lock timeout of 60 seconds. If it takes longer than this for the
lock and pre-checks to complete, the container delete is skipped.
2. Move the ICR to immediately after the point where the container is removed
from the container set. At this stage, there is no way to recover the container
without a DN restart and it makes sense to inform SCM that the container is
logically removed ASAP.
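
The two changes can be illustrated with a minimal sketch. It is not the actual
datanode code: the class DeleteContainerSketch, its methods, and the event
strings are hypothetical names made up for illustration, and the timeout is a
constructor parameter rather than the 60 seconds proposed here. It assumes a
per-container lock that can be acquired with a bounded wait.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the datanode delete path; not Ozone's real classes.
class DeleteContainerSketch {
  enum Result { DELETED, SKIPPED_TIMEOUT, NOT_FOUND }

  // Container ID -> per-container lock (stands in for the container set).
  final Map<Long, ReentrantLock> containerSet = new ConcurrentHashMap<>();
  // Records the order of the delete steps so the ordering can be inspected.
  final List<String> events = new ArrayList<>();
  final long timeoutMillis;

  DeleteContainerSketch(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  void addContainer(long id) {
    containerSet.put(id, new ReentrantLock());
  }

  Result deleteContainer(long id) {
    ReentrantLock lock = containerSet.get(id);
    if (lock == null) {
      return Result.NOT_FOUND;
    }
    long start = System.nanoTime();
    boolean locked = false;
    try {
      // Change 1: wait a bounded time for the lock instead of blocking forever.
      locked = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
      if (!locked) {
        return Result.SKIPPED_TIMEOUT;
      }
      runPreChecks(id);
      // The lock wait plus pre-checks must also fit within the timeout.
      long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      if (elapsed > timeoutMillis) {
        return Result.SKIPPED_TIMEOUT;
      }
      containerSet.remove(id);
      events.add("removed-from-set:" + id);
      // Change 2: report to SCM as soon as the container is logically gone,
      // before the potentially slow removal of files from disk.
      events.add("icr-sent:" + id);
      events.add("files-deleted:" + id);
      return Result.DELETED;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Result.SKIPPED_TIMEOUT;
    } finally {
      if (locked) {
        lock.unlock();
      }
    }
  }

  // Placeholder for the checks done before the container is removed.
  void runPreChecks(long id) { }
}
```

In this sketch a delete that cannot get the lock in time leaves the container
in the set untouched, so the command can be retried later. A delete that does
proceed records the ICR before the file removal, so a slow disk no longer
delays the report to SCM.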