[
https://issues.apache.org/jira/browse/HDDS-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephen O'Donnell updated HDDS-12114:
-------------------------------------
Target Version/s: (was: 1.4.2)
> Prevent delete commands running after a long lock wait and send ICR earlier
> ---------------------------------------------------------------------------
>
> Key: HDDS-12114
> URL: https://issues.apache.org/jira/browse/HDDS-12114
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0, 1.4.2
>
>
> We have seen some instances where delete container commands are picked from
> the command queue within the SCM-defined deadline. However, they then run for
> a very long time in the handler. This causes SCM to think the delete has been
> dropped or failed, when it is actually still running.
> The causes of the slow-running command could be:
> 1. Something else holds a lock on the container for a long time, blocking the
> delete operation.
> 2. A slow disk causes the removal of the container files to take a very long
> time.
> To compound this problem, an incremental container report (ICR) confirming
> the delete is not sent until the very last stage of the delete process.
> To combat this, two changes are included in this Jira:
> 1. Introduce a lock timeout of 60 seconds. If it takes longer than this for
> the lock and pre-checks to complete, the container delete is skipped.
> 2. Move the ICR to immediately after the point where the container is removed
> from the container set. At this stage, there is no way to recover the
> container without a DN restart and it makes sense to inform SCM that the
> container is logically removed ASAP.
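The two changes described above can be sketched roughly as follows. This is a minimal illustration only, not the code from the actual patch: the class DeleteContainerSketch, its Container type, the preChecks hook, and the events list are all hypothetical stand-ins rather than Ozone classes, and the ICR and file removal are represented by recorded strings. It assumes the timeout covers both the lock wait and the pre-checks, as the description states, and that the ICR is emitted right after the container leaves the container set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class DeleteContainerSketch {
    /** Hypothetical stand-in for a datanode container; not an Ozone class. */
    static final class Container {
        final long id;
        final ReentrantLock lock = new ReentrantLock();
        Container(long id) { this.id = id; }
    }

    // Hypothetical in-memory container set and a log of what the handler did.
    final Map<Long, Container> containerSet = new ConcurrentHashMap<>();
    final List<String> events = new ArrayList<>();
    // Hypothetical hook for the pre-delete checks; a no-op by default.
    Runnable preChecks = () -> { };

    void addContainer(long id) {
        containerSet.put(id, new Container(id));
    }

    /** Returns true if the container was deleted, false if the delete was skipped. */
    boolean deleteContainer(long id, long timeoutMs) {
        Container c = containerSet.get(id);
        if (c == null) {
            return false; // unknown container, nothing to delete
        }
        long start = System.nanoTime();
        boolean locked;
        try {
            // Change 1: bound the lock wait instead of blocking indefinitely.
            locked = c.lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!locked) {
            events.add("skipped: lock wait timed out");
            return false;
        }
        try {
            preChecks.run();
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            if (elapsedMs > timeoutMs) {
                // Lock wait plus pre-checks took too long, so skip the delete.
                events.add("skipped: lock and pre-checks exceeded timeout");
                return false;
            }
            // Change 2: report the delete as soon as the container is
            // removed from the container set, before the slow disk work.
            containerSet.remove(id);
            events.add("ICR sent");
            events.add("files removed"); // placeholder for on-disk removal
            return true;
        } finally {
            c.lock.unlock();
        }
    }

    public static void main(String[] args) {
        DeleteContainerSketch dn = new DeleteContainerSketch();
        dn.addContainer(1L);
        // 60 seconds, matching the timeout proposed in the description.
        boolean deleted = dn.deleteContainer(1L, 60_000);
        System.out.println(deleted + " " + dn.events);
    }
}
```

In this sketch a skipped delete leaves the container in the container set, so a later retry of the command can still find it; whether the real handler behaves the same way is not stated in the description.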