[ https://issues.apache.org/jira/browse/HDDS-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddhant Sangwan resolved HDDS-11136.
-------------------------------------
Fix Version/s: 1.5.0
Target Version/s: 1.5.0
Resolution: Fixed
> Some containers affected by HDDS-8129 may still be in the DELETING state
> incorrectly
> ------------------------------------------------------------------------------------
>
> Key: HDDS-11136
> URL: https://issues.apache.org/jira/browse/HDDS-11136
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode, SCM
> Reporter: Ethan Rose
> Assignee: Siddhant Sangwan
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.5.0
>
> Attachments: Stuck Deleting Containers.pdf
>
>
> The bug described in HDDS-8129 would cause containers to have block counts
> lower than their correct value. In versions of the code before the issue was
> fixed, this could cause the block count to reach zero too early, so SCM would
> move the containers to DELETING state, issue delete commands to datanodes,
> and move containers to DELETED when the replicas were gone. However, in the
> window between a datanode sending a heartbeat with a zero block count and
> SCM sending back the delete command, the block deleting service may have run
> and made the container's block count negative on the datanode. In this case,
> the datanode rejects the delete command when it arrives, even in the old
> version before the fixes, because the counter is [not equal to
> zero|https://github.com/apache/ozone/blob/08263b44ce1422711e1fa70797bf349e4bb3f56b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java#L1181]
> (this code link is to a version before the deletion path was fixed).
> These containers are stuck such that SCM's state is DELETING and it keeps
> resending delete commands, but datanodes block the deletion and the container
> may still have valid data. Containers that entered this state in old versions
> have remained in this state indefinitely, even after the fixes. This is
> because the delete commands are being sent based on SCM's DELETING state for
> the container, not the status of its block content as reported by datanodes
> after the fixes. The fixes prevent containers from moving from CLOSED to
> DELETING incorrectly but do nothing for containers already in that state.
> Since DELETING containers are not processed by the replication manager, SCM
> needs a way to move them back to CLOSED when a datanode rejects the deletion.
> This is required to fully recover from the effects of HDDS-8129.
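
The recovery rule described above can be sketched roughly as follows. This is only an illustrative model, not the actual patch: the class StuckDeletingSketch, the ReplicaReport type, and the reconcile method are hypothetical names that do not exist in the Ozone code base, and the assumption that a non-zero reported block count stands in for "the datanode rejects the deletion" is a simplification of the check linked in the description.

```java
import java.util.List;

/** Hypothetical model of the recovery rule; not Ozone code. */
final class StuckDeletingSketch {

    /** Simplified subset of SCM container lifecycle states. */
    enum State { CLOSED, DELETING, DELETED }

    /** Assumed shape of what a datanode reports for one replica. */
    static final class ReplicaReport {
        final long blockCount;

        ReplicaReport(long blockCount) {
            this.blockCount = blockCount;
        }

        // Models the datanode-side check from the description: the delete
        // command is only accepted when the counter is exactly zero, so a
        // negative counter is rejected just like a positive one.
        boolean wouldAcceptDelete() {
            return blockCount == 0;
        }
    }

    /**
     * Decides the next SCM state for a container from its current state and
     * the replicas that datanodes still report (assumed inputs).
     */
    static State reconcile(State scmState, List<ReplicaReport> replicas) {
        if (scmState != State.DELETING) {
            return scmState;            // only DELETING containers are stuck
        }
        if (replicas.isEmpty()) {
            return State.DELETED;       // all replicas are gone
        }
        for (ReplicaReport replica : replicas) {
            if (!replica.wouldAcceptDelete()) {
                return State.CLOSED;    // deletion would be rejected: roll back
            }
        }
        return State.DELETING;          // keep sending delete commands
    }

    private StuckDeletingSketch() { }
}
```

In this model, a DELETING container with a replica reporting a block count of -2 goes back to CLOSED, where the replication manager would handle it again, while one whose replicas all report zero blocks stays in DELETING until the replicas disappear.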
--
This message was sent by Atlassian Jira
(v8.20.10#820010)