[jira] [Resolved] (HDDS-11136) Some containers affected by HDDS-8129 may still be in the DELETING state incorrectly

Siddhant Sangwan (Jira) Mon, 29 Jul 2024 00:50:08 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Siddhant Sangwan resolved HDDS-11136.
-------------------------------------
       Fix Version/s: 1.5.0
    Target Version/s: 1.5.0
          Resolution: Fixed

> Some containers affected by HDDS-8129 may still be in the DELETING state 
> incorrectly
> ------------------------------------------------------------------------------------
>
>                 Key: HDDS-11136
>                 URL: https://issues.apache.org/jira/browse/HDDS-11136
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode, SCM
>            Reporter: Ethan Rose
>            Assignee: Siddhant Sangwan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.5.0
>
>         Attachments: Stuck Deleting Containers.pdf
>
>
> The bug described in HDDS-8129 would cause containers to have block counts 
> lower than their correct value. In versions of the code before the issue was 
> fixed, this could cause the block count to reach zero too early, so SCM would 
> move the containers to DELETING state, issue delete commands to datanodes, 
> and move containers to DELETED when the replicas were gone. However, it's 
> possible that between when the datanodes sent a heartbeat with zero block 
> counts and SCM sent back delete commands, the block deleting service ran and 
> made the container's block count negative on the datanode. In this case, when 
> the datanode gets the delete command, it will reject it, even in the old 
> version before the fixes, because the counter is [not equal to 
> zero|https://github.com/apache/ozone/blob/08263b44ce1422711e1fa70797bf349e4bb3f56b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java#L1181]
>  (this code link is to a version before the deletion path was fixed).
> These containers are stuck such that SCM's state is DELETING and it keeps 
> resending delete commands, but datanodes block the deletion and the container 
> may still have valid data. Containers that entered this state in old versions 
> have remained in this state indefinitely, even after the fixes. This is 
> because the delete commands are being sent based on SCM's DELETING state for 
> the container, not the status of its block content as reported by datanodes 
> after the fixes. The fixes prevent containers from moving from CLOSED to 
> DELETING incorrectly but do nothing for containers already in that state.
> Since DELETING containers are not processed by the replication manager, 
> fully recovering from the effects of HDDS-8129 requires a way for SCM to 
> move a container's state back to CLOSED when a datanode rejects the deletion.
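
The failure mode and the requested recovery in the description above can be 
modelled in a few lines. What follows is a minimal sketch, not Ozone code: 
every class, method, and flag in it (StuckDeletingSketch, Replica, Scm, 
recoverOnReject, simulate) is hypothetical, a single replica stands in for 
all of a container's replicas, and only the state names CLOSED, DELETING and 
DELETED and the "reject unless the counter equals zero" rule come from the 
issue text. It is not a description of how the actual fix was implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; none of these types exist in Apache Ozone.
final class StuckDeletingSketch {

    /** SCM-side lifecycle states named in the issue description. */
    enum State { CLOSED, DELETING, DELETED }

    /** Hypothetical datanode replica reduced to its block counter. */
    static final class Replica {
        private long blockCount;

        Replica(long blockCount) { this.blockCount = blockCount; }

        long blockCount() { return blockCount; }

        /** Stand-in for the block deleting service removing one block. */
        void deleteOneBlock() { blockCount--; }

        /** Accept a delete only when the counter is exactly zero, so a
         *  negative counter is rejected just like a positive one. */
        boolean acceptDelete() { return blockCount == 0; }
    }

    /** Hypothetical SCM view: container ID mapped to lifecycle state. */
    static final class Scm {
        private final Map<Long, State> states = new HashMap<>();
        private final boolean recoverOnReject;

        Scm(boolean recoverOnReject) { this.recoverOnReject = recoverOnReject; }

        void track(long id) { states.put(id, State.CLOSED); }

        State stateOf(long id) { return states.get(id); }

        /** Pre-fix behaviour as described: a reported zero count moves a
         *  CLOSED container to DELETING. */
        void onHeartbeat(long id, long reportedBlockCount) {
            if (states.get(id) == State.CLOSED && reportedBlockCount == 0) {
                states.put(id, State.DELETING);
            }
        }

        /** Send a delete command; with recoverOnReject set, a rejection
         *  moves the container back to CLOSED instead of leaving it stuck. */
        void sendDelete(long id, Replica replica) {
            if (states.get(id) != State.DELETING) {
                return; // delete commands are only sent for DELETING containers
            }
            if (replica.acceptDelete()) {
                states.put(id, State.DELETED);
            } else if (recoverOnReject) {
                states.put(id, State.CLOSED);
            }
        }
    }

    /** Replays the race from the description and returns the final SCM state. */
    static State simulate(boolean recoverOnReject) {
        long id = 1L;
        Scm scm = new Scm(recoverOnReject);
        scm.track(id);
        Replica replica = new Replica(0);           // undercounted: data may remain
        scm.onHeartbeat(id, replica.blockCount());  // SCM: CLOSED -> DELETING
        replica.deleteOneBlock();                   // counter becomes -1
        scm.sendDelete(id, replica);                // rejected by the datanode
        scm.sendDelete(id, replica);                // a resend changes nothing further
        return scm.stateOf(id);
    }

    public static void main(String[] args) {
        System.out.println("without recovery: " + simulate(false));
        System.out.println("with recovery:    " + simulate(true));
    }
}
```

Run as a single file, the model leaves the container in DELETING when the 
recovery flag is off and returns it to CLOSED when the flag is on. Moving the 
container back to CLOSED matters because, per the description, DELETING 
containers are not processed by the replication manager.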



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

