[jira] [Commented] (HDDS-7666) EC: "Missing" EC containers with some remaining replicas may block decommissioning

Stephen O'Donnell (Jira) Mon, 19 Dec 2022 04:48:08 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-7666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649315#comment-17649315
 ]


Stephen O'Donnell commented on HDDS-7666:
-----------------------------------------

There are 3 parts to fixing this problem. 

In ECReplicationCheckHandler, we deliberately skip adding "unrecoverable" 
containers to the under-replicated queue, as we previously believed there was 
no point in adding them: they cannot be recovered anyway. However, this 
decommission issue is specific to EC, so we should allow the container onto 
the under-replicated queue if it has decommissioning or maintenance indexes.
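
A minimal sketch of that queueing decision, assuming the handler already knows 
whether the under-replicated container is recoverable and which replica 
indexes sit on decommissioning or maintenance nodes. The class and method 
names are hypothetical and are not taken from the Ozone code base.

```java
import java.util.Set;

// Hypothetical illustration only; not the real ECReplicationCheckHandler.
final class UnrecoverableQueueingSketch {

    // Decides whether an under-replicated EC container should be placed on
    // the under-replicated queue.
    static boolean shouldQueue(boolean recoverable,
                               Set<Integer> decommissioningIndexes,
                               Set<Integer> maintenanceIndexes) {
        if (recoverable) {
            // Recoverable containers are queued as before.
            return true;
        }
        // Unrecoverable containers are queued only when some replica index
        // still has to be moved off a decommissioning or maintenance node.
        return !decommissioningIndexes.isEmpty()
            || !maintenanceIndexes.isEmpty();
    }
}
```

With this rule, an unrecoverable container whose remaining replicas are all on 
IN_SERVICE nodes is still skipped, which keeps the previous behaviour for the 
case where there is nothing useful to do.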

In ECUnderReplicationHandler, we need to check that the decommissioning 
indexes are copied correctly, even if the container is otherwise 
unrecoverable. A quick scan of the code suggests that this works, but we need 
to add a test to be sure.
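
To make the expected behaviour concrete, here is a rough sketch of which 
indexes such a test would expect to be copied. It assumes the replica indexes 
have already been split into those on IN_SERVICE nodes and those on 
decommissioning nodes; the helper name is hypothetical and this is not the 
handler's actual logic.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the real ECUnderReplicationHandler.
final class DecommissionCopySketch {

    // Returns the replica indexes that exist on decommissioning nodes but
    // have no copy on an IN_SERVICE node, and so still need to be copied.
    static Set<Integer> indexesToCopy(Set<Integer> inServiceIndexes,
                                      Set<Integer> decommissioningIndexes) {
        Set<Integer> pending = new TreeSet<>(decommissioningIndexes);
        pending.removeAll(inServiceIndexes);
        return pending;
    }
}
```

Note that the result does not depend on whether enough indexes remain to 
reconstruct the container, which is the property the test should confirm.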

In DatanodeAdminMonitorImpl, inside the method checkContainersReplicatedOnNode, 
we use a call to ECContainerReplicaCount.isSufficientlyReplicated() to decide 
whether the container is adequately replicated. Even if we address the first 
two parts above, this is still a problem, as the container is unrecoverable 
and so will never be sufficiently replicated. For EC containers in the 
decommission monitor, perhaps we need a different check, i.e. that the replica 
on the host being checked is also available on another IN_SERVICE host. From a 
decommission point of view, we don't care whether the entire EC container is 
sufficiently replicated - we just care that the replica on the current host 
has a copy elsewhere.
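
The alternative check could look roughly like the sketch below. It models 
replicas with a small stand-in class rather than the real Ozone types, so 
every name here (Replica, NodeState, hasInServiceCopy) is an assumption made 
for illustration.

```java
import java.util.List;

// Hypothetical illustration only; not the real DatanodeAdminMonitorImpl.
final class DecommissionReplicaCheckSketch {

    enum NodeState { IN_SERVICE, DECOMMISSIONING, IN_MAINTENANCE }

    // Stand-in for one EC container replica: its host, EC index and the
    // operational state of that host.
    static final class Replica {
        final String host;
        final int index;
        final NodeState state;

        Replica(String host, int index, NodeState state) {
            this.host = host;
            this.index = index;
            this.state = state;
        }
    }

    // True if every replica index held by `host` also exists on a different
    // host that is IN_SERVICE. Whether the container as a whole can be
    // reconstructed is deliberately ignored. If `host` holds no replica of
    // the container, the result is trivially true.
    static boolean hasInServiceCopy(List<Replica> replicas, String host) {
        for (Replica mine : replicas) {
            if (!mine.host.equals(host)) {
                continue;
            }
            boolean copied = false;
            for (Replica other : replicas) {
                if (!other.host.equals(host)
                        && other.index == mine.index
                        && other.state == NodeState.IN_SERVICE) {
                    copied = true;
                    break;
                }
            }
            if (!copied) {
                return false;
            }
        }
        return true;
    }
}
```

For example, if only indexes 1 and 2 of a container remain and index 1 is on 
the decommissioning host, the check fails until index 1 has been copied to an 
IN_SERVICE host, and then passes even though the container itself is still 
unrecoverable.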

> EC: "Missing" EC containers with some remaining replicas may block 
> decommissioning
> ----------------------------------------------------------------------------------
>
>                 Key: HDDS-7666
>                 URL: https://issues.apache.org/jira/browse/HDDS-7666
>             Project: Apache Ozone
>          Issue Type: Sub-task
>    Affects Versions: 1.3.0
>            Reporter: Ethan Rose
>            Priority: Major
>
> In EC, a container is considered missing and under replicated if it has lost 
> enough replicas that offline reconstruction is not possible. If any of the 
> remaining replicas for this container are on a datanode that is being 
> decommissioned, the decommissioning will not proceed. All the containers on 
> that node must be restored to proper replication for it to finish 
> decommissioning, but the code will not copy the replica of the missing 
> container to a different node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

