[ https://issues.apache.org/jira/browse/HDDS-7666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649315#comment-17649315 ]
Stephen O'Donnell commented on HDDS-7666:
-----------------------------------------
There are 3 parts to fixing this problem:
1. In ECReplicationCheckHandler, we deliberately skip adding "unrecoverable"
   containers to the under-replicated queue, as we previously believed there
   was no point in adding them: they cannot be recovered anyway. However, this
   decommission issue is specific to EC, so we should allow the container to
   make it onto the under-replicated queue if it has decommissioning or
   maintenance indexes.
2. In ECUnderReplicationHandler, we need to check that the decommissioning
   indexes are copied OK, even if the container is otherwise unrecoverable. A
   quick scan of the code suggests that is OK, but we need to add a test to be
   sure.
3. In DatanodeAdminMonitorImpl, inside the method
   checkContainersReplicatedOnNode, we use a call to
   ECContainerReplicaCount.isSufficientlyReplicated() to decide whether the
   container is replicated OK or not. Even if we address 1 and 2 above, this
   is still a problem, as the container is unrecoverable. For EC containers in
   the decommission monitor, perhaps we need a different check, i.e. that the
   replica on the host being checked is also available on another IN_SERVICE
   host. From a decommission point of view, we don't care whether the entire
   EC container is sufficiently replicated or not; we just care that the
   replica on the current host has a copy elsewhere.
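To make the proposed checks concrete, here is a rough sketch of the logic in plain Java. It is not Ozone code: the class DecommissionReplicaCheck, its Replica and NodeState types, and both method names are hypothetical stand-ins, and the node states are simplified. The first method illustrates the idea in part 1 (does the container have any index on a host that is not IN_SERVICE?), and the second illustrates the per-host check suggested in part 3.

```java
import java.util.List;

/** Hypothetical sketch only; none of these types are real Ozone classes. */
final class DecommissionReplicaCheck {

  /** Simplified, illustrative datanode operational states. */
  enum NodeState { IN_SERVICE, DECOMMISSIONING, ENTERING_MAINTENANCE }

  /** One replica of an EC container: the index it holds and where it lives. */
  static final class Replica {
    final int replicaIndex;
    final String host;
    final NodeState hostState;

    Replica(int replicaIndex, String host, NodeState hostState) {
      this.replicaIndex = replicaIndex;
      this.host = host;
      this.hostState = hostState;
    }
  }

  private DecommissionReplicaCheck() { }

  /**
   * Part 1 idea: an otherwise unrecoverable container could still be queued
   * as under-replicated if any of its replicas sits on a host that is
   * decommissioning or entering maintenance.
   */
  static boolean hasOfflineIndexes(List<Replica> replicas) {
    for (Replica r : replicas) {
      if (r.hostState != NodeState.IN_SERVICE) {
        return true;
      }
    }
    return false;
  }

  /**
   * Part 3 idea: returns true if every replica index held on the given host
   * also exists on a different host that is IN_SERVICE. The overall health
   * of the EC container is deliberately ignored.
   */
  static boolean isReplicatedElsewhere(String host, List<Replica> replicas) {
    for (Replica local : replicas) {
      if (!local.host.equals(host)) {
        continue; // only indexes held on the host being checked matter
      }
      boolean copied = false;
      for (Replica other : replicas) {
        if (other.replicaIndex == local.replicaIndex
            && !other.host.equals(host)
            && other.hostState == NodeState.IN_SERVICE) {
          copied = true;
          break;
        }
      }
      if (!copied) {
        return false; // this index exists only on the host being checked
      }
    }
    return true;
  }
}
```

With a check of this shape, a container that has lost too many indexes to be reconstructed would stop blocking decommission as soon as the indexes held on the decommissioning host have been copied to IN_SERVICE hosts.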
> EC: "Missing" EC containers with some remaining replicas may block
> decommissioning
> ----------------------------------------------------------------------------------
>
> Key: HDDS-7666
> URL: https://issues.apache.org/jira/browse/HDDS-7666
> Project: Apache Ozone
> Issue Type: Sub-task
> Affects Versions: 1.3.0
> Reporter: Ethan Rose
> Priority: Major
>
> In EC, a container is considered missing and under replicated if it has lost
> enough replicas that offline reconstruction is not possible. If any of the
> remaining replicas for this container are on a datanode that is being
> decommissioned, the decommissioning will not proceed. All the containers on
> that node must be restored to proper replication for it to finish
> decommissioning, but the code will not copy the replica of the missing
> container to a different node.