[jira] [Resolved] (HDDS-9383) ReplicationManager: Unhealthy replicas of a sufficiently replicated container can block decommissioning

Siddhant Sangwan (Jira) Tue, 19 Dec 2023 23:25:06 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Siddhant Sangwan resolved HDDS-9383.
------------------------------------
    Target Version/s: 1.4.0, 1.5.0  (was: 1.5.0)
          Resolution: Done

> ReplicationManager: Unhealthy replicas of a sufficiently replicated container 
> can block decommissioning
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-9383
>                 URL: https://issues.apache.org/jira/browse/HDDS-9383
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>            Reporter: Siddhant Sangwan
>            Assignee: Siddhant Sangwan
>            Priority: Critical
>
> A mix of quasi-closed and unhealthy replicas blocks decommissioning even if 
> the container is sufficiently replicated.
> a. This happens when only some of the replicas hit an error during a write.
> b. It can be fixed by removing this check:
> if (!replicaSet.isHealthy()) {
>           if (LOG.isDebugEnabled()) {
>             unhealthyIDs.add(cid);
>           }
>           if (unhealthy < CONTAINER_DETAILS_LOGGING_LIMIT
> However, simply removing that check is not a complete solution. We also need 
> to try to preserve any UNHEALTHY replicas that have the greatest sequence ID. 
> https://issues.apache.org/jira/browse/HDDS-9321 handles the Legacy 
> Replication Manager side by preserving such UNHEALTHY replicas. It 
> introduces an API, {{getVulnerableUnhealthyReplicas}}, in 
> {{RatisContainerReplicaCount}}. In the new RM, we need to see whether it is 
> possible to leverage this API. We will also require some changes on the 
> decommissioning side, as in https://issues.apache.org/jira/browse/HDDS-9354.
> The approach described above fixes this issue only indirectly, by moving 
> replicas around. A more complete, long-term fix would be a reconciliation 
> job that repairs these UNHEALTHY replicas on the datanode, possibly by 
> merging blocks from different replicas to produce a healthy replica. 
> We should also investigate how a quasi-closed container ends up with some 
> unhealthy replicas, and fix the root cause.
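
For illustration only, here is a minimal sketch of the idea of preserving UNHEALTHY replicas that have the greatest sequence ID. It is not the Ozone implementation: the class VulnerableReplicaSketch, its Replica and State types, and the method vulnerableUnhealthy are all hypothetical, and the selection rule (an UNHEALTHY replica counts as vulnerable when its sequence ID is strictly greater than that of every non-UNHEALTHY replica) is a simplifying assumption. The real {{getVulnerableUnhealthyReplicas}} in {{RatisContainerReplicaCount}} may apply further conditions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; none of these types come from Ozone.
class VulnerableReplicaSketch {

  enum State { QUASI_CLOSED, CLOSED, UNHEALTHY }

  static final class Replica {
    final String datanode;
    final State state;
    final long sequenceId;

    Replica(String datanode, State state, long sequenceId) {
      this.datanode = datanode;
      this.state = state;
      this.sequenceId = sequenceId;
    }
  }

  /**
   * Assumed rule: an UNHEALTHY replica is worth preserving when its
   * sequence ID is strictly greater than the highest sequence ID found
   * among the non-UNHEALTHY replicas, because it may hold data that no
   * other replica has.
   */
  static List<Replica> vulnerableUnhealthy(List<Replica> replicas) {
    // Highest sequence ID among replicas that are not UNHEALTHY.
    // Stays at -1 when every replica is UNHEALTHY.
    long maxHealthySeq = -1L;
    for (Replica r : replicas) {
      if (r.state != State.UNHEALTHY) {
        maxHealthySeq = Math.max(maxHealthySeq, r.sequenceId);
      }
    }

    // Collect UNHEALTHY replicas that are ahead of all the others.
    List<Replica> result = new ArrayList<>();
    for (Replica r : replicas) {
      if (r.state == State.UNHEALTHY && r.sequenceId > maxHealthySeq) {
        result.add(r);
      }
    }
    return result;
  }
}
```

Under this assumption, a decommission check could treat the container as sufficiently replicated when this list is empty for the replicas on the decommissioning node, and otherwise schedule a copy of the listed replicas before letting the node go offline.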



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
