[jira] [Updated] (HDDS-9383) ReplicationManager: Unhealthy replicas of a sufficiently replicated container can block decommissioning

Janus Chow (Jira) Sat, 16 Dec 2023 15:45:13 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Janus Chow updated HDDS-9383:
-----------------------------
    Target Version/s: 1.5.0  (was: 1.4.0)

I am managing the 1.4.0 release and we currently have more than 500 issues 
targeted for 1.4.0. I am moving the target field to 1.5.0.
       
If you are actively working on this jira and believe this should be targeted 
for the 1.4.0 release, please reach out to me via Apache email or Slack.

> ReplicationManager: Unhealthy replicas of a sufficiently replicated container 
> can block decommissioning
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-9383
>                 URL: https://issues.apache.org/jira/browse/HDDS-9383
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>            Reporter: Siddhant Sangwan
>            Assignee: Siddhant Sangwan
>            Priority: Critical
>
> A mix of quasi-closed and unhealthy replicas blocks decommissioning even if 
> the container is sufficiently replicated.
> a. This happens when only some of the replicas hit the error during write.
> b. It can be fixed by removing this check:
> if (!replicaSet.isHealthy()) {
>           if (LOG.isDebugEnabled()) {
>             unhealthyIDs.add(cid);
>           }
>           if (unhealthy < CONTAINER_DETAILS_LOGGING_LIMIT
> However, simply removing that check is not a complete solution. We need to 
> try to preserve any UNHEALTHY replicas that have the greatest Sequence ID. 
> https://issues.apache.org/jira/browse/HDDS-9321 takes care of the Legacy 
> Replication Manager side of things to preserve such UNHEALTHY replicas. It 
> introduces an API, {{getVulnerableUnhealthyReplicas}}, in 
> {{RatisContainerReplicaCount}}. In the new RM, we need to see if it's 
> possible to leverage this API. We will also require some decommissioning-side 
> changes, like in https://issues.apache.org/jira/browse/HDDS-9354.
> The approach described above tries to fix this issue indirectly by moving 
> replicas around. A more complete, long-term fix would be a 
> reconciliation job that fixes these UNHEALTHY replicas on the datanode, 
> possibly by merging blocks from different replicas to get a healthy replica. 
> We should also try to investigate how a quasi-closed container is getting 
> some unhealthy replicas and fix the root cause.
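
The idea in the description of preserving UNHEALTHY replicas that carry the greatest sequence ID can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the Ozone implementation: the Replica class, its fields, and the vulnerableUnhealthy method are hypothetical names made up for this example, and the rule used here (an UNHEALTHY replica on a decommissioning node whose sequence ID exceeds that of every non-UNHEALTHY replica) is only one plausible reading of "vulnerable". It does not claim to match what getVulnerableUnhealthyReplicas in RatisContainerReplicaCount actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: these types are hypothetical stand-ins,
// not the Ozone ContainerReplica or RatisContainerReplicaCount classes.
class VulnerableReplicaSketch {

    enum State { QUASI_CLOSED, UNHEALTHY }

    static final class Replica {
        final String datanode;
        final State state;
        final long sequenceId;
        final boolean onDecommissioningNode;

        Replica(String datanode, State state, long sequenceId,
                boolean onDecommissioningNode) {
            this.datanode = datanode;
            this.state = state;
            this.sequenceId = sequenceId;
            this.onDecommissioningNode = onDecommissioningNode;
        }
    }

    // Assumed rule: an UNHEALTHY replica is worth preserving if it sits on a
    // decommissioning node and its sequence ID is greater than that of every
    // non-UNHEALTHY replica, so losing it could lose the most recent data.
    static List<Replica> vulnerableUnhealthy(List<Replica> replicas) {
        // If there are no non-UNHEALTHY replicas, this stays at MIN_VALUE and
        // every UNHEALTHY replica on a decommissioning node qualifies.
        long maxHealthySeq = Long.MIN_VALUE;
        for (Replica r : replicas) {
            if (r.state != State.UNHEALTHY) {
                maxHealthySeq = Math.max(maxHealthySeq, r.sequenceId);
            }
        }
        List<Replica> vulnerable = new ArrayList<>();
        for (Replica r : replicas) {
            if (r.state == State.UNHEALTHY
                    && r.onDecommissioningNode
                    && r.sequenceId > maxHealthySeq) {
                vulnerable.add(r);
            }
        }
        return vulnerable;
    }

    public static void main(String[] args) {
        List<Replica> replicas = Arrays.asList(
            new Replica("dn1", State.QUASI_CLOSED, 10L, false),
            new Replica("dn2", State.QUASI_CLOSED, 10L, false),
            new Replica("dn3", State.QUASI_CLOSED, 10L, false),
            new Replica("dn4", State.UNHEALTHY, 12L, true),  // newer than any healthy copy
            new Replica("dn5", State.UNHEALTHY, 8L, true));  // older, safe to leave behind
        for (Replica r : vulnerableUnhealthy(replicas)) {
            System.out.println("replicate before decommission: " + r.datanode);
        }
    }
}
```

In this sketch the container has three quasi-closed replicas and so counts as sufficiently replicated, yet the UNHEALTHY replica on dn4 still holds a higher sequence ID than any of them. A decommissioning check would copy such a replica elsewhere first instead of either blocking indefinitely or dropping it.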



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

