[ https://issues.apache.org/jira/browse/HDDS-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddhant Sangwan resolved HDDS-9383.
------------------------------------
Target Version/s: 1.4.0, 1.5.0 (was: 1.5.0)
Resolution: Done
> ReplicationManager: Unhealthy replicas of a sufficiently replicated container
> can block decommissioning
> -------------------------------------------------------------------------------------------------------
>
> Key: HDDS-9383
> URL: https://issues.apache.org/jira/browse/HDDS-9383
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: SCM
> Reporter: Siddhant Sangwan
> Assignee: Siddhant Sangwan
> Priority: Critical
>
> A mix of quasi-closed and unhealthy replicas blocks decommissioning even when
> the container is sufficiently replicated.
> a. This happens when only some of the replicas hit an error during a write.
> b. It can be fixed by removing this check:
>   if (!replicaSet.isHealthy()) {
>     if (LOG.isDebugEnabled()) {
>       unhealthyIDs.add(cid);
>     }
>     if (unhealthy < CONTAINER_DETAILS_LOGGING_LIMIT
> However, simply removing that check is not a complete solution. We also need
> to try to preserve any UNHEALTHY replicas that have the greatest sequence ID.
> https://issues.apache.org/jira/browse/HDDS-9321 handles the Legacy
> Replication Manager side of preserving such UNHEALTHY replicas. It
> introduces an API, {{getVulnerableUnhealthyReplicas}}, in
> {{RatisContainerReplicaCount}}. In the new RM, we need to check whether this
> API can be reused. Some decommissioning-side changes will also be required,
> as in https://issues.apache.org/jira/browse/HDDS-9354.
> The approach described above fixes this issue only indirectly, by moving
> replicas around. A more complete, long-term fix would be a reconciliation job
> that repairs these UNHEALTHY replicas on the datanode, possibly by merging
> blocks from different replicas into a healthy replica.
> We should also investigate how a quasi-closed container ends up with
> unhealthy replicas, and fix the root cause.
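
The idea of preserving UNHEALTHY replicas with the greatest sequence ID can be illustrated with a minimal sketch. It is not Ozone's implementation: the class VulnerableReplicaSketch, the Replica and State types, and the method vulnerableUnhealthyReplicas are hypothetical stand-ins, and the selection rule (an UNHEALTHY replica counts as worth preserving when its sequence ID is greater than that of every non-UNHEALTHY replica) is an assumed reading of the description above, not the confirmed behaviour of getVulnerableUnhealthyReplicas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model; these are not Ozone's actual classes. */
public class VulnerableReplicaSketch {

  enum State { QUASI_CLOSED, UNHEALTHY }

  /** Minimal stand-in for a container replica. */
  static final class Replica {
    final String datanode;
    final State state;
    final long sequenceId;

    Replica(String datanode, State state, long sequenceId) {
      this.datanode = datanode;
      this.state = state;
      this.sequenceId = sequenceId;
    }
  }

  /**
   * Assumed rule: returns the UNHEALTHY replicas whose sequence ID is
   * strictly greater than that of every non-UNHEALTHY replica.
   */
  static List<Replica> vulnerableUnhealthyReplicas(List<Replica> replicas) {
    // Highest sequence ID seen among the replicas that are not UNHEALTHY.
    long maxHealthySeq = Long.MIN_VALUE;
    for (Replica r : replicas) {
      if (r.state != State.UNHEALTHY) {
        maxHealthySeq = Math.max(maxHealthySeq, r.sequenceId);
      }
    }
    // Keep only UNHEALTHY replicas that are ahead of all healthy ones.
    List<Replica> result = new ArrayList<>();
    for (Replica r : replicas) {
      if (r.state == State.UNHEALTHY && r.sequenceId > maxHealthySeq) {
        result.add(r);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Replica> replicas = Arrays.asList(
        new Replica("dn1", State.QUASI_CLOSED, 10),
        new Replica("dn2", State.QUASI_CLOSED, 10),
        new Replica("dn3", State.QUASI_CLOSED, 10),
        new Replica("dn4", State.UNHEALTHY, 12),
        new Replica("dn5", State.UNHEALTHY, 8));
    // Prints "dn4 seq=12": only dn4 is ahead of the quasi-closed replicas.
    for (Replica r : vulnerableUnhealthyReplicas(replicas)) {
      System.out.println(r.datanode + " seq=" + r.sequenceId);
    }
  }
}
```

In this example the container has three quasi-closed replicas, so it is sufficiently replicated, yet dn4 holds data that no healthy replica has. A decommission check that simply ignored unhealthy replicas would let that data be lost if dn4 were the node being removed.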