Stephen O'Donnell created HDDS-8536:
---------------------------------------
Summary: ReplicationManager: Unhealthy replicas could block Ratis
containers being recovered
Key: HDDS-8536
URL: https://issues.apache.org/jira/browse/HDDS-8536
Project: Apache Ozone
Issue Type: Sub-task
Components: SCM
Reporter: Stephen O'Donnell
In a similar way to HDDS-8535, if the cluster is small, say 4 nodes, and a Ratis
container has 2 unhealthy replicas, RM will currently create one new replica,
leaving all 4 nodes in use with 2 healthy and 2 unhealthy replicas. No spare
node is then left to host another healthy replica, and as unhealthy replicas
are only removed after all over and under replication has been resolved, the
container will remain stuck in this state.
To avoid this, if there are insufficient spare nodes and some unhealthy
replicas are present, the under replication handler may need to call into the
unhealthy handler to remove some of the unhealthy replicas so that progress
can be made.
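
The decision described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the ReplicationManager implementation: the class, enum and method names are all hypothetical, each replica is assumed to sit on a distinct node, and a replication factor of 3 is assumed for the example scenario.

```java
/** Hypothetical sketch; none of these names come from the Ozone code base. */
final class UnderReplicationSketch {

    /** Possible next steps for an under-replicated container (illustrative only). */
    enum Action { NONE, REPLICATE, REMOVE_UNHEALTHY_FIRST, STUCK }

    /**
     * Picks the next step for one container.
     * Assumes every replica, healthy or unhealthy, occupies its own node.
     */
    static Action nextAction(int healthy, int unhealthy, int clusterSize, int replicationFactor) {
        // Nodes that hold no replica of this container at all.
        int spareNodes = clusterSize - (healthy + unhealthy);

        if (healthy >= replicationFactor) {
            return Action.NONE;                   // not under replicated
        }
        if (spareNodes > 0) {
            return Action.REPLICATE;              // a target node is available
        }
        if (unhealthy > 0) {
            return Action.REMOVE_UNHEALTHY_FIRST; // free a node so replication can continue
        }
        return Action.STUCK;                      // no spare node and nothing to remove
    }
}
```

With 4 nodes and an assumed factor of 3, `nextAction(1, 2, 4, 3)` gives `REPLICATE`, `nextAction(2, 2, 4, 3)` gives `REMOVE_UNHEALTHY_FIRST` rather than leaving the container stuck, and after one unhealthy replica is removed `nextAction(2, 1, 4, 3)` gives `REPLICATE` again.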