Stephen O'Donnell created HDDS-8536:
---------------------------------------

             Summary: ReplicationManager: Unhealthy replicas could block Ratis 
containers being recovered
                 Key: HDDS-8536
                 URL: https://issues.apache.org/jira/browse/HDDS-8536
             Project: Apache Ozone
          Issue Type: Sub-task
          Components: SCM
            Reporter: Stephen O'Donnell


In a similar way to HDDS-8535, if the cluster is small, say 4 nodes, and a Ratis 
container has 2 unhealthy replicas, the ReplicationManager (RM) will currently 
recover one new replica, leaving all 4 nodes used, with 2 healthy and 2 
unhealthy replicas. As unhealthy replicas are only removed after all over- and 
under-replication has been resolved, the container will remain stuck in this 
state.

To avoid this, if there are insufficient spare nodes and the container also has 
some unhealthy replicas, the under-replication handler may need to call into 
the unhealthy handler to remove some of the unhealthy replicas so that progress 
can be made.
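
The proposed decision can be sketched roughly as follows. This is only an 
illustration, assuming a Ratis replication factor of 3 and one replica per 
node; the class, enum and method names are hypothetical and are not taken from 
the Ozone code base. Which unhealthy replica is safe to delete is deliberately 
left out of the sketch.

```java
// Hypothetical sketch only: none of these names exist in Apache Ozone.
final class UnderReplicationSketch {

  enum Action { NONE, ADD_REPLICA, DELETE_UNHEALTHY_REPLICA }

  /**
   * Decide the next step for a single Ratis container.
   *
   * @param replicationFactor required number of healthy replicas (assumed 3)
   * @param healthy           current healthy replica count
   * @param unhealthy         current unhealthy replica count
   * @param clusterNodes      total nodes available to hold replicas
   */
  static Action nextAction(int replicationFactor, int healthy,
                           int unhealthy, int clusterNodes) {
    if (healthy >= replicationFactor) {
      // Not under-replicated; unhealthy cleanup can happen as it does today.
      return Action.NONE;
    }
    // Assumes each replica, healthy or not, occupies its own node.
    int spareNodes = clusterNodes - healthy - unhealthy;
    if (spareNodes > 0) {
      return Action.ADD_REPLICA;
    }
    if (unhealthy > 0) {
      // No spare node: free one by removing an unhealthy replica first.
      return Action.DELETE_UNHEALTHY_REPLICA;
    }
    // No spare nodes and nothing to remove; wait for more nodes.
    return Action.NONE;
  }
}
```

With 4 nodes, starting from 1 healthy and 2 unhealthy replicas, this sketch 
would add a replica, then delete one unhealthy replica once all 4 nodes are 
in use, and then add the third healthy replica on the freed node.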



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

