Stephen O'Donnell created HDDS-8535:
---------------------------------------

             Summary: ReplicationManager: Unhealthy containers could block EC 
recovery in small clusters
                 Key: HDDS-8535
                 URL: https://issues.apache.org/jira/browse/HDDS-8535
             Project: Apache Ozone
          Issue Type: Sub-task
          Components: SCM
            Reporter: Stephen O'Donnell


With EC containers, consider a small cluster of, say, 6 nodes using EC-3-2, where 
each container requires 5 nodes, one per replica. If 2 replicas become unhealthy, 
reconstruction is required to recover both, but there is only 
1 spare node.

This means only one replica gets recovered, leaving 4 "good" replicas and 2 
"unhealthy" ones. The container remains stuck in this state because unhealthy 
replicas are only removed once the container has no over- or 
under-replication.
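
The arithmetic behind the stuck state can be sketched as below. This is only an illustration, not Ozone code: the function name and the returned fields are hypothetical, and it assumes that every replica (healthy or unhealthy) occupies its own node and that a reconstructed replica must land on a node holding no other replica of the same container.

```python
def simulate_recovery(total_nodes, data, parity, unhealthy):
    """Hypothetical model: how far can reconstruction get while the
    unhealthy replicas still occupy their nodes?"""
    required = data + parity            # EC-3-2 needs 5 replicas
    healthy = required - unhealthy      # replicas that are still usable
    # Assumption: healthy and unhealthy replicas each hold a distinct node.
    occupied = healthy + unhealthy
    spare = total_nodes - occupied      # nodes free to receive a new replica
    # Each unhealthy replica leaves one index to rebuild; each rebuild
    # consumes one spare node.
    recovered = min(unhealthy, spare)
    return {
        "spare_nodes": spare,
        "healthy": healthy + recovered,
        "unhealthy": unhealthy,
        "still_missing": unhealthy - recovered,
    }


if __name__ == "__main__":
    # 6-node cluster, EC-3-2, 2 unhealthy replicas.
    print(simulate_recovery(total_nodes=6, data=3, parity=2, unhealthy=2))
```

With the numbers from this issue, the model ends with 4 healthy replicas, 2 unhealthy replicas and 1 index still missing, with no node left to rebuild it on.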

A similar problem was resolved previously, where an EC container with both over- 
and under-replication could get stuck because under-replication handling could 
not proceed due to insufficient spare nodes. In that case, the solution was to 
check for this case and call the over-replication handler to clear up the 
excess replicas. A similar solution is required here: remove some unhealthy 
replicas to allow progress to be made.
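
One possible shape for such a fallback is sketched below. It is a toy model rather than the ReplicationManager implementation: `ContainerState`, `next_action`, `apply` and `run` are hypothetical names, and the sketch deliberately ignores which unhealthy replica is safe to delete, which a real fix would have to decide.

```python
from dataclasses import dataclass


@dataclass
class ContainerState:
    """Hypothetical summary of one EC container's replica situation."""
    missing_indexes: int      # indexes with no healthy replica
    unhealthy_replicas: int   # unhealthy replicas still occupying nodes
    spare_nodes: int          # nodes holding no replica of this container


def next_action(state):
    """Choose the next step; the fallback frees a node when stuck."""
    if state.missing_indexes == 0:
        # Existing behaviour: unhealthy replicas go once replication is fine.
        return "delete_unhealthy" if state.unhealthy_replicas > 0 else "none"
    if state.spare_nodes > 0:
        return "reconstruct"
    if state.unhealthy_replicas > 0:
        # Proposed fallback: no target node, so remove an unhealthy replica.
        return "delete_unhealthy"
    return "wait"


def apply(state, action):
    """Update the toy state to reflect one completed action."""
    if action == "reconstruct":
        state.missing_indexes -= 1
        state.spare_nodes -= 1
    elif action == "delete_unhealthy":
        state.unhealthy_replicas -= 1
        state.spare_nodes += 1


def run(state, max_steps=20):
    """Apply actions until nothing is left to do; return the actions taken."""
    actions = []
    for _ in range(max_steps):
        action = next_action(state)
        if action in ("none", "wait"):
            break
        actions.append(action)
        apply(state, action)
    return actions


if __name__ == "__main__":
    # The scenario from this issue: 2 indexes to rebuild, 1 spare node.
    print(run(ContainerState(missing_indexes=2, unhealthy_replicas=2,
                             spare_nodes=1)))
```

In this model the container alternates between reconstructing one index and deleting one unhealthy replica until it is fully healthy, instead of stalling after the first reconstruction.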



