Stephen O'Donnell created HDDS-8535:
---------------------------------------
Summary: ReplicationManager: Unhealthy containers could block EC
recovery in small clusters
Key: HDDS-8535
URL: https://issues.apache.org/jira/browse/HDDS-8535
Project: Apache Ozone
Issue Type: Sub-task
Components: SCM
Reporter: Stephen O'Donnell
With EC containers, consider a small cluster of, say, 6 nodes using EC-3-2,
where a container requires 5 nodes (one per replica). If 2 replicas become
unhealthy, reconstruction is required to recover both of them, but there is
only 1 spare node.
This means one will be recovered, leaving 4 "good" replicas and 2 "unhealthy"
ones. The container will then remain stuck in this state, because unhealthy
replicas are only removed once the container has no over- or
under-replication.
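
The arithmetic behind the stuck state can be sketched as follows. This is a
toy model, not Ozone code: the function name and parameters are hypothetical,
and it assumes a rebuilt replica can only be placed on a node that holds no
replica (healthy or unhealthy) of the same container.

```python
def reconstruct_round(total_nodes, healthy, unhealthy, required):
    """One reconstruction pass over a single EC container (toy model).

    Assumption: each rebuilt replica needs a node that currently holds
    no replica of this container, healthy or unhealthy.
    """
    spare = total_nodes - (healthy + unhealthy)  # nodes free for new replicas
    missing = required - healthy                 # replicas still to rebuild
    rebuilt = max(0, min(spare, missing))
    return healthy + rebuilt, unhealthy


# 6 nodes, EC-3-2 (5 replicas required), 2 of the 5 replicas unhealthy.
state = reconstruct_round(total_nodes=6, healthy=3, unhealthy=2, required=5)
print(state)  # (4, 2)

# A second pass finds no spare node, so nothing changes.
again = reconstruct_round(6, *state, required=5)
print(again)  # (4, 2)
```

In this model the container stays at 4 healthy and 2 unhealthy replicas on
every later pass, which matches the stuck state described above.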
A similar problem was resolved previously: an EC container with both over-
and under-replication could hit the same issue, where under-replication
handling cannot proceed due to insufficient spare nodes. In that case, the
solution was to check for this condition and call the over-replication handler
to clear up the excess replicas. A similar solution is required here: remove
some unhealthy replicas to free up nodes so that progress can be made.
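
One possible shape of that fix, as a minimal sketch only. It is not the
ReplicationManager implementation; the function name, parameters, and step
labels are hypothetical. It assumes that deleting an unhealthy replica frees
its node for a reconstructed replica, and that enough healthy replicas remain
to reconstruct from.

```python
def recover(total_nodes, healthy, unhealthy, required):
    """Toy recovery loop for a single EC container.

    Assumption: when under-replicated with no spare node, deleting one
    unhealthy replica frees a node for the next reconstruction.
    """
    steps = []
    while healthy < required:
        spare = total_nodes - (healthy + unhealthy)
        if spare > 0:
            healthy += 1                 # rebuild one replica on a spare node
            steps.append("reconstruct")
        elif unhealthy > 0:
            unhealthy -= 1               # free a node by removing an unhealthy replica
            steps.append("delete_unhealthy")
        else:
            steps.append("stuck")        # no spare node and nothing left to remove
            break
    return healthy, unhealthy, steps


result = recover(total_nodes=6, healthy=3, unhealthy=2, required=5)
print(result)  # (5, 1, ['reconstruct', 'delete_unhealthy', 'reconstruct'])
```

In this model the container ends with 5 healthy replicas and 1 unhealthy one.
The remaining unhealthy replica would then be eligible for the normal cleanup,
since the container is no longer under-replicated.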