[
https://issues.apache.org/jira/browse/HDDS-8535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephen O'Donnell resolved HDDS-8535.
-------------------------------------
Fix Version/s: 1.4.0
Resolution: Fixed
> ReplicationManager: Unhealthy containers could block EC recovery in small
> clusters
> ----------------------------------------------------------------------------------
>
> Key: HDDS-8535
> URL: https://issues.apache.org/jira/browse/HDDS-8535
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: SCM
> Reporter: Stephen O'Donnell
> Assignee: Siddhant Sangwan
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.4.0
>
>
> With EC containers, in a small cluster of, say, 6 nodes using EC-3-2, a
> container requires 5 nodes. If 2 replicas become unhealthy, reconstruction
> is required to recover both of them, but there is only 1 spare node.
> This means one will be recovered, leaving 4 "good" replicas and 2
> "unhealthy" ones, and the container will remain stuck in this state because
> unhealthy replicas are only removed once the container has no over- or
> under-replication.
> A similar problem was resolved previously, where an EC container with both
> over- and under-replication could get stuck in the same way because the
> under-replication could not be fixed due to insufficient spare nodes. In
> that case, the solution was to detect the situation and call the
> over-replication handler to clear up the excess replicas. A similar solution
> is required here: remove some unhealthy replicas to allow progress to be
> made.
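
The arithmetic in the description can be sketched as a toy simulation. This is
not Ozone's ReplicationManager code: EcState, step and run are hypothetical
names, and the sketch assumes each replica (healthy or unhealthy) occupies its
own node and that one unhealthy replica is deleted per pass when
reconstruction is blocked.

```python
from dataclasses import dataclass


@dataclass
class EcState:
    """Hypothetical model of one EC container in a small cluster."""
    nodes: int      # datanodes in the cluster
    data: int       # EC data replicas (3 for EC-3-2)
    parity: int     # EC parity replicas (2 for EC-3-2)
    healthy: int    # healthy replicas currently present
    unhealthy: int  # unhealthy replicas still occupying nodes

    @property
    def required(self) -> int:
        return self.data + self.parity

    @property
    def spare_nodes(self) -> int:
        # Assumption: every replica, healthy or not, takes up one node.
        return self.nodes - self.healthy - self.unhealthy

    @property
    def missing(self) -> int:
        return max(0, self.required - self.healthy)


def step(state: EcState, delete_unhealthy_when_blocked: bool) -> str:
    """Run one simulated replication pass and name the action taken."""
    if state.missing == 0:
        # Fully replicated: unhealthy replicas can now be cleaned up.
        removed = state.unhealthy
        state.unhealthy = 0
        return "cleanup" if removed else "idle"
    if state.spare_nodes > 0:
        # Reconstruct as many missing replicas as there are free nodes.
        state.healthy += min(state.missing, state.spare_nodes)
        return "reconstruct"
    if delete_unhealthy_when_blocked and state.unhealthy > 0:
        # Proposed behaviour: free a node by removing an unhealthy replica.
        state.unhealthy -= 1
        return "delete_unhealthy"
    return "stuck"


def run(state: EcState, delete_unhealthy_when_blocked: bool,
        max_passes: int = 10) -> list:
    """Repeat passes until the container is idle or stuck."""
    actions = []
    for _ in range(max_passes):
        action = step(state, delete_unhealthy_when_blocked)
        actions.append(action)
        if action in ("idle", "stuck"):
            break
    return actions


# The scenario from the description: 6 nodes, EC-3-2, 2 unhealthy replicas.
before = EcState(nodes=6, data=3, parity=2, healthy=3, unhealthy=2)
print(run(before, delete_unhealthy_when_blocked=False))
# ['reconstruct', 'stuck']

after = EcState(nodes=6, data=3, parity=2, healthy=3, unhealthy=2)
print(run(after, delete_unhealthy_when_blocked=True))
# ['reconstruct', 'delete_unhealthy', 'reconstruct', 'cleanup', 'idle']
```

Without the extra step the model stops at 4 healthy and 2 unhealthy replicas
with no spare node. Removing a single unhealthy replica frees one node, which
is enough for the second reconstruction to proceed.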
--
This message was sent by Atlassian Jira
(v8.20.10#820010)