Xushaohong opened a new pull request, #3657: URL: https://github.com/apache/ozone/pull/3657

## What changes were proposed in this pull request?

**Background:**
The async write is still not robust enough: when the cluster load is too high, some containers can become unrecoverable (no healthy replicas). Currently, such an unrecoverable Ratis container goes through the following process:

- The DN marks the container as unhealthy and reports it to the SCM.
- The SCM then tries to close the container, and the container state becomes CLOSING.
- The DN won't close an unhealthy replica.
- The SCM RM will not send a close command to those unhealthy containers.

Hence, the unrecoverable container gets stuck in the CLOSING state.

After the admin recovers whatever data is still available in such containers, or simply abandons them, these containers should be closed on purpose. Under such circumstances, we should provide a configurable way to clean up these closed containers. Once the unhealthy container has been closed, an unrecoverable container with only unhealthy replicas can be deleted.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7099

## How was this patch tested?

UT and in production env
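
For illustration only, the cleanup condition described in the background (a closed container whose replicas are all unhealthy, gated by a config switch) could be sketched roughly as below. This is not the code from the patch: the class, enum, and parameter names are hypothetical, and the states are simplified stand-ins rather than the real Ozone types.

```java
import java.util.List;

/** Hypothetical sketch; none of these names are taken from the Ozone code base. */
final class UnhealthyContainerCleanupSketch {

    // Simplified stand-ins for the container and replica states mentioned above.
    enum ContainerState { OPEN, CLOSING, CLOSED }
    enum ReplicaState { HEALTHY, UNHEALTHY }

    private UnhealthyContainerCleanupSketch() { }

    /**
     * Returns true when a container may be cleaned up: the (assumed) config
     * switch is on, the container is CLOSED, and it has at least one replica,
     * all of which are UNHEALTHY.
     */
    static boolean isEligibleForCleanup(boolean cleanupEnabled,
                                        ContainerState state,
                                        List<ReplicaState> replicas) {
        // A container stuck in CLOSING is not eligible; it must be closed first.
        if (!cleanupEnabled || state != ContainerState.CLOSED) {
            return false;
        }
        // Assumption: with no replicas reported, this check makes no decision.
        if (replicas.isEmpty()) {
            return false;
        }
        // A single healthy replica means the container is still recoverable.
        return replicas.stream().allMatch(r -> r == ReplicaState.UNHEALTHY);
    }
}
```

Under these assumptions, a container stuck in CLOSING is never selected, and neither is a closed container that still has a healthy replica, so only containers the admin has deliberately closed become candidates for deletion.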
