DaveTeng0 opened a new pull request, #5530: URL: https://github.com/apache/ozone/pull/5530
## What changes were proposed in this pull request?

Currently, a long-lived client can add most or all nodes of a small cluster to its exclude list, after which further writes using that client instance fail.

This PR adds a timeout for RATIS keys that removes datanodes from the exclude list so that they can be retried later. (For EC keys, this mechanism already exists and defaults to 10 minutes.) The improvement is especially relevant for the S3 gateway, which uses a persistent Ozone client to connect to the cluster.

A second improvement allows a write to fall back to datanodes in the exclude list if those are the only ones available. It would be implemented as a retry from the client, based on the server's initial error response. That work is split into a separate PR: https://github.com/apache/ozone/pull/5514

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9323

## How was this patch tested?

Unit test.
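
For readers unfamiliar with the idea, the exclude-list timeout described under the proposed changes can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: the class `ExpiringExcludeList`, its methods, and the injected clock are all hypothetical names, and datanodes are represented by plain strings rather than Ozone's own types.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of an exclude list whose entries expire; not Ozone's class. */
final class ExpiringExcludeList {
  // Maps a datanode identifier to the time (in millis) at which it was excluded.
  private final Map<String, Long> excludedAt = new ConcurrentHashMap<>();
  private final long expiryMillis;
  private final LongSupplier clock;

  ExpiringExcludeList(long expiryMillis, LongSupplier clock) {
    this.expiryMillis = expiryMillis;
    this.clock = clock;
  }

  /** Records a datanode as excluded, stamped with the current time. */
  void add(String datanodeId) {
    excludedAt.put(datanodeId, clock.getAsLong());
  }

  /** Drops entries older than the expiry, then returns a copy of what remains. */
  Set<String> getExcluded() {
    long now = clock.getAsLong();
    excludedAt.entrySet().removeIf(e -> now - e.getValue() >= expiryMillis);
    return new HashSet<>(excludedAt.keySet());
  }

  /** True if the datanode is still excluded at the current time. */
  boolean isExcluded(String datanodeId) {
    return getExcluded().contains(datanodeId);
  }
}
```

Expired entries are pruned lazily on read, so a long-lived client (such as the S3 gateway's) would start retrying a datanode once its entry ages out, without needing a background thread. Injecting the clock as a `LongSupplier` is a choice made here only to keep the sketch easy to unit test.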
