ivandika3 opened a new pull request, #6560: URL: https://github.com/apache/ozone/pull/6560
## What changes were proposed in this pull request?

The Ratis WriteLog retry count was found to be "0/0", which means the WriteLog does not retry at all. Instead, the datanode triggers a pipeline failure and closes the pipeline. During periods of high IO, this can cause datanodes to send a large number of pipeline close events. Our cluster hit this issue, which led to pipeline thrashing (pipelines were continuously closed and recreated).

The root cause is that `nodeFailureTimeoutMs` was initialized after `newRaftProperties` and `setStateMachineDataConfigurations` were called, so its value was not yet set when syncTimeoutRetry was calculated. This patch fixes the ordering so that syncTimeoutRetry is calculated correctly (default: 30 retries).

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10717

## How was this patch tested?

Clean CI: https://github.com/ivandika3/ozone/actions/runs/8752866026
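
The ordering problem described in this pull request can be illustrated with a minimal, self-contained Java sketch. It is not the Ozone code: the class names, the `computeSyncTimeoutRetry` helper, and both timeout constants are hypothetical, chosen only so that the corrected result matches the default of 30 mentioned above. The sketch assumes the retry count is derived by dividing the node failure timeout by a sync timeout, and shows how reading a field before it is assigned silently yields 0.

```java
public class RetryOrderingSketch {
    // Hypothetical values for illustration only; not read from any Ozone config.
    static final long NODE_FAILURE_TIMEOUT_MS = 300_000L;
    static final long SYNC_TIMEOUT_MS = 10_000L;

    /** Derives the retry count BEFORE the timeout field is assigned. */
    static class BuggyServer {
        private long nodeFailureTimeoutMs; // Java default for a long field is 0
        final int syncTimeoutRetry;

        BuggyServer() {
            // The field still holds its default value here, so the result is 0.
            this.syncTimeoutRetry = computeSyncTimeoutRetry();
            this.nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS;
        }

        private int computeSyncTimeoutRetry() {
            return (int) (nodeFailureTimeoutMs / SYNC_TIMEOUT_MS);
        }
    }

    /** Assigns the timeout field first, then derives the retry count. */
    static class FixedServer {
        private long nodeFailureTimeoutMs;
        final int syncTimeoutRetry;

        FixedServer() {
            this.nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS;
            this.syncTimeoutRetry = computeSyncTimeoutRetry();
        }

        private int computeSyncTimeoutRetry() {
            return (int) (nodeFailureTimeoutMs / SYNC_TIMEOUT_MS);
        }
    }

    public static void main(String[] args) {
        System.out.println("buggy retry = " + new BuggyServer().syncTimeoutRetry); // 0
        System.out.println("fixed retry = " + new FixedServer().syncTimeoutRetry); // 30
    }
}
```

Because the compiler accepts both orderings and the buggy one raises no error at runtime, a mistake like this tends to surface only through its effects, such as a retry count of 0.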
