ivandika3 opened a new pull request, #6560:
URL: https://github.com/apache/ozone/pull/6560

   ## What changes were proposed in this pull request?
   
   The Ratis WriteLog retry count was found to be "0/0", which means the 
WriteLog never retries; instead, the datanode triggers a pipeline failure and 
closes the pipeline. Under high IO load this can cause the datanodes to send a 
large number of pipeline close events. Our cluster hit this issue, which 
resulted in pipeline thrashing (pipelines were continuously closed and 
recreated).
   
   The cause is that nodeFailureTimeoutMs is initialized after 
newRaftProperties and setStateMachineDataConfigurations have already run, so 
the retry count is computed before the timeout has been assigned its value.
   
   This change fixes the ordering so that syncTimeoutRetry is calculated 
correctly (default: 30 retries).
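
   The ordering problem described above can be illustrated with a minimal, 
hypothetical sketch. It is not the actual Ozone code: the class name, the 
method bodies, and the 300 s / 10 s timeout values are assumptions chosen so 
that the division yields the default of 30 retries mentioned in this 
description. It only shows how deriving a retry count from a field that has 
not been assigned yet produces 0.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and values do not mirror the real Ozone classes.
class InitOrderSketch {
  // Assumed timeouts, for illustration only.
  static final long NODE_FAILURE_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(300);
  static final long SYNC_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(10);

  private long nodeFailureTimeoutMs; // a long field is 0 until assigned
  private int syncTimeoutRetry;

  // Derives the retry count from whatever the field holds at call time.
  private void setStateMachineDataConfigurations() {
    syncTimeoutRetry = (int) (nodeFailureTimeoutMs / SYNC_TIMEOUT_MS);
  }

  // Problematic order: the retry count is derived before the timeout is set.
  static InitOrderSketch buggyOrder() {
    InitOrderSketch s = new InitOrderSketch();
    s.setStateMachineDataConfigurations();            // reads 0
    s.nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS; // assigned too late
    return s;
  }

  // Corrected order: the timeout is assigned first, then the count is derived.
  static InitOrderSketch fixedOrder() {
    InitOrderSketch s = new InitOrderSketch();
    s.nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS;
    s.setStateMachineDataConfigurations();            // reads 300000
    return s;
  }

  int getSyncTimeoutRetry() {
    return syncTimeoutRetry;
  }

  public static void main(String[] args) {
    System.out.println("buggy order: " + buggyOrder().getSyncTimeoutRetry());
    System.out.println("fixed order: " + fixedOrder().getSyncTimeoutRetry());
  }
}
```

   Under these assumed values the first call reports 0 retries and the second 
reports 30, which is the symptom and the intended result described above.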
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-10717
   
   ## How was this patch tested?
   
   Clean CI: https://github.com/ivandika3/ozone/actions/runs/8752866026
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

