[ 
https://issues.apache.org/jira/browse/HDDS-10717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-10717:
-------------------------------
    Description: 
The Ratis WriteLog retry count was found to be "0/0", which means that WriteLog 
is not retried at all and the datanode instead triggers a pipeline failure to 
close the pipeline. This can cause the datanodes to send a large number of 
pipeline close events during periods of high IO. Our cluster hit this issue, 
which led to pipeline thrashing (pipelines were continuously closed and 
re-created).

The cause is that nodeFailureTimeoutMs is initialized after newRaftProperties 
and setStateMachineDataConfigurations are called, so it is not yet set when 
syncTimeoutRetry is calculated.

The ordering needs to be fixed so that syncTimeoutRetry is calculated 
correctly (30 retries by default).
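
The ordering problem described above can be illustrated with a minimal, 
standalone Java sketch. It is not the Ozone code: the class InitOrderSketch, 
the helper computeSyncTimeoutRetry, the two timeout constants, and the 
assumption that the retry count is derived as nodeFailureTimeoutMs divided by 
a sync timeout are all hypothetical, chosen only so that the correct ordering 
yields the default of 30 mentioned above.

```java
public class InitOrderSketch {
  // Hypothetical values, picked only so that the quotient is 30.
  static final long NODE_FAILURE_TIMEOUT_MS = 300_000L;
  static final long SYNC_TIMEOUT_MS = 10_000L;

  public static class Server {
    // A long field holds 0 until it is explicitly assigned.
    private long nodeFailureTimeoutMs;
    public final int syncTimeoutRetry;

    public Server(boolean initTimeoutFirst) {
      if (initTimeoutFirst) {
        // Fixed ordering: the timeout is set before it is read.
        nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS;
        syncTimeoutRetry = computeSyncTimeoutRetry();
      } else {
        // Buggy ordering: the retry count is derived while the field is still 0.
        syncTimeoutRetry = computeSyncTimeoutRetry();
        nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS;
      }
    }

    // Assumed derivation: how many sync timeouts fit into the failure timeout.
    private int computeSyncTimeoutRetry() {
      return (int) (nodeFailureTimeoutMs / SYNC_TIMEOUT_MS);
    }
  }

  public static void main(String[] args) {
    System.out.println("buggy order: " + new Server(false).syncTimeoutRetry);
    System.out.println("fixed order: " + new Server(true).syncTimeoutRetry);
  }
}
```

Under these assumptions the buggy ordering yields a retry count of 0 and the 
fixed ordering yields 30. No exception is thrown in the buggy case, which is 
why this kind of ordering mistake can go unnoticed until it shows up in 
runtime behavior.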

  was:
The Ratis WriteLog retry count was found to be "0/0", which means that WriteLog 
is not retried at all and the datanode instead triggers a pipeline failure to 
close the pipeline. This might explain why the datanodes send a large number 
of pipeline close events during periods of high IO.

The cause is that nodeFailureTimeoutMs is initialized after newRaftProperties 
and setStateMachineDataConfigurations are called, so it is not yet set when 
syncTimeoutRetry is calculated.

The ordering needs to be fixed so that syncTimeoutRetry is calculated 
correctly (30 retries by default).


> nodeFailureTimeoutMs should be initialized before syncTimeoutRetry
> ------------------------------------------------------------------
>
>                 Key: HDDS-10717
>                 URL: https://issues.apache.org/jira/browse/HDDS-10717
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: DN, Ozone Datanode
>    Affects Versions: 1.4.0
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major


