[ https://issues.apache.org/jira/browse/HDDS-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Andika updated HDDS-9821:
------------------------------
Description:
In XceiverServerRatis#newRaftProperties, setSyncTimeoutRetry is set twice.
First, it is set to
{code:java}
(int) nodeFailureTimeoutMs / dataSyncTimeout.toIntExact(TimeUnit.MILLISECONDS)
{code}
which by default evaluates to 300_000 ms / 10_000 ms = 30 retries.
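For illustration, here is a minimal, self-contained sketch (not the actual XceiverServerRatis code) of how this finite retry count is derived from the node failure timeout and the data sync timeout via the Ratis RaftServerConfigKeys API, using the 300 s / 10 s defaults quoted above:
{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.server.RaftServerConfigKeys;
import org.apache.ratis.util.TimeDuration;

public class SyncTimeoutRetrySketch {
  public static void main(String[] args) {
    final RaftProperties properties = new RaftProperties();

    // Defaults quoted above: 300_000 ms node failure timeout, 10_000 ms data sync timeout.
    final long nodeFailureTimeoutMs = 300_000L;
    final TimeDuration dataSyncTimeout =
        TimeDuration.valueOf(10_000, TimeUnit.MILLISECONDS);

    // Finite retry count derived from the two timeouts: 300_000 / 10_000 = 30.
    final int syncTimeoutRetry =
        (int) (nodeFailureTimeoutMs / dataSyncTimeout.toIntExact(TimeUnit.MILLISECONDS));

    RaftServerConfigKeys.Log.StateMachineData.setSyncTimeout(properties, dataSyncTimeout);
    RaftServerConfigKeys.Log.StateMachineData.setSyncTimeoutRetry(properties, syncTimeoutRetry);

    System.out.println("syncTimeoutRetry = " + syncTimeoutRetry); // prints 30
  }
}
{code}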
From the comment, the intention of setting a finite number of retries is:
{quote}
Even if the leader is not able to complete write calls within the timeout seconds, it should just fail the operation and trigger pipeline close. failing the writeStateMachine call with limited retries will ensure even the leader initiates a pipeline close if its not able to complete write in the timeout configured.
{quote}
However, it is then overridden by
{code:java}
int numSyncRetries = conf.getInt(
    OzoneConfigKeys.DFS_CONTAINER_RATIS_STATEMACHINEDATA_SYNC_RETRIES,
    OzoneConfigKeys.DFS_CONTAINER_RATIS_STATEMACHINEDATA_SYNC_RETRIES_DEFAULT);
RaftServerConfigKeys.Log.StateMachineData.setSyncTimeoutRetry(properties,
    numSyncRetries);
{code}
which sets it to the default value of -1 (retry indefinitely).
This might cause the leader to never initiate a pipeline close when its
writeStateMachine calls time out (e.g. a write chunk timeout due to an I/O issue).
I propose we use the finite sync timeout retry count and drop the
DFS_CONTAINER_RATIS_STATEMACHINEDATA_SYNC_RETRIES configuration.
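As a rough sketch of the proposed direction (a hypothetical helper, not a patch; the class and method names below are made up for illustration), the retry count would be derived once from the two timeouts and the config-based override removed:
{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.server.RaftServerConfigKeys;
import org.apache.ratis.util.TimeDuration;

// Hypothetical helper illustrating the proposal: derive a finite sync timeout
// retry count once and do not override it afterwards with the value read from
// DFS_CONTAINER_RATIS_STATEMACHINEDATA_SYNC_RETRIES (default -1).
final class FiniteSyncTimeoutRetry {
  static void apply(RaftProperties properties, long nodeFailureTimeoutMs,
      TimeDuration dataSyncTimeout) {
    final int syncTimeoutRetry =
        (int) (nodeFailureTimeoutMs / dataSyncTimeout.toIntExact(TimeUnit.MILLISECONDS));
    RaftServerConfigKeys.Log.StateMachineData.setSyncTimeoutRetry(properties, syncTimeoutRetry);
    // With a finite value, a leader whose writeStateMachine keeps timing out
    // eventually fails the write and can trigger a pipeline close instead of
    // retrying indefinitely.
  }
}
{code}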
This is also a good opportunity to re-evaluate the state machine data policy in
the Container State Machine.
> XceiverServerRatis SyncTimeoutRetry is overridden
> --------------------------------------------------
>
> Key: HDDS-9821
> URL: https://issues.apache.org/jira/browse/HDDS-9821
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Fix For: 1.4.0