[ 
https://issues.apache.org/jira/browse/HDDS-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-9823:
------------------------------
    Description: 
XceiverServerRatis#handlePipelineFailure is called in the following
ContainerStateMachine (CSM) failure scenarios:
 * XceiverServerRatis#handleNodeSlowness
 ** Called from StateMachine#notifyFollowerSlowness
 ** Timeout is set by hdds.ratis.rpc.slowness.timeout (default value 300s)
 *** Note: the Ratis default value is 60s
 * XceiverServerRatis#handleNoLeader
 ** Called from StateMachine#notifyExtendedNoLeader
 ** Timeout is set by hdds.ratis.notification.no-leader.timeout (default value 300s)
 *** Note: the Ratis default value is 60s
 * XceiverServerRatis#handleInstallSnapshotFromLeader
 ** Called from StateMachine#notifyInstallSnapshotFromLeader

Currently, XceiverServerRatis#handlePipelineFailure does not trigger a
heartbeat to SCM immediately. Instead, the pipeline close action is sent with
the next scheduled heartbeat (default interval 60s). Until then, SCM might
still allocate blocks to these "failed" pipelines, which can affect clients
writing to those blocks.

To minimize the impact on clients and on the datanodes in the failed pipeline,
I suggest that the datanode trigger a heartbeat immediately whenever a
pipeline close action is queued due to pipeline failure.

  was:
XceiverServerRatis#handlePipelineFailure is called in CSM failure scenarios
 * XceiverServerRatis#handleNodeSlowness
 ** From StateMachine#notifyFollowerSlowness 
 ** Set to hdds.ratis.rpc.slowness.timeout (default value 300s)
 *** Note: Ratis default value is 60s
 * XceiverServerRatis#handleNoLeader
 ** From StateMachine#notifyExtendedNoLeader
 ** Set to hdds.ratis.notification.no-leader.timeout (default value 300s)
 *** Note: Ratis default value is 60s
 * XceiverServerRatis#handleInstallSnapshotFromLeader
 ** From StateMachine#notifyInstallSnapshotFromLeader

The possible issue is that XceiverServerRatis#handlePipelineFailure does not 
trigger Heartbeat to SCM immediately. Instead, it waits until the next 
heartbeat (default 60s) to send the pipeline close action command. This might 
cause SCM to still allocate blocks to these "failed" pipelines during this 
duration which might impact on client writing to these blocks.

To minimize the impact on the client and the datanodes on the failed pipeline. 
I suggest that the datanode trigger the pipeline close command immediately for 
every pipeline action close command triggered due to pipeline failure.


> Pipeline failure should trigger heartbeat immediately
> -----------------------------------------------------
>
>                 Key: HDDS-9823
>                 URL: https://issues.apache.org/jira/browse/HDDS-9823
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Datanode, SCM
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>              Labels: pull-request-available
>
> XceiverServerRatis#handlePipelineFailure is called in the following
> ContainerStateMachine (CSM) failure scenarios:
>  * XceiverServerRatis#handleNodeSlowness
>  ** Called from StateMachine#notifyFollowerSlowness
>  ** Timeout is set by hdds.ratis.rpc.slowness.timeout (default value 300s)
>  *** Note: the Ratis default value is 60s
>  * XceiverServerRatis#handleNoLeader
>  ** Called from StateMachine#notifyExtendedNoLeader
>  ** Timeout is set by hdds.ratis.notification.no-leader.timeout (default value 300s)
>  *** Note: the Ratis default value is 60s
>  * XceiverServerRatis#handleInstallSnapshotFromLeader
>  ** Called from StateMachine#notifyInstallSnapshotFromLeader
>
> Currently, XceiverServerRatis#handlePipelineFailure does not trigger a
> heartbeat to SCM immediately. Instead, the pipeline close action is sent with
> the next scheduled heartbeat (default interval 60s). Until then, SCM might
> still allocate blocks to these "failed" pipelines, which can affect clients
> writing to those blocks.
>
> To minimize the impact on clients and on the datanodes in the failed
> pipeline, I suggest that the datanode trigger a heartbeat immediately
> whenever a pipeline close action is queued due to pipeline failure.
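
To illustrate the gap described in the issue, here is a minimal simulation
sketch in Java. It is not the Ozone implementation: HeartbeatScheduler,
addPipelineCloseAction, triggerHeartbeat and closeActionDelayMs are
hypothetical names invented for this example, and the clock is simulated in
1-second steps. The only value taken from the issue is the 60s heartbeat
interval. The sketch compares how long SCM waits for a pipeline close action
when it rides on the next scheduled heartbeat versus when the failure handler
triggers a heartbeat right away.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the behaviour discussed in HDDS-9823; all names are hypothetical. */
public class PipelineFailureHeartbeatSketch {

  /** Hypothetical stand-in for the datanode's heartbeat scheduler. */
  static final class HeartbeatScheduler {
    private final long intervalMs;
    private long nextHeartbeatAtMs;
    private final List<String> pendingActions = new ArrayList<>();
    private final List<String> deliveredToScm = new ArrayList<>();

    HeartbeatScheduler(long intervalMs) {
      this.intervalMs = intervalMs;
      this.nextHeartbeatAtMs = intervalMs;
    }

    /** Queue a close action; it is only delivered with a heartbeat. */
    void addPipelineCloseAction(String pipelineId) {
      if (!pendingActions.contains(pipelineId)) {
        pendingActions.add(pipelineId);
      }
    }

    /** Proposed behaviour: make the next heartbeat due right now. */
    void triggerHeartbeat(long nowMs) {
      nextHeartbeatAtMs = nowMs;
    }

    /** Advance the simulated clock; send a heartbeat if one is due. */
    boolean tick(long nowMs) {
      if (nowMs < nextHeartbeatAtMs) {
        return false;
      }
      deliveredToScm.addAll(pendingActions);
      pendingActions.clear();
      nextHeartbeatAtMs = nowMs + intervalMs;
      return true;
    }

    List<String> deliveredToScm() {
      return deliveredToScm;
    }
  }

  /** Milliseconds between the pipeline failure and SCM receiving the close action. */
  static long closeActionDelayMs(long intervalMs, long failureAtMs,
                                 boolean triggerImmediately) {
    HeartbeatScheduler hb = new HeartbeatScheduler(intervalMs);
    boolean failed = false;
    for (long now = 0; ; now += 1_000) {
      if (!failed && now >= failureAtMs) {
        failed = true;
        hb.addPipelineCloseAction("pipeline-1");
        if (triggerImmediately) {
          hb.triggerHeartbeat(now);
        }
      }
      hb.tick(now);
      if (!hb.deliveredToScm().isEmpty()) {
        return now - failureAtMs;
      }
    }
  }

  public static void main(String[] args) {
    long intervalMs = 60_000;  // heartbeat interval stated in the issue
    long failureAtMs = 5_000;  // assumed failure time for this example
    System.out.println("wait for next heartbeat: "
        + closeActionDelayMs(intervalMs, failureAtMs, false) + " ms");
    System.out.println("trigger immediately:     "
        + closeActionDelayMs(intervalMs, failureAtMs, true) + " ms");
  }
}
```

In this model a failure 5s into a heartbeat cycle leaves SCM unaware of the
failed pipeline for 55s without the trigger, and for 0s with it. That window
is the period during which, per the issue, SCM might still allocate blocks to
the failed pipeline.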



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
