Sumit Agrawal created HDDS-12469:
------------------------------------
Summary: fail fast for write block stuck
Key: HDDS-12469
URL: https://issues.apache.org/jira/browse/HDDS-12469
Project: Apache Ozone
Issue Type: Sub-task
Components: Ozone Datanode
Reporter: Sumit Agrawal
In a follower, ContainerStateMachine's write() returns a future, which
actually performs the block/chunk write.
As part of this write, it:
* can create the container if it does not exist
* writes the block chunk to disk
Under a disk-full or low-disk condition, the write chunk takes a very long
time to process and appears to be stuck.
From the JMX metrics for the DNs, the time taken (ns) is observed to be on the
order of 10^14, 10^13, ..., that is, 100k seconds, 10k seconds, ..., which shows
the process is really stuck and unable to recover.
{code:java}
jmxnode1_p1: "WriteStateMachineDataNsAvgTime" : 1.0438595905348E14
jmxnode2_p2: "WriteStateMachineDataNsAvgTime" : 2.2966696397828832E13
jmxnode2_p3: "WriteStateMachineDataNsAvgTime" : 1.4061009948751E13
jmxnode3_p4: "WriteStateMachineDataNsAvgTime" : 1.0024869351741E13
... {code}
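To make the magnitudes concrete, the averages above can be converted from nanoseconds to wall-clock time. The following is a small illustrative sketch using only the JDK; the class name WriteLatency and its method are hypothetical and are not part of Ozone. For the first sample it yields 104385 seconds, which is about 29 hours.

```java
import java.time.Duration;

// Illustrative helper (hypothetical, not part of Ozone): converts the
// nanosecond averages reported by the JMX metric into a Duration.
class WriteLatency {
    static Duration fromNanos(double avgNanos) {
        // Fractional nanoseconds are truncated by the cast.
        return Duration.ofNanos((long) avgNanos);
    }

    public static void main(String[] args) {
        // WriteStateMachineDataNsAvgTime values copied from the metrics above.
        double[] samples = {
            1.0438595905348E14, 2.2966696397828832E13,
            1.4061009948751E13, 1.0024869351741E13
        };
        for (double ns : samples) {
            Duration d = fromNanos(ns);
            System.out.printf("%.4e ns = %d s (%d full hours)%n",
                ns, d.getSeconds(), d.toHours());
        }
    }
}
```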
This might be because a volume has failed; it was later observed that a few
volume disks had issues.
From the Ratis logs, it keeps tracking the task and prints a
TimeoutIOException for it every 10 seconds.
{code:java}
org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s:
WriteLog:115: (t:1, i:115), STATEMACHINELOGENTRY, cmdType: WriteChunk traceID:
"" containerID: 18446516 datanodeUuid: "2834c106-e999-4013-9934-a165fdbe41cf"
pipelineID: "f1efe128-22fe-4762-a248-7aebcaa07dff"
...
...{code}
Considering the above scenario:
* Need to mark the pipeline unhealthy if the time taken crosses a certain
threshold (can be 10 min as the max time for a 256MB write, or less), and
trigger pipeline closure
* Need to stop and fail the current task, and avoid accepting further raft logs
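The two points above could be sketched roughly as follows. This is only a minimal illustration built on plain JDK classes, not the actual Ozone or Ratis implementation: WriteTimeoutGuard, guard() and the onUnhealthy callback (standing in for "trigger pipeline closure") are hypothetical names, and the threshold is passed in by the caller.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard, for illustration only; not an Ozone or Ratis class.
class WriteTimeoutGuard {
    private final Duration limit;
    private final Runnable onUnhealthy; // stand-in for pipeline closure
    private final AtomicBoolean healthy = new AtomicBoolean(true);

    WriteTimeoutGuard(Duration limit, Runnable onUnhealthy) {
        this.limit = limit;
        this.onUnhealthy = onUnhealthy;
    }

    boolean isHealthy() {
        return healthy.get();
    }

    <T> CompletableFuture<T> guard(CompletableFuture<T> write) {
        // Once marked unhealthy, reject further writes immediately.
        if (!healthy.get()) {
            return CompletableFuture.failedFuture(
                new IllegalStateException("marked unhealthy, write rejected"));
        }
        // copy() keeps the timeout from completing the caller's own future.
        return write.copy()
            .orTimeout(limit.toMillis(), TimeUnit.MILLISECONDS)
            .whenComplete((result, error) -> {
                if (error instanceof TimeoutException
                        && healthy.compareAndSet(true, false)) {
                    // cancel() only completes the future as cancelled; it does
                    // not interrupt a thread that is blocked on disk I/O.
                    write.cancel(true);
                    onUnhealthy.run();
                }
            });
    }
}
```

With a limit such as Duration.ofMinutes(10), a write future that never completes would fail with a TimeoutException after 10 minutes, the callback would run once, and every later call to guard() would fail immediately. Freeing a thread that is actually blocked on a bad disk would need a separate mechanism, which this sketch does not cover.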
--
This message was sent by Atlassian Jira
(v8.20.10#820010)