[jira] [Created] (HDDS-12469) fail fast for write block stuck

Sumit Agrawal (Jira) Tue, 04 Mar 2025 06:53:33 -0800

Sumit Agrawal created HDDS-12469:
------------------------------------

             Summary: fail fast for write block stuck
                 Key: HDDS-12469
                 URL: https://issues.apache.org/jira/browse/HDDS-12469
             Project: Apache Ozone
          Issue Type: Sub-task
          Components: Ozone Datanode
            Reporter: Sumit Agrawal



On a follower, ContainerStateMachine's write() returns a future, which 
performs the actual block/chunk write.

As part of this write, it
 * can create the container if it does not exist
 * writes the block chunk to disk

 

Under a disk-full or low-disk-space condition, it takes a very long time to 
process the write chunk and appears to be stuck.

From the JMX metrics for the DNs, the time taken (ns) is observed to be on the 
order of 10^14, 10^13, ..., that is, 100k seconds, 10k seconds, ..., which 
shows the process is really stuck and unable to recover.

 
{code:java}
jmxnode1_p1:    "WriteStateMachineDataNsAvgTime" : 1.0438595905348E14
jmxnode2_p2:    "WriteStateMachineDataNsAvgTime" : 2.2966696397828832E13
jmxnode2_p3:    "WriteStateMachineDataNsAvgTime" : 1.4061009948751E13
jmxnode3_p4:    "WriteStateMachineDataNsAvgTime" : 1.0024869351741E13
... {code}
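
To put the values above in perspective, here is a minimal sketch that converts 
the reported nanosecond averages to seconds and hours. The class and method 
names are made up for illustration and are not part of Ozone; only the sample 
values come from the metrics above. The first value works out to roughly 
104,385 seconds, about 29 hours.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not part of Ozone.
class WriteTimeConversion {
    // Converts a nanosecond metric value (as reported via JMX) to a Duration.
    static Duration fromNanos(double nanos) {
        return Duration.ofNanos((long) nanos);
    }

    public static void main(String[] args) {
        // WriteStateMachineDataNsAvgTime samples copied from the JMX output above.
        double[] samples = {
            1.0438595905348E14,
            2.2966696397828832E13,
            1.4061009948751E13,
            1.0024869351741E13
        };
        for (double ns : samples) {
            Duration d = fromNanos(ns);
            System.out.printf("%.4e ns = %d s (%d full hours)%n",
                ns, d.getSeconds(), d.toHours());
        }
    }
}
```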
 

This might be because a volume has failed; it was later observed that a few 
volume disks had issues.

 

From the Ratis logs, it keeps tracking the task and prints a 
TimeoutIOException for it every 10 sec.
{code:java}
org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s: 
WriteLog:115: (t:1, i:115), STATEMACHINELOGENTRY, cmdType: WriteChunk traceID: 
"" containerID: 18446516 datanodeUuid: "2834c106-e999-4013-9934-a165fdbe41cf" 
pipelineID: "f1efe128-22fe-4762-a248-7aebcaa07dff" 
...
...{code}
Considering the above scenario,
 * Need to mark the pipeline unhealthy if the time taken crosses a certain 
threshold (can be 10 min as the max time for a 256MB write, or less), and 
trigger pipeline closure
 * Need to stop and fail the current task, and avoid accepting further raft 
logs
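
The two points above could be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the proposed Ozone change: 
FailFastWriteGuard, guard() and the onStuck callback are hypothetical names, 
and the only real API used is java.util.concurrent.CompletableFuture 
(orTimeout and failedFuture need Java 9 or later). The guard fails a write 
future that exceeds the threshold, marks itself unhealthy once, runs a 
callback (for example, one that triggers pipeline closure), and rejects later 
writes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical sketch; not an Ozone or Ratis class.
class FailFastWriteGuard {
    private final long timeoutMillis;
    // Assumed hook, e.g. something that triggers pipeline closure.
    private final Runnable onStuck;
    private final AtomicBoolean unhealthy = new AtomicBoolean(false);

    FailFastWriteGuard(long timeoutMillis, Runnable onStuck) {
        this.timeoutMillis = timeoutMillis;
        this.onStuck = onStuck;
    }

    boolean isUnhealthy() {
        return unhealthy.get();
    }

    // Wraps a write so that it fails once it exceeds the threshold.
    <T> CompletableFuture<T> guard(Supplier<CompletableFuture<T>> write) {
        if (unhealthy.get()) {
            // Avoid accepting further writes after a stuck write was seen.
            return CompletableFuture.failedFuture(
                new IllegalStateException("marked unhealthy; rejecting write"));
        }
        // orTimeout completes the same future exceptionally with a
        // TimeoutException if it is still pending after the timeout.
        CompletableFuture<T> timed =
            write.get().orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
        return timed.whenComplete((result, error) -> {
            if (error instanceof TimeoutException
                    && unhealthy.compareAndSet(false, true)) {
                onStuck.run(); // runs at most once
            }
        });
    }
}
```

With the 10 min threshold mentioned above, the guard would be created with 
Duration.ofMinutes(10).toMillis(). Note that failing the future does not by 
itself interrupt a thread blocked on disk I/O; stopping the actual task would 
need separate handling.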

 



