[ 
https://issues.apache.org/jira/browse/RATIS-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated RATIS-1947:
-------------------------------
    Component/s: Leader

> TimeoutIOException in WriteLog does not release Pending Requests
> ----------------------------------------------------------------
>
>                 Key: RATIS-1947
>                 URL: https://issues.apache.org/jira/browse/RATIS-1947
>             Project: Ratis
>          Issue Type: Bug
>          Components: Leader
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>         Attachments: image-2023-12-01-12-04-24-409.png
>
>
> Recently, I/O issues in our cluster caused WriteLog to time out with the 
> following exception:
> {code:java}
> 2023-11-27 18:05:36,593 
> [dac3eec1-7886-4278-b0db-d6b62a046d44@group-2BEC613EA29B-SegmentedRaftLogWorker]
>  WARN org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> Timeout 0/~
> org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s: 
> WriteLog:337065866: (t:7, i:337065866), STATEMACHINELOGENTRY, cmdType: 
> WriteChunk traceID: "" containerID: 3470188 datanodeUuid: 
> "18b90e19-6a7d-4858-a907-4acdf04a3949" pipelineID: 
> "9de85e00-968d-4b32-a589-2bec613ea29b" writeChunk { blockID { containerID: 
> 3470188 localID: 107544269209449942 blockCommitSequenceId: 337065456 } 
> chunkData { chunkName: "107544269209449942_chunk_9" offset: 33554432 len: 
> 4194304 checksumData { type: CRC32 bytesPerChecksum: 1048576 checksums: 
> "\224\316\216\327" checksums: "\326\223/\201" checksums: "O\3374\336" 
> checksums: "\313\231Bf" } } }, container 
> path=/mnt/disk/3/ozone/hdds/CID-354bf147-15b8-4d31-9ff0-c0bce64451cf/current/containerDir121/3470188-writeStateMachineData
>         at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:89)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StateMachineDataPolicy.getFromFuture(SegmentedRaftLogWorker.java:89)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.execute(SegmentedRaftLogWorker.java:561)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:321)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.TimeoutException
>         at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
>         at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:80)
>         ... 4 more
> {code}
> After these exceptions, the number of pending requests in the leader's queue 
> stayed stuck and did not return to normal, even after I/O performance 
> recovered.
> !image-2023-12-01-12-04-24-409.png!
> Meanwhile, the leader's pending request limit hit count did not increase, 
> which means that no client requests failed with ResourceUnavailableException 
> (although there were earlier instances when these limits were hit). It seems 
> that the client operations are blocked.
> After looking at the code for a while, I suspect that the TimeoutIOException 
> in WriteLog does not release the leader's pending requests, which causes the 
> client operations to be blocked. However, I'm not entirely sure what the 
> expected behavior is when a TimeoutIOException is thrown.
> Our Ozone version still uses a Ratis version based on the 2.4.1 release.
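
The suspected failure mode in the description can be illustrated with a minimal, self-contained Java sketch. It is not Ratis code: `PendingLimit`, `submitStalledWrite`, and the `failOnTimeout` flag are hypothetical names invented for this illustration, and the sketch only assumes that a pending-request permit is released when the request's reply future completes. If a timed wait on the state machine data write times out and the reply future is never completed, the permit is never returned.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class PendingRequestLeakSketch {
    /** Hypothetical stand-in for a leader-side pending request limit. */
    static final class PendingLimit {
        private final Semaphore permits;

        PendingLimit(int max) {
            permits = new Semaphore(max);
        }

        boolean tryAdd() {
            return permits.tryAcquire();
        }

        void release() {
            permits.release();
        }

        int available() {
            return permits.availablePermits();
        }
    }

    /**
     * Submits one request whose state machine data write never finishes
     * and returns the number of permits available afterwards.
     */
    static int submitStalledWrite(PendingLimit limit, boolean failOnTimeout) throws Exception {
        if (!limit.tryAdd()) {
            throw new IllegalStateException("pending request limit reached");
        }
        // Assumption: the permit is released only when the reply future completes,
        // whether normally or exceptionally.
        CompletableFuture<Void> reply = new CompletableFuture<>();
        reply.whenComplete((result, error) -> limit.release());

        // Never completed: simulates a write stalled on slow I/O.
        CompletableFuture<Void> stateMachineWrite = new CompletableFuture<>();
        try {
            stateMachineWrite.get(50, TimeUnit.MILLISECONDS);
            reply.complete(null);
        } catch (TimeoutException e) {
            if (failOnTimeout) {
                // Failing the reply runs the callback above and frees the permit.
                reply.completeExceptionally(e);
            }
            // Otherwise the timeout is swallowed, the reply stays incomplete,
            // and the permit is leaked.
        }
        return limit.available();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timeout swallowed, permits left: "
            + submitStalledWrite(new PendingLimit(1), false));
        System.out.println("reply failed on timeout, permits left: "
            + submitStalledWrite(new PendingLimit(1), true));
    }
}
```

With a limit of one, the first call leaves no permits available, so any further request would be blocked or rejected; the second call returns the permit because the reply is failed on timeout. Whether Ratis should fail the pending request in this situation is exactly the open question raised in the report.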



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
