Ivan Andika created RATIS-1947:
----------------------------------
Summary: TimeoutIOException in WriteLog does not release Pending Requests
Key: RATIS-1947
URL: https://issues.apache.org/jira/browse/RATIS-1947
Project: Ratis
Issue Type: Bug
Reporter: Ivan Andika
Assignee: Ivan Andika
Attachments: image-2023-12-01-12-04-24-409.png
Recently there were some I/O issues in our cluster that caused WriteLog to time
out with the following exception:
{code:java}
2023-11-27 18:05:36,593
[dac3eec1-7886-4278-b0db-d6b62a046d44@group-2BEC613EA29B-SegmentedRaftLogWorker]
WARN org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: Timeout
0/~
org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s:
WriteLog:337065866: (t:7, i:337065866), STATEMACHINELOGENTRY, cmdType:
WriteChunk traceID: "" containerID: 3470188 datanodeUuid:
"18b90e19-6a7d-4858-a907-4acdf04a3949" pipelineID:
"9de85e00-968d-4b32-a589-2bec613ea29b" writeChunk { blockID { containerID:
3470188 localID: 107544269209449942 blockCommitSequenceId: 337065456 }
chunkData { chunkName: "107544269209449942_chunk_9" offset: 33554432 len:
4194304 checksumData { type: CRC32 bytesPerChecksum: 1048576 checksums:
"\224\316\216\327" checksums: "\326\223/\201" checksums: "O\3374\336"
checksums: "\313\231Bf" } } }, container
path=/mnt/disk/3/ozone/hdds/CID-354bf147-15b8-4d31-9ff0-c0bce64451cf/current/containerDir121/3470188-writeStateMachineData
at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:89)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StateMachineDataPolicy.getFromFuture(SegmentedRaftLogWorker.java:89)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.execute(SegmentedRaftLogWorker.java:561)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:321)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:80)
... 4 more
{code}
After these exceptions, the number of pending requests in the leader's queue
stayed stuck and did not return to normal, even after I/O performance
recovered.
!image-2023-12-01-12-04-24-409.png!
Meanwhile, the leader's pending request limit hit count did not increase, which
means that no client requests failed with ResourceUnavailableException
(although there were earlier instances when these limits were hit). It seems
that the client operations are blocked.
After looking at the code for a while, I suspect that a TimeoutIOException in
WriteLog does not release the leader's pending requests, which causes the
client operations to be blocked. However, I'm not entirely sure what the
expected behavior is when TimeoutIOException is thrown.
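
To make the suspicion concrete, here is a minimal, self-contained sketch of the pattern I have in mind. It is not Ratis code: PendingRequests, awaitWrite, leakyWrite and safeWrite are hypothetical names, and a plain Semaphore stands in for the leader's pending request limit. It only shows how a permit taken before a timed wait can stay held if the timeout path skips the release.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PendingRequestLeakSketch {
  /** Hypothetical stand-in for the leader's pending request limit. */
  static final class PendingRequests {
    private final int limit;
    private final Semaphore permits;

    PendingRequests(int limit) {
      this.limit = limit;
      this.permits = new Semaphore(limit);
    }

    boolean tryAcquire() { return permits.tryAcquire(); }
    void release() { permits.release(); }
    int inUse() { return limit - permits.availablePermits(); }
  }

  /** Waits for the write and turns a timeout into an IOException. */
  static void awaitWrite(CompletableFuture<Void> write, long timeoutMs) throws IOException {
    try {
      write.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      throw new IOException("Timeout " + timeoutMs + "ms", e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted", e);
    } catch (ExecutionException e) {
      throw new IOException(e.getCause());
    }
  }

  /** Suspected pattern: the permit is returned only on the success path. */
  static boolean leakyWrite(PendingRequests pending, CompletableFuture<Void> write, long timeoutMs) {
    if (!pending.tryAcquire()) {
      return false;
    }
    try {
      awaitWrite(write, timeoutMs);
      pending.release();
      return true;
    } catch (IOException e) {
      return false; // the permit is never returned on this path
    }
  }

  /** One possible alternative: always return the permit, whatever the outcome. */
  static boolean safeWrite(PendingRequests pending, CompletableFuture<Void> write, long timeoutMs) {
    if (!pending.tryAcquire()) {
      return false;
    }
    try {
      awaitWrite(write, timeoutMs);
      return true;
    } catch (IOException e) {
      return false;
    } finally {
      pending.release();
    }
  }

  public static void main(String[] args) {
    PendingRequests leaky = new PendingRequests(2);
    leakyWrite(leaky, new CompletableFuture<Void>(), 10); // future never completes, so this times out
    System.out.println("leaky in use: " + leaky.inUse());

    PendingRequests safe = new PendingRequests(2);
    safeWrite(safe, new CompletableFuture<Void>(), 10);
    System.out.println("safe in use: " + safe.inUse());
  }
}
```

In this sketch each timed-out leakyWrite call leaves one permit held, so a burst of timeouts would eventually exhaust the limit and later callers could not get a permit. Whether the real fix is to release the pending request on timeout, or to fail it back to the client in some other way, is the open question above.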