[
https://issues.apache.org/jira/browse/RATIS-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-1947:
-------------------------------
Component/s: Leader
> TimeoutIOException in WriteLog does not release Pending Requests
> ----------------------------------------------------------------
>
> Key: RATIS-1947
> URL: https://issues.apache.org/jira/browse/RATIS-1947
> Project: Ratis
> Issue Type: Bug
> Components: Leader
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: image-2023-12-01-12-04-24-409.png
>
>
> Recently there are some I/O issues in our cluster that causes WriteLog to
> time out with this exceptions
> {code:java}
> 2023-11-27 18:05:36,593
> [dac3eec1-7886-4278-b0db-d6b62a046d44@group-2BEC613EA29B-SegmentedRaftLogWorker]
> WARN org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> Timeout 0/~
> org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s:
> WriteLog:337065866: (t:7, i:337065866), STATEMACHINELOGENTRY, cmdType:
> WriteChunk traceID: "" containerID: 3470188 datanodeUuid:
> "18b90e19-6a7d-4858-a907-4acdf04a3949" pipelineID:
> "9de85e00-968d-4b32-a589-2bec613ea29b" writeChunk { blockID { containerID:
> 3470188 localID: 107544269209449942 blockCommitSequenceId: 337065456 }
> chunkData { chunkName: "107544269209449942_chunk_9" offset: 33554432 len:
> 4194304 checksumData { type: CRC32 bytesPerChecksum: 1048576 checksums:
> "\224\316\216\327" checksums: "\326\223/\201" checksums: "O\3374\336"
> checksums: "\313\231Bf" } } }, container
> path=/mnt/disk/3/ozone/hdds/CID-354bf147-15b8-4d31-9ff0-c0bce64451cf/current/containerDir121/3470188-writeStateMachineData
> at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:89)
> at
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StateMachineDataPolicy.getFromFuture(SegmentedRaftLogWorker.java:89)
> at
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.execute(SegmentedRaftLogWorker.java:561)
> at
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:321)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.TimeoutException
> at
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
> at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:80)
> ... 4 more {code}
> After these exceptions, it's seen that the Pending Requests in leader queue
> is stuck and unable to return to normal, even after the I/O performance
> returns to normal
> !image-2023-12-01-12-04-24-409.png!
> While the leader pending request limit hits did not increase, which means
> that no client requests failed with ResourceUnavailableException (but there
> were previous instances when these limits are hit). It seems that the client
> operations are blocked.
> Looking at the codes for a while, I suspected that TimeoutIOException of
> WriteLog did not release the leader's pending requests which causes the
> client operations to be blocked. However, I'm not entirely sure the expected
> behavior when TimeoutIOException is thrown.
> Our Ozone version is still using the Ratis version based on 2.4.1 release.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)