[ https://issues.apache.org/jira/browse/RATIS-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Andika updated RATIS-1947:
-------------------------------
    Description: 
Recently there were some I/O issues in our cluster that caused WriteLog to time
out with the following exception:
{code:java}
2023-11-27 18:05:36,593 
[dac3eec1-7886-4278-b0db-d6b62a046d44@group-2BEC613EA29B-SegmentedRaftLogWorker]
 WARN org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: Timeout 
0/~
org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s: 
WriteLog:337065866: (t:7, i:337065866), STATEMACHINELOGENTRY, cmdType: 
WriteChunk traceID: "" containerID: 3470188 datanodeUuid: 
"18b90e19-6a7d-4858-a907-4acdf04a3949" pipelineID: 
"9de85e00-968d-4b32-a589-2bec613ea29b" writeChunk { blockID { containerID: 
3470188 localID: 107544269209449942 blockCommitSequenceId: 337065456 } 
chunkData { chunkName: "107544269209449942_chunk_9" offset: 33554432 len: 
4194304 checksumData { type: CRC32 bytesPerChecksum: 1048576 checksums: 
"\224\316\216\327" checksums: "\326\223/\201" checksums: "O\3374\336" 
checksums: "\313\231Bf" } } }, container 
path=/mnt/disk/3/ozone/hdds/CID-354bf147-15b8-4d31-9ff0-c0bce64451cf/current/containerDir121/3470188-writeStateMachineData
        at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:89)
        at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StateMachineDataPolicy.getFromFuture(SegmentedRaftLogWorker.java:89)
        at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.execute(SegmentedRaftLogWorker.java:561)
        at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:321)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
        at 
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:80)
        ... 4 more {code}
After these exceptions, the Pending Requests count in the leader's queue is
stuck and does not return to normal, even after the I/O performance recovers.

!image-2023-12-01-12-04-24-409.png!

Meanwhile, the leader's pending-request limit hit count did not increase, which
means that no client requests failed with ResourceUnavailableException
(although there were previous instances when these limits were hit). It appears
that the client operations are blocked.

After looking at the code for a while, I suspect that a TimeoutIOException in
WriteLog does not release the leader's pending requests, which causes client
operations to block. However, I'm not entirely sure what the expected behavior
is when TimeoutIOException is thrown.
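The suspected failure mode can be sketched with a minimal, hypothetical model (this is not the actual Ratis code: the class name, the map, and the cleanup logic here are assumptions for illustration only). If the exception path of the timed {{get}} returns without completing and removing the pending-request entry, the queue size can never shrink back:
{code:java}
import java.util.concurrent.*;

public class PendingRequestLeak {
    // Hypothetical stand-in for the leader's pending-request table.
    static final ConcurrentMap<Long, CompletableFuture<String>> pending =
            new ConcurrentHashMap<>();

    // Simulates writing a log entry whose state-machine data future may time out.
    static void writeLog(long index, CompletableFuture<String> dataFuture) {
        pending.put(index, new CompletableFuture<>());
        try {
            // Corresponds to IOUtils.getFromFuture with a timeout
            // (shortened here so the example runs quickly).
            dataFuture.get(10, TimeUnit.MILLISECONDS);
            // Normal path: the pending request is completed and released.
            pending.remove(index).complete("ok");
        } catch (Exception e) {
            // Suspected bug: the timeout path falls through without
            // releasing the pending request, so the entry leaks.
        }
    }

    public static void main(String[] args) {
        writeLog(1, CompletableFuture.completedFuture("data")); // completes, released
        writeLog(2, new CompletableFuture<>());                 // times out, leaks
        System.out.println("pending after timeout: " + pending.size());
    }
}
{code}
In this sketch the second call leaves one entry behind, which matches the symptom above: the pending-request count stays elevated even after I/O recovers, because nothing on the exception path ever completes or evicts the stuck entries.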

Our Ozone deployment is still using a Ratis version based on the 2.4.1 release.


> TimeoutIOException in WriteLog does not release Pending Requests
> ----------------------------------------------------------------
>
>                 Key: RATIS-1947
>                 URL: https://issues.apache.org/jira/browse/RATIS-1947
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>         Attachments: image-2023-12-01-12-04-24-409.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
