[
https://issues.apache.org/jira/browse/RATIS-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-1947:
-------------------------------
Description:
Recently there have been some I/O issues on our Ozone cluster's datanodes that cause
WriteLog to time out with the following exception:
{code:java}
2023-11-27 18:05:36,593
[dac3eec1-7886-4278-b0db-d6b62a046d44@group-2BEC613EA29B-SegmentedRaftLogWorker]
WARN org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: Timeout
0/~
org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s:
WriteLog:337065866: (t:7, i:337065866), STATEMACHINELOGENTRY, cmdType:
WriteChunk traceID: "" containerID: 3470188 datanodeUuid:
"18b90e19-6a7d-4858-a907-4acdf04a3949" pipelineID:
"9de85e00-968d-4b32-a589-2bec613ea29b" writeChunk { blockID { containerID:
3470188 localID: 107544269209449942 blockCommitSequenceId: 337065456 }
chunkData { chunkName: "107544269209449942_chunk_9" offset: 33554432 len:
4194304 checksumData { type: CRC32 bytesPerChecksum: 1048576 checksums:
"\224\316\216\327" checksums: "\326\223/\201" checksums: "O\3374\336"
checksums: "\313\231Bf" } } }, container
path=/mnt/disk/3/ozone/hdds/CID-354bf147-15b8-4d31-9ff0-c0bce64451cf/current/containerDir121/3470188-writeStateMachineData
    at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:89)
    at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StateMachineDataPolicy.getFromFuture(SegmentedRaftLogWorker.java:89)
    at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.execute(SegmentedRaftLogWorker.java:561)
    at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:321)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
    at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:80)
    ... 4 more {code}
This translates to state machine data write timeouts shown in the metrics
dashboard:
!image-2023-12-01-13-28-24-125.png!
After these exceptions, the Pending Requests gauge for the leader queue is
stuck and does not return to normal, even after I/O performance recovers:
!image-2023-12-01-12-04-24-409.png!
Meanwhile, the leader's pending request limit hit count did not increase, which
means that no client requests failed with ResourceUnavailableException
(although there were earlier instances when these limits were hit). It seems
that the client operations are simply blocked.
My expectation is that once I/O performance recovers, the leader's pending
requests should be removed, which would also release the RequestLimits
semaphore. Looking at LeaderStateImpl, there are several
pending-request-related methods:
* Acquiring pending requests: called in RaftServerImpl#appendTransaction after
the StateMachine (e.g. ContainerStateMachine) finishes building the
TransactionContext in StateMachine#startTransaction
** tryAcquirePendingRequest: acquires a permit based on the current size and
number of pending requests
*** If the pending request count or size limit is hit, a
ResourceUnavailableException should be thrown
** addPendingRequest: using the permit, adds the client request to the pending
requests
* Releasing pending requests: called when applying the log to the state machine
(meaning the transaction / log entry has been successfully committed)
** replyPendingRequest: releases the pending request's element and size permits
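The acquire/release pairing above can be sketched as a bounded semaphore. This is a simplified, hypothetical model for illustration only — the method roles (tryAcquirePendingRequest, replyPendingRequest) come from LeaderStateImpl as described above, but the class below is not the actual Ratis implementation, and it models only the element count, not the byte-size limit:

```java
import java.util.concurrent.Semaphore;

// Simplified, hypothetical model of the leader's pending-request limits.
public class PendingRequestLimitsSketch {
  private final Semaphore elementPermits;

  public PendingRequestLimitsSketch(int maxPendingRequests) {
    this.elementPermits = new Semaphore(maxPendingRequests);
  }

  // Models tryAcquirePendingRequest: fails fast when the limit is hit,
  // which is where a ResourceUnavailableException would be thrown.
  public boolean tryAcquire() {
    return elementPermits.tryAcquire();
  }

  // Models replyPendingRequest: releases the permit once the log entry has
  // been applied and the client replied to. If an error path (e.g. a WriteLog
  // timeout) never reaches this call, the permit is leaked.
  public void release() {
    elementPermits.release();
  }

  public int availablePermits() {
    return elementPermits.availablePermits();
  }
}
```

In this model, any failure between tryAcquire() and release() that is not paired with a release permanently shrinks the effective limit, which would match the stuck Pending Requests gauge observed above.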
After looking at the code for a bit, I suspect that a TimeoutIOException in
WriteLog does not release the leader's pending requests, which seems to cause
some client operations to block. However, I'm not entirely sure what the
expected behavior is when a TimeoutIOException is thrown (will it be retried
indefinitely?).
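For reference, the timeout in the stack trace comes from a bounded wait on the write future. The sketch below mirrors what IOUtils.getFromFuture appears to do based on the stack trace (it is an illustration, not the exact Ratis code): the TimeoutException is converted into an IOException, but the underlying future is not cancelled, so the caller takes the error path while anything keyed off the future's eventual completion may go unobserved:

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedGetSketch {
  // Wait on a future for a bounded time, converting TimeoutException into an
  // IOException (analogous to Ratis' TimeoutIOException). Note the future
  // itself is NOT cancelled: the slow write may still complete later, but the
  // caller has already given up on it.
  public static <T> T getFromFuture(CompletableFuture<T> future,
                                    long timeout, TimeUnit unit) throws IOException {
    try {
      return future.get(timeout, unit);
    } catch (TimeoutException e) {
      throw new IOException("Timeout " + timeout + " " + unit, e);
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}
```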
Our Ozone deployment is still using a Ratis version based on the 2.4.1 release.
was:
Recently there have been some I/O issues on our Ozone cluster's datanodes that cause
WriteLog to time out with the following exception:
{code:java}
2023-11-27 18:05:36,593
[dac3eec1-7886-4278-b0db-d6b62a046d44@group-2BEC613EA29B-SegmentedRaftLogWorker]
WARN org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: Timeout
0/~
org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s:
WriteLog:337065866: (t:7, i:337065866), STATEMACHINELOGENTRY, cmdType:
WriteChunk traceID: "" containerID: 3470188 datanodeUuid:
"18b90e19-6a7d-4858-a907-4acdf04a3949" pipelineID:
"9de85e00-968d-4b32-a589-2bec613ea29b" writeChunk { blockID { containerID:
3470188 localID: 107544269209449942 blockCommitSequenceId: 337065456 }
chunkData { chunkName: "107544269209449942_chunk_9" offset: 33554432 len:
4194304 checksumData { type: CRC32 bytesPerChecksum: 1048576 checksums:
"\224\316\216\327" checksums: "\326\223/\201" checksums: "O\3374\336"
checksums: "\313\231Bf" } } }, container
path=/mnt/disk/3/ozone/hdds/CID-354bf147-15b8-4d31-9ff0-c0bce64451cf/current/containerDir121/3470188-writeStateMachineData
    at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:89)
    at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StateMachineDataPolicy.getFromFuture(SegmentedRaftLogWorker.java:89)
    at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.execute(SegmentedRaftLogWorker.java:561)
    at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:321)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
    at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:80)
    ... 4 more {code}
This translates to state machine data write timeouts shown in the metrics
dashboard:
!image-2023-12-01-13-28-24-125.png!
After these exceptions, the Pending Requests gauge for the leader queue is
stuck and does not return to normal, even after I/O performance recovers:
!image-2023-12-01-12-04-24-409.png!
Meanwhile, the leader's pending request limit hit count did not increase, which
means that no client requests failed with ResourceUnavailableException
(although there were earlier instances when these limits were hit). It seems
that the client operations are simply blocked. We restarted the datanodes to
bring the pending requests back to normal.
My expectation is that once I/O performance recovers, the pending requests
should be removed, which would also release the RequestLimits semaphore. After
looking at the code for a while, I suspected that a TimeoutIOException in
WriteLog does not release the leader's pending requests, which seems to cause
some client operations to block. However, I'm not entirely sure what the
expected behavior is when a TimeoutIOException is thrown.
Our Ozone deployment is still using a Ratis version based on the 2.4.1 release.
> TimeoutIOException in WriteLog does not release Pending Requests
> ----------------------------------------------------------------
>
> Key: RATIS-1947
> URL: https://issues.apache.org/jira/browse/RATIS-1947
> Project: Ratis
> Issue Type: Bug
> Components: Leader
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: image-2023-12-01-12-04-24-409.png,
> image-2023-12-01-13-28-24-125.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)