[
https://issues.apache.org/jira/browse/RATIS-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029589#comment-18029589
]
Tsz-wo Sze commented on RATIS-2345:
-----------------------------------
The stack traces in [^result-3-18450620115-split-2.zip] show the deadlock:
{code}
2025-10-12 23:01:12,060 [Timer-1] INFO netty.TestRaftAsyncWithNetty (RaftBasicTests.java:lambda$run$1(371)) - "s1@group-680D12652185-FollowerState" Id=170 BLOCKED on org.apache.ratis.server.impl.RaftServerImpl@23e662ef owned by "s1@group-680D12652185-LeaderStateImpl" Id=144
  at org.apache.ratis.server.impl.FollowerState.runImpl(FollowerState.java:160)
  - blocked on org.apache.ratis.server.impl.RaftServerImpl@23e662ef
  at org.apache.ratis.server.impl.FollowerState.run(FollowerState.java:130)

2025-10-12 23:01:12,061 [Timer-1] INFO netty.TestRaftAsyncWithNetty (RaftBasicTests.java:lambda$run$1(371)) - "s1@group-680D12652185->s0-LogAppenderDefault-LogAppenderDaemon" Id=145 BLOCKED on org.apache.ratis.server.impl.RaftServerImpl@23e662ef owned by "s1@group-680D12652185-LeaderStateImpl" Id=144
  at org.apache.ratis.server.leader.LogAppender.onFollowerTerm(LogAppender.java:213)
  - blocked on org.apache.ratis.server.impl.RaftServerImpl@23e662ef
  at org.apache.ratis.server.leader.LogAppenderDefault.handleReply(LogAppenderDefault.java:197)
  at org.apache.ratis.server.leader.LogAppenderDefault.run(LogAppenderDefault.java:165)
  at org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
  at org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$1040/1375983778.run(Unknown Source)
  at java.lang.Thread.run(Thread.java:750)

// holding 23e662ef
2025-10-12 23:01:12,061 [Timer-1] INFO netty.TestRaftAsyncWithNetty (RaftBasicTests.java:lambda$run$1(371)) - "s1@group-680D12652185-LeaderStateImpl" Id=144 WAITING on java.util.concurrent.CompletableFuture$Signaller@49255ac3
  at sun.misc.Unsafe.park(Native Method)
  - waiting on java.util.concurrent.CompletableFuture$Signaller@49255ac3
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
  at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
  at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
  at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
  at org.apache.ratis.server.impl.LeaderStateImpl.stepDown(LeaderStateImpl.java:706)
  at org.apache.ratis.server.impl.LeaderStateImpl.lambda$submitStepDownEvent$12(LeaderStateImpl.java:700)
{code}
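The pattern in these traces (one thread holds a lock and joins a future that can only complete after another thread acquires the same lock) can be reproduced in isolation. The following is only a minimal sketch and not Ratis code: the class StepDownDeadlockSketch, the method joinTimesOut and all variable names are hypothetical. It uses a ReentrantLock in place of the RaftServerImpl monitor and a timed get(..) in place of join() so that the program terminates instead of hanging.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

public class StepDownDeadlockSketch {
  /**
   * Returns true if waiting for the future timed out.
   * All names here are hypothetical stand-ins, not Ratis APIs.
   */
  static boolean joinTimesOut(boolean holdLockWhileJoining) throws Exception {
    // Stands in for the RaftServerImpl monitor (assumption: a plain lock).
    final ReentrantLock serverLock = new ReentrantLock();
    final CountDownLatch leaderHoldsLock = new CountDownLatch(1);
    // Stands in for the future that completes once the appender has stopped.
    final CompletableFuture<Void> appenderStopped = new CompletableFuture<>();

    // Stands in for a LogAppender that needs the server lock before it can stop.
    final Thread appender = new Thread(() -> {
      try {
        leaderHoldsLock.await();
        serverLock.lockInterruptibly();
        try {
          // The work that requires the lock would happen here.
        } finally {
          serverLock.unlock();
        }
        appenderStopped.complete(null);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    appender.setDaemon(true);
    appender.start();

    // Stands in for the step-down thread.
    serverLock.lock();
    boolean locked = true;
    try {
      leaderHoldsLock.countDown();
      if (!holdLockWhileJoining) {
        // Release the lock before waiting, so the appender can make progress.
        serverLock.unlock();
        locked = false;
      }
      try {
        // A timed get(..) replaces join() so that the sketch cannot hang.
        appenderStopped.get(500, TimeUnit.MILLISECONDS);
        return false;
      } catch (TimeoutException e) {
        return true;
      }
    } finally {
      if (locked) {
        serverLock.unlock();
      }
      appender.interrupt();
      appender.join();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("join while holding lock timed out: " + joinTimesOut(true));
    System.out.println("join after releasing lock timed out: " + joinTimesOut(false));
  }
}
```

In this sketch the wait is expected to time out only when the lock is held across it. With an untimed join(), as in the LeaderStateImpl.stepDown frame above, the same situation would block indefinitely.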
> Leader stepDown could cause a deadlock
> --------------------------------------
>
> Key: RATIS-2345
> URL: https://issues.apache.org/jira/browse/RATIS-2345
> Project: Ratis
> Issue Type: Bug
> Components: Leader
> Reporter: Tsz-wo Sze
> Assignee: Tsz-wo Sze
> Priority: Major
> Attachments: result-3-18450620115-split-2.zip
>
>
> Leader stepDown could cause a deadlock:
> - LeaderStateImpl.stepDown(..), which holds the RaftServerImpl lock, joins
> the future returned from server.changeToFollowerAndPersistMetadata(..)
> -- the future completes after RoleInfo.shutdownLeaderState(..),
> -- which calls LeaderStateImpl.stop()
> -- which waits for all LogAppenders to stop.
> - However, a LogAppender may wait for the RaftServerImpl lock in
> LogAppender.onFollowerTerm(..)
> -----
> (Original description)
> The 10x10 run below had 3/100 failures, all of which failed with a timeout.
> - https://github.com/apache/ratis/actions/runs/18450620115/job/52563900327