[ 
https://issues.apache.org/jira/browse/RATIS-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029589#comment-18029589
 ] 

Tsz-wo Sze commented on RATIS-2345:
-----------------------------------

The stack traces in [^result-3-18450620115-split-2.zip] show the deadlock:
{code}
2025-10-12 23:01:12,060 [Timer-1] INFO  netty.TestRaftAsyncWithNetty (RaftBasicTests.java:lambda$run$1(371)) - "s1@group-680D12652185-FollowerState" Id=170 BLOCKED on org.apache.ratis.server.impl.RaftServerImpl@23e662ef owned by "s1@group-680D12652185-LeaderStateImpl" Id=144
        at org.apache.ratis.server.impl.FollowerState.runImpl(FollowerState.java:160)
        -  blocked on org.apache.ratis.server.impl.RaftServerImpl@23e662ef
        at org.apache.ratis.server.impl.FollowerState.run(FollowerState.java:130)


2025-10-12 23:01:12,061 [Timer-1] INFO  netty.TestRaftAsyncWithNetty (RaftBasicTests.java:lambda$run$1(371)) - "s1@group-680D12652185->s0-LogAppenderDefault-LogAppenderDaemon" Id=145 BLOCKED on org.apache.ratis.server.impl.RaftServerImpl@23e662ef owned by "s1@group-680D12652185-LeaderStateImpl" Id=144
        at org.apache.ratis.server.leader.LogAppender.onFollowerTerm(LogAppender.java:213)
        -  blocked on org.apache.ratis.server.impl.RaftServerImpl@23e662ef
        at org.apache.ratis.server.leader.LogAppenderDefault.handleReply(LogAppenderDefault.java:197)
        at org.apache.ratis.server.leader.LogAppenderDefault.run(LogAppenderDefault.java:165)
        at org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
        at org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$1040/1375983778.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:750)


// holding 23e662ef
2025-10-12 23:01:12,061 [Timer-1] INFO  netty.TestRaftAsyncWithNetty (RaftBasicTests.java:lambda$run$1(371)) - "s1@group-680D12652185-LeaderStateImpl" Id=144 WAITING on java.util.concurrent.CompletableFuture$Signaller@49255ac3
        at sun.misc.Unsafe.park(Native Method)
        -  waiting on java.util.concurrent.CompletableFuture$Signaller@49255ac3
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
        at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
        at org.apache.ratis.server.impl.LeaderStateImpl.stepDown(LeaderStateImpl.java:706)
        at org.apache.ratis.server.impl.LeaderStateImpl.lambda$submitStepDownEvent$12(LeaderStateImpl.java:700)
{code}
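
The cycle in these traces (one thread holds the RaftServerImpl monitor while joining a future, and the thread that must finish before that future completes is blocked on the same monitor) can be reproduced in isolation. The following is only a minimal sketch, not Ratis code: the class StepDownDeadlockSketch, the serverLock field, and all method names are hypothetical stand-ins, and a timed wait replaces the unbounded join() so that the program terminates. The second method shows one possible way to break the cycle (waiting after the lock is released); it is not necessarily the fix adopted for this issue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StepDownDeadlockSketch {
  // Hypothetical stand-in for the RaftServerImpl monitor.
  private final Object serverLock = new Object();

  // Starts an "appender" thread that needs serverLock before it can stop,
  // similar in spirit to LogAppender.onFollowerTerm(..) in the traces.
  // The returned future completes only once the thread has finished.
  private CompletableFuture<Void> startAppender() {
    final CompletableFuture<Void> stopped = new CompletableFuture<>();
    final Thread t = new Thread(() -> {
      synchronized (serverLock) {
        // Work that requires the server lock would go here.
      }
      stopped.complete(null);
    }, "appender");
    t.setDaemon(true);
    t.start();
    return stopped;
  }

  // Waits for the future; returns false on timeout instead of blocking forever.
  private static boolean await(CompletableFuture<Void> f, long timeoutMs)
      throws Exception {
    try {
      f.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    }
  }

  // Problematic pattern: wait for the appender while still holding the lock.
  // The appender cannot enter its synchronized block, so the wait times out.
  public boolean stepDownJoiningInsideLock(long timeoutMs) throws Exception {
    synchronized (serverLock) {
      return await(startAppender(), timeoutMs);
    }
  }

  // One possible alternative: obtain the future under the lock,
  // release the lock, and only then wait for the appender.
  public boolean stepDownJoiningOutsideLock(long timeoutMs) throws Exception {
    final CompletableFuture<Void> stopped;
    synchronized (serverLock) {
      stopped = startAppender();
    }
    return await(stopped, timeoutMs);
  }

  public static void main(String[] args) throws Exception {
    final StepDownDeadlockSketch s = new StepDownDeadlockSketch();
    System.out.println("join inside lock completed:  "
        + s.stepDownJoiningInsideLock(500));
    System.out.println("join outside lock completed: "
        + s.stepDownJoiningOutsideLock(5000));
  }
}
```

With an unbounded join() in place of the timed wait, the first method would hang in the same way as the LeaderStateImpl thread (Id=144) above.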

> Leader stepDown could cause a deadlock
> --------------------------------------
>
>                 Key: RATIS-2345
>                 URL: https://issues.apache.org/jira/browse/RATIS-2345
>             Project: Ratis
>          Issue Type: Bug
>          Components: Leader
>            Reporter: Tsz-wo Sze
>            Assignee: Tsz-wo Sze
>            Priority: Major
>         Attachments: result-3-18450620115-split-2.zip
>
>
> Leader stepDown could cause a deadlock:
> - LeaderStateImpl.stepDown(..), which holds the RaftServerImpl lock, joins 
> the future returned from server.changeToFollowerAndPersistMetadata(..) 
> -- the future completes after RoleInfo.shutdownLeaderState(..),
> -- which calls LeaderStateImpl.stop()
> -- which waits for all LogAppenders to stop.
> - However, a LogAppender may wait for the RaftServerImpl lock in 
> LogAppender.onFollowerTerm(..), so the future never completes.
> -----
> (Original description)
> The 10x10 run below had 3/100 failures, all of which timed out.
> - https://github.com/apache/ratis/actions/runs/18450620115/job/52563900327



