[ 
https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guangbao zhao updated RATIS-2140:
---------------------------------
    Summary: Thread wait when installing snapshot  (was: Thread wati when 
installing snapshot)

> Thread wait when installing snapshot
> ------------------------------------
>
>                 Key: RATIS-2140
>                 URL: https://issues.apache.org/jira/browse/RATIS-2140
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 3.0.1
>            Reporter: guangbao zhao
>            Priority: Major
>
> hi, [~szetszwo] I found a problem. In our service, when the leader notify the 
> follower of InstallSnapshot, the leader may cause the GrcpAppender thread to 
> be in the wait state due to timing issues, causing the installation of the 
> snapshot to fail, and triggering the follower to not receive the leader's 
> heartbeat within the specified timeout period to trigger the election.
> The last log that triggered the exception
> {code:java}
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with 
> firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: send 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a 
> reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: 
> InstallSnapshot in progress.
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-AppendLogResponseHandler: received 
> INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, 
> request=AppendEntriesRequest:cid=11690239,entriesCount=0
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, 
> i:16857386182))
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: 
> reject PRE_VOTE from node2: this server is the leader and still has 
> leadership {code}
> node2 grpc thread stack:
> {code:java}
> jstack 118659 | grep -A 12 
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
>  #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
> condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)    
> at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
> <0x00007f95477a9448> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   
> at 
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
>  at 
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
>        at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   
> at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>       at 
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
>  Source)  at java.lang.Thread.run(Thread.java:748)        jstack 118659 | 
> grep -A 12 
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
>  #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
> condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)    
>   at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
> <0x00007f95477a9448> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   
> at 
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
>  at 
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
>        at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   
> at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>       at 
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
>  Source)  at java.lang.Thread.run(Thread.java:748)
> jstack 118659 | grep -A 12 
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
>  #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
> condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)    
> at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
> <0x00007f95477a9448> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   
> at 
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
>  at 
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
>        at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   
> at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>       at 
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
>  Source)  at java.lang.Thread.run(Thread.java:748) {code}
> It can be seen from the stack that the GrpcLogAppender node2 thread has been 
> in the wait state, causing the above problems. through the code, we can see 
> that after appendLog, notifyLogAppender will be called to call 
> getEventAwaitForSignal().signal(). If there is no appendLog request at this 
> moment, it will always hang here.
> And I also did a test, when I perform a random write operation, the snapshot 
> will be performed again



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to