[
https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
guangbao zhao updated RATIS-2140:
---------------------------------
Affects Version/s: 3.0.1
> Thread wati when installing snapshot
> ------------------------------------
>
> Key: RATIS-2140
> URL: https://issues.apache.org/jira/browse/RATIS-2140
> Project: Ratis
> Issue Type: Bug
> Affects Versions: 3.0.1
> Reporter: guangbao zhao
> Priority: Major
>
> hi, [~szetszwo] I found a problem. In our service, when the leader notify the
> follower of InstallSnapshot, the leader may cause the GrcpAppender thread to
> be in the wait state due to timing issues, causing the installation of the
> snapshot to fail, and triggering the follower to not receive the leader's
> heartbeat within the specified timeout period to trigger the election.
> The last log that triggered the exception
> {code:java}
> 2024/08/17 19:36:19,068
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with
> firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
> 2024/08/17 19:36:19,068
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-GrpcLogAppender: send
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a
> reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler:
> InstallSnapshot in progress.
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-AppendLogResponseHandler: received
> INCONSISTENCY reply with nextIndex 16857386183, errorCount=1,
> request=AppendEntriesRequest:cid=11690239,entriesCount=0
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400:
> receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138,
> i:16857386182))
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO
> org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER:
> reject PRE_VOTE from node2: this server is the leader and still has
> leadership {code}
> node2 grpc thread stack:
> {code:java}
> jstack 118659 | grep -A 12
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
> #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on
> condition [0x00007f85018db000] java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method) - parking to wait for
> <0x00007f95477a9448> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
> Source) at java.lang.Thread.run(Thread.java:748) jstack 118659 |
> grep -A 12
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
> #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on
> condition [0x00007f85018db000] java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method) - parking to wait for
> <0x00007f95477a9448> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
> Source) at java.lang.Thread.run(Thread.java:748)
> jstack 118659 | grep -A 12
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
> #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on
> condition [0x00007f85018db000] java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method) - parking to wait for
> <0x00007f95477a9448> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
> Source) at java.lang.Thread.run(Thread.java:748) {code}
> It can be seen from the stack that the GrpcLogAppender node2 thread has been
> in the wait state, causing the above problems. through the code, we can see
> that after appendLog, notifyLogAppender will be called to call
> getEventAwaitForSignal().signal(). If there is no appendLog request at this
> moment, it will always hang here.
> And I also did a test, when I perform a random write operation, the snapshot
> will be performed again
--
This message was sent by Atlassian Jira
(v8.20.10#820010)