[
https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
guangbao zhao updated RATIS-2140:
---------------------------------
Description:
hi, [~szetszwo] I found a problem. In our service, when the leader notify the
follower of InstallSnapshot, the leader may cause the GrcpAppender thread to be
in the wait state due to timing issues, causing the installation of the
snapshot to fail, and triggering the follower to not receive the leader's
heartbeat within the specified timeout period to trigger the election.
The last log that triggered the exception
{code:java}
2024/08/17 19:36:19,068
[node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO
org.apache.ratis.grpc.server.GrpcLogAppender:
node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with
firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
2024/08/17 19:36:19,068
[node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO
org.apache.ratis.grpc.server.GrpcLogAppender:
node1@group-4F53D3317400->node2-GrpcLogAppender: send
node1->node2#0-t139,notify:(t:138, i:17159569079)
2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO
org.apache.ratis.grpc.server.GrpcLogAppender:
node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a
reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO
org.apache.ratis.grpc.server.GrpcLogAppender:
node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: InstallSnapshot
in progress.
2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
node1@group-4F53D3317400->node2-AppendLogResponseHandler: received
INCONSISTENCY reply with nextIndex 16857386183, errorCount=1,
request=AppendEntriesRequest:cid=11690239,entriesCount=0
2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO
org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: receive
requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, i:16857386182))
2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO
org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER:
reject PRE_VOTE from node2: this server is the leader and still has leadership
{code}
node2 grpc thread stack:
{code:java}
jstack 118659 | grep -A 12
node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
#885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on
condition [0x00007f85018db000] java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method) - parking to wait for
<0x00007f95477a9448> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48) at
org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
at
org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
at
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257) at
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
at
org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
Source) at java.lang.Thread.run(Thread.java:748) jstack 118659 | grep
-A 12
node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
#885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on
condition [0x00007f85018db000] java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method) - parking to wait for
<0x00007f95477a9448> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48) at
org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
at
org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
at
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257) at
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
at
org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
Source) at java.lang.Thread.run(Thread.java:748)
jstack 118659 | grep -A 12
node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
#885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on
condition [0x00007f85018db000] java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method) - parking to wait for
<0x00007f95477a9448> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48) at
org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
at
org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
at
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257) at
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
at
org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
Source) at java.lang.Thread.run(Thread.java:748) {code}
It can be seen from the stack that the GrpcLogAppender node2 thread has been in
the wait state, causing the above problems. through the code, we can see that
after appendLog, notifyLogAppender will be called to call
getEventAwaitForSignal().signal(). If there is no appendLog request at this
moment, it will always hang here.
And I also did a test, when I perform a random write operation, the snapshot
will be performed again
> Thread wati when installing snapshot
> ------------------------------------
>
> Key: RATIS-2140
> URL: https://issues.apache.org/jira/browse/RATIS-2140
> Project: Ratis
> Issue Type: Bug
> Reporter: guangbao zhao
> Priority: Major
>
> hi, [~szetszwo] I found a problem. In our service, when the leader notify the
> follower of InstallSnapshot, the leader may cause the GrcpAppender thread to
> be in the wait state due to timing issues, causing the installation of the
> snapshot to fail, and triggering the follower to not receive the leader's
> heartbeat within the specified timeout period to trigger the election.
> The last log that triggered the exception
> {code:java}
> 2024/08/17 19:36:19,068
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with
> firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
> 2024/08/17 19:36:19,068
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-GrpcLogAppender: send
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a
> reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler:
> InstallSnapshot in progress.
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node1@group-4F53D3317400->node2-AppendLogResponseHandler: received
> INCONSISTENCY reply with nextIndex 16857386183, errorCount=1,
> request=AppendEntriesRequest:cid=11690239,entriesCount=0
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400:
> receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138,
> i:16857386182))
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO
> org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER:
> reject PRE_VOTE from node2: this server is the leader and still has
> leadership {code}
> node2 grpc thread stack:
> {code:java}
> jstack 118659 | grep -A 12
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
> #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on
> condition [0x00007f85018db000] java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method) - parking to wait for
> <0x00007f95477a9448> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
> Source) at java.lang.Thread.run(Thread.java:748) jstack 118659 |
> grep -A 12
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
> #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on
> condition [0x00007f85018db000] java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method) - parking to wait for
> <0x00007f95477a9448> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
> Source) at java.lang.Thread.run(Thread.java:748)
> jstack 118659 | grep -A 12
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
> #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on
> condition [0x00007f85018db000] java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method) - parking to wait for
> <0x00007f95477a9448> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
> at
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
> at
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
> Source) at java.lang.Thread.run(Thread.java:748) {code}
> It can be seen from the stack that the GrpcLogAppender node2 thread has been
> in the wait state, causing the above problems. through the code, we can see
> that after appendLog, notifyLogAppender will be called to call
> getEventAwaitForSignal().signal(). If there is no appendLog request at this
> moment, it will always hang here.
> And I also did a test, when I perform a random write operation, the snapshot
> will be performed again
--
This message was sent by Atlassian Jira
(v8.20.10#820010)