[
https://issues.apache.org/jira/browse/RATIS-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860367#comment-17860367
]
Haibo Sun commented on RATIS-2116:
----------------------------------
[~szetszwo] https://github.com/apache/ratis/pull/1116
> Follower state synchronization is blocked
> -----------------------------------------
>
> Key: RATIS-2116
> URL: https://issues.apache.org/jira/browse/RATIS-2116
> Project: Ratis
> Issue Type: Bug
> Affects Versions: 3.0.0, 2.5.1, 3.0.1
> Reporter: Haibo Sun
> Priority: Major
> Attachments: debug.log
>
>
> Using version 2.5.1, we have found that in some cases the follower's state
> synchronization is blocked permanently.
> Scenario: When the task queue of the SegmentedRaftLogWorker has the pattern
> (WriteLog, WriteLog, ..., PurgeLog), the last WriteLog from
> RaftServerImpl.appendEntries does not immediately flush data and complete the
> result future, because a PurgeLog task is still pending in the queue. Instead,
> it enqueues the result future to be completed after a later WriteLog flushes
> data. However, the "nioEventLoopGroup-3-1" thread is already blocked waiting
> on that future, so it will not add a new WriteLog to the task queue of the
> SegmentedRaftLogWorker. This leads to a deadlock and stops state
> synchronization.
> I confirmed this by adding debug logs; detailed information is attached
> below. The issue can be reproduced easily by increasing the frequency of
> TakeSnapshot and PurgeLog operations. In addition, after checking the code in
> the master branch, I found that the issue still exists there.
>
> *jstack:*
> {code:java}
> "nioEventLoopGroup-3-1" #58 prio=10 os_prio=0 tid=0x00007fc58400b800 nid=0x5493a waiting on condition [0x00007fc5b4f28000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park0(Native Method)
>         - parking to wait for <0x00007fd86a4685e8> (a java.util.concurrent.CompletableFuture$Signaller)
>         at sun.misc.Unsafe.park(Unsafe.java:1025)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:176)
>         at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>         at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>         at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>         at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934)
>         at org.apache.ratis.server.impl.RaftServerImpl.appendEntries(RaftServerImpl.java:1379)
>         at org.apache.ratis.server.impl.RaftServerProxy.appendEntries(RaftServerProxy.java:649)
>         at org.apache.ratis.netty.server.NettyRpcService.handle(NettyRpcService.java:231)
>         at org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:95)
>         at org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:91)
>         at org.apache.ratis.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>         at org.apache.ratis.thirdparty.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>         at org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
>         at org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>         at org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>         at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>         at org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>         at org.apache.ratis.thirdparty.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>         at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>         at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>         at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>         at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>         at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>         at org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(Thread.java:882)
> {code}
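
For readers who want to see the queue pattern from the description in isolation, here is a minimal single-threaded sketch. It is not the Ratis code: the class FlushDeadlockSketch, its Worker, Task and Kind types, and the rule "flush only when no other task is waiting" are all hypothetical simplifications of what the issue describes. It only shows why the last WriteLog future stays incomplete when a PurgeLog is queued behind it, and why a later WriteLog would release it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified model of the scenario; these are NOT the Ratis classes.
class FlushDeadlockSketch {
  enum Kind { WRITE_LOG, PURGE_LOG }

  static final class Task {
    final Kind kind;
    final CompletableFuture<Void> future = new CompletableFuture<>();

    Task(Kind kind) {
      this.kind = kind;
    }
  }

  static final class Worker {
    private final Queue<Task> queue = new ArrayDeque<>();
    // Writes that were executed but whose futures still wait for a flush.
    private final List<Task> unflushed = new ArrayList<>();

    Task submit(Kind kind) {
      Task task = new Task(kind);
      queue.add(task);
      return task;
    }

    // Runs every queued task in order on the calling thread.
    void drain() {
      Task task;
      while ((task = queue.poll()) != null) {
        if (task.kind == Kind.WRITE_LOG) {
          unflushed.add(task);
          // Assumed batching rule: flush only when no other task is waiting.
          if (queue.isEmpty()) {
            flush();
          }
        } else {
          // PurgeLog completes on its own but does not trigger a flush.
          task.future.complete(null);
        }
      }
    }

    private void flush() {
      for (Task write : unflushed) {
        write.future.complete(null);
      }
      unflushed.clear();
    }
  }

  public static void main(String[] args) {
    Worker worker = new Worker();
    worker.submit(Kind.WRITE_LOG);
    Task lastWrite = worker.submit(Kind.WRITE_LOG);
    worker.submit(Kind.PURGE_LOG);
    worker.drain();
    // A caller that join()s on lastWrite.future at this point would never return.
    System.out.println("last WriteLog done after drain: " + lastWrite.future.isDone());

    // Only a later WriteLog triggers the flush that completes the earlier futures.
    worker.submit(Kind.WRITE_LOG);
    worker.drain();
    System.out.println("last WriteLog done after a later WriteLog: " + lastWrite.future.isDone());
  }
}
```

In this model the first line printed reports false and the second reports true. In the reported case the thread that would submit the later WriteLog is the one parked in CompletableFuture.join, so the second step never happens.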