[ 
https://issues.apache.org/jira/browse/CASSANDRA-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-18733:
-----------------------------------------
    Fix Version/s: 5.1
                       (was: 5.x)

> Waiting indefinitely on ReceivedMessage response in StreamSession#receive() 
> can cause deadlock
> ----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18733
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18733
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair, Consistency/Streaming
>            Reporter: Caleb Rackliffe
>            Assignee: Jon Meredith
>            Priority: Normal
>             Fix For: 4.1.4, 5.0, 5.1
>
>
> I've observed in a recent stack trace from a node running 4.1 what looks like 
> a deadlock around the {{StreamSession}} monitor lock when 
> {{StreamSession#receive()}} waits via {{syncUninteruptibly()}} for a response 
> to a control message.
> {noformat}
> "Messaging-EventLoop-3-10" #320 daemon prio=5 os_prio=0 cpu=57979617.98ms 
> elapsed=5587916.03s tid=0x00007f056e88ae00 nid=0x80ec waiting for monitor 
> entry  [0x00007f056d277000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:524)
>         - waiting to lock <0x00000006816fae70> (a 
> org.apache.cassandra.streaming.StreamSession)
>         at 
> org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:690)
>         at 
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.onMessageComplete(StreamingMultiplexedChannel.java:264)
>         at 
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.lambda$sendMessage$1(StreamingMultiplexedChannel.java:233)
>         at 
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$$Lambda$2029/0x00000008007a0c40.operationComplete(Unknown
>  Source)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:134)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:148)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList$GenericFutureListenerList.notifySelf(ListenerList.java:190)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$950/0x0000000800666040.accept(Unknown
>  Source)
>         at 
> org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
>         at 
> org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139)
>         at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1009)
>         at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:870)
>         at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>         at 
> io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>         at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
> {noformat}
> It seems that while {{receive()} is holding the monitor lock on 
> {{StreamSession}}, the callback that executes on a different thread for the 
> control message it sends carries an error. This error, when handled in 
> {{onError()}}, then calls {{closeSession()}}, which tries to acquire the 
> monitor lock already held in {{receive()}}.
> {noformat}
> "Stream-Deserializer-/100.70.229.6:7000-de724029" #1919000 daemon prio=5 
> os_prio=0 cpu=224.66ms elapsed=1604976.92s tid=0x00007f0561c66500 nid=0xe2a2 
> waiting on condition  [0x00007f0830947000]
>    java.lang.Thread.State: WAITING (parking)
>         at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
>         at 
> java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:323)
>         at 
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
>         at 
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
>         at 
> org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncPromise.await(AsyncPromise.java:244)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:127)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:34)
>         at 
> org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
>         at 
> org.apache.cassandra.utils.concurrent.AbstractFuture.awaitUninterruptibly(AbstractFuture.java:482)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncPromise.awaitUninterruptibly(AsyncPromise.java:254)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:133)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:34)
>         at 
> org.apache.cassandra.utils.concurrent.Future.syncUninterruptibly(Future.java:94)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncPromise.syncUninterruptibly(AsyncPromise.java:186)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:121)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:34)
>         at 
> org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:1054)
>         at 
> org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:628)
>         - locked <0x00000006816fae70> (a 
> org.apache.cassandra.streaming.StreamSession)
>         at 
> org.apache.cassandra.streaming.StreamDeserializingTask.run(StreamDeserializingTask.java:76)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
> {noformat}
> The most straightforward way to fix this might be to put an upper bound on 
> the wait in {{receive()}}, along w/ some logging at WARN if it times out. 
> This would at least allow us to make progress eventually and close the 
> session properly. The syncUninterruptibly() was intended to avoid a race 
> during shutdown, IIRC, and unless the timeout is comically low, it shouldn't 
> compromise that.
> The problem can usually be fixed by bouncing affected nodes, who will usually 
> present w/ an increasing backlog of unrepaired data and a log message that 
> looks something like this, in addition to a number of streaming errors:
> {noformat}
> INFO  2023-08-04T10:45:54,845 [NettyStreaming-Outbound-/<ip:port>] 
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel:359 - 
> [Stream #10465c10-3108-11ee-af8a-3f74fd21ad9d] waiting to acquire a permit to 
> begin streaming <data directory>-nb-101920-big-Data.db. This message logs 
> every 3 minutes
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to