Caleb Rackliffe created CASSANDRA-18733:
-------------------------------------------

             Summary: Waiting indefinitely on ReceivedMessage response in 
StreamSession#receive() can cause deadlock
                 Key: CASSANDRA-18733
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18733
             Project: Cassandra
          Issue Type: Bug
            Reporter: Caleb Rackliffe
            Assignee: Caleb Rackliffe


I've observed in a recent stack trace from a node running 4.1 what looks like a 
deadlock around the {{StreamSession}} monitor lock when 
{{StreamSession#receive()}} waits via {{syncUninteruptibly()}} for a response 
to a control message.

{noformat}
"Messaging-EventLoop-3-10" #320 daemon prio=5 os_prio=0 cpu=57979617.98ms 
elapsed=5587916.03s tid=0x00007f056e88ae00 nid=0x80ec waiting for monitor entry 
 [0x00007f056d277000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:524)
        - waiting to lock <0x00000006816fae70> (a 
org.apache.cassandra.streaming.StreamSession)
        at 
org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:690)
        at 
org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.onMessageComplete(StreamingMultiplexedChannel.java:264)
        at 
org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.lambda$sendMessage$1(StreamingMultiplexedChannel.java:233)
        at 
org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$$Lambda$2029/0x00000008007a0c40.operationComplete(Unknown
 Source)
        at 
org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:134)
        at 
org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:148)
        at 
org.apache.cassandra.utils.concurrent.ListenerList$GenericFutureListenerList.notifySelf(ListenerList.java:190)
        at 
org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
        at 
org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$950/0x0000000800666040.accept(Unknown
 Source)
        at 
org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
        at 
org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
        at 
org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
        at 
org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
        at 
org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
        at 
org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139)
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1009)
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:870)
        at 
io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
        at 
io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run([email protected]/Thread.java:829)
{noformat}

It seems that while {{receive()} is holding the monitor lock on 
{{StreamSession}}, the callback that executes on a different thread for the 
control message it sends carries an error. This error, when handled in 
{{onError()}}, then calls {{closeSession()}}, which tries to acquire the 
monitor lock already held in {{receive()}}.

{noformat}
"Stream-Deserializer-/100.70.229.6:7000-de724029" #1919000 daemon prio=5 
os_prio=0 cpu=224.66ms elapsed=1604976.92s tid=0x00007f0561c66500 nid=0xe2a2 
waiting on condition  [0x00007f0830947000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        at 
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:323)
        at 
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
        at 
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
        at 
org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
        at 
org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154)
        at 
org.apache.cassandra.utils.concurrent.AsyncPromise.await(AsyncPromise.java:244)
        at 
org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:127)
        at 
org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:34)
        at 
org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
        at 
org.apache.cassandra.utils.concurrent.AbstractFuture.awaitUninterruptibly(AbstractFuture.java:482)
        at 
org.apache.cassandra.utils.concurrent.AsyncPromise.awaitUninterruptibly(AsyncPromise.java:254)
        at 
org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:133)
        at 
org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:34)
        at 
org.apache.cassandra.utils.concurrent.Future.syncUninterruptibly(Future.java:94)
        at 
org.apache.cassandra.utils.concurrent.AsyncPromise.syncUninterruptibly(AsyncPromise.java:186)
        at 
org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:121)
        at 
org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:34)
        at 
org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:1054)
        at 
org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:628)
        - locked <0x00000006816fae70> (a 
org.apache.cassandra.streaming.StreamSession)
        at 
org.apache.cassandra.streaming.StreamDeserializingTask.run(StreamDeserializingTask.java:76)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run([email protected]/Thread.java:829)
{noformat}

The most straightforward way to fix this might be to put an upper bound on the 
wait in {{receive()}}, along w/ some logging at WARN if it times out. This 
would at least allow us to make progress eventually and close the session 
properly. The syncUninterruptibly() was intended to avoid a race during 
shutdown, IIRC, and unless the timeout is comically low, it shouldn't 
compromise that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to