[
https://issues.apache.org/jira/browse/CASSANDRA-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jon Meredith updated CASSANDRA-18733:
-------------------------------------
Status: Ready to Commit (was: Review In Progress)
Starting commit
CI Results (pending):
||Branch||Source||Circle CI||Jenkins||
|cassandra-4.1|[branch|https://github.com/jonmeredith/cassandra/tree/commit_remote_branch/CASSANDRA-18733-cassandra-4.1-0B9591D6-E2C9-4E53-83DF-85E573176C6E]|[build|https://app.circleci.com/pipelines/github/jonmeredith/cassandra?branch=commit_remote_branch%2FCASSANDRA-18733-cassandra-4.1-0B9591D6-E2C9-4E53-83DF-85E573176C6E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/2586/]|
|cassandra-5.0|[branch|https://github.com/jonmeredith/cassandra/tree/commit_remote_branch/CASSANDRA-18733-cassandra-5.0-0B9591D6-E2C9-4E53-83DF-85E573176C6E]|[build|https://app.circleci.com/pipelines/github/jonmeredith/cassandra?branch=commit_remote_branch%2FCASSANDRA-18733-cassandra-5.0-0B9591D6-E2C9-4E53-83DF-85E573176C6E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/2587/]|
|trunk|[branch|https://github.com/jonmeredith/cassandra/tree/commit_remote_branch/CASSANDRA-18733-trunk-0B9591D6-E2C9-4E53-83DF-85E573176C6E]|[build|https://app.circleci.com/pipelines/github/jonmeredith/cassandra?branch=commit_remote_branch%2FCASSANDRA-18733-trunk-0B9591D6-E2C9-4E53-83DF-85E573176C6E]|[build|unknown]|
> Waiting indefinitely on ReceivedMessage response in StreamSession#receive()
> can cause deadlock
> ----------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-18733
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18733
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair, Consistency/Streaming
> Reporter: Caleb Rackliffe
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> I've observed in a recent stack trace from a node running 4.1 what looks like
> a deadlock around the {{StreamSession}} monitor lock when
> {{StreamSession#receive()}} waits via {{syncUninteruptibly()}} for a response
> to a control message.
> {noformat}
> "Messaging-EventLoop-3-10" #320 daemon prio=5 os_prio=0 cpu=57979617.98ms
> elapsed=5587916.03s tid=0x00007f056e88ae00 nid=0x80ec waiting for monitor
> entry [0x00007f056d277000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:524)
> - waiting to lock <0x00000006816fae70> (a
> org.apache.cassandra.streaming.StreamSession)
> at
> org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:690)
> at
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.onMessageComplete(StreamingMultiplexedChannel.java:264)
> at
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.lambda$sendMessage$1(StreamingMultiplexedChannel.java:233)
> at
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$$Lambda$2029/0x00000008007a0c40.operationComplete(Unknown
> Source)
> at
> org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:134)
> at
> org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:148)
> at
> org.apache.cassandra.utils.concurrent.ListenerList$GenericFutureListenerList.notifySelf(ListenerList.java:190)
> at
> org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
> at
> org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$950/0x0000000800666040.accept(Unknown
> Source)
> at
> org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
> at
> org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
> at
> org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
> at
> org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
> at
> org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
> at
> org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139)
> at
> io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1009)
> at
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:870)
> at
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> at
> io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
> at
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> at
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run([email protected]/Thread.java:829)
> {noformat}
> It seems that while {{receive()} is holding the monitor lock on
> {{StreamSession}}, the callback that executes on a different thread for the
> control message it sends carries an error. This error, when handled in
> {{onError()}}, then calls {{closeSession()}}, which tries to acquire the
> monitor lock already held in {{receive()}}.
> {noformat}
> "Stream-Deserializer-/100.70.229.6:7000-de724029" #1919000 daemon prio=5
> os_prio=0 cpu=224.66ms elapsed=1604976.92s tid=0x00007f0561c66500 nid=0xe2a2
> waiting on condition [0x00007f0830947000]
> java.lang.Thread.State: WAITING (parking)
> at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
> at
> java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:323)
> at
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
> at
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
> at
> org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
> at
> org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154)
> at
> org.apache.cassandra.utils.concurrent.AsyncPromise.await(AsyncPromise.java:244)
> at
> org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:127)
> at
> org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:34)
> at
> org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
> at
> org.apache.cassandra.utils.concurrent.AbstractFuture.awaitUninterruptibly(AbstractFuture.java:482)
> at
> org.apache.cassandra.utils.concurrent.AsyncPromise.awaitUninterruptibly(AsyncPromise.java:254)
> at
> org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:133)
> at
> org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:34)
> at
> org.apache.cassandra.utils.concurrent.Future.syncUninterruptibly(Future.java:94)
> at
> org.apache.cassandra.utils.concurrent.AsyncPromise.syncUninterruptibly(AsyncPromise.java:186)
> at
> org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:121)
> at
> org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:34)
> at
> org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:1054)
> at
> org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:628)
> - locked <0x00000006816fae70> (a
> org.apache.cassandra.streaming.StreamSession)
> at
> org.apache.cassandra.streaming.StreamDeserializingTask.run(StreamDeserializingTask.java:76)
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run([email protected]/Thread.java:829)
> {noformat}
> The most straightforward way to fix this might be to put an upper bound on
> the wait in {{receive()}}, along w/ some logging at WARN if it times out.
> This would at least allow us to make progress eventually and close the
> session properly. The syncUninterruptibly() was intended to avoid a race
> during shutdown, IIRC, and unless the timeout is comically low, it shouldn't
> compromise that.
> The problem can usually be fixed by bouncing affected nodes, who will usually
> present w/ an increasing backlog of unrepaired data and a log message that
> looks something like this, in addition to a number of streaming errors:
> {noformat}
> INFO 2023-08-04T10:45:54,845 [NettyStreaming-Outbound-/<ip:port>]
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel:359 -
> [Stream #10465c10-3108-11ee-af8a-3f74fd21ad9d] waiting to acquire a permit to
> begin streaming <data directory>-nb-101920-big-Data.db. This message logs
> every 3 minutes
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]