Dmitry Konstantinov created CASSANDRA-21384:
-----------------------------------------------
Summary: Deadlock between GlobalLogFollower and GossipStage
Key: CASSANDRA-21384
URL: https://issues.apache.org/jira/browse/CASSANDRA-21384
Project: Apache Cassandra
Issue Type: Bug
Components: Transactional Cluster Metadata
Reporter: Dmitry Konstantinov
Attachments: stuck_dtest_log.txt.zip
Hi, I've noticed DecommissionAvoidReadTimeoutsTest test timeout:
[https://pre-ci.cassandra.apache.org/job/cassandra/543/pipeline-overview/?selected-node=292]
Here there is a full log: [^stuck_dtest_log.txt.zip]
It looks like we have a deadlock here:
{code:java}
[2026-05-17T13:39:14.911Z] [junit-timeout] "node3_GlobalLogFollower" #183
prio=5 os_prio=0 cpu=211.11ms elapsed=890.37s tid=0x00007fe9a423cb30 nid=0x33b6
waiting on condition [0x00007fe97819c000]
[2026-05-17T13:39:14.911Z] [junit-timeout] java.lang.Thread.State: WAITING
(parking)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
jdk.internal.misc.Unsafe.park([email protected]/Native Method)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:341)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:190)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:54)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:249)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.gms.Gossiper.runInGossipStageBlocking(Gossiper.java:544)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.gms.Gossiper.mergeNodeToGossip(Gossiper.java:2263)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.gms.Gossiper.mergeNodeToGossip(Gossiper.java:2248)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.tcm.listeners.LegacyStateListener.processChangesToLocalState(LegacyStateListener.java:160)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.tcm.listeners.LegacyStateListener.notifyPostCommit(LegacyStateListener.java:86)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.tcm.log.LocalLog.notifyPostCommit(LocalLog.java:618)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.tcm.log.LocalLog.processPendingInternal(LocalLog.java:532)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.tcm.log.LocalLog$Async$AsyncRunnable.run(LocalLog.java:800)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.concurrent.InfiniteLoopExecutor.loop(InfiniteLoopExecutor.java:119)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
org.apache.cassandra.concurrent.InfiniteLoopExecutor$$Lambda$4957/0x00007fea01311a10.run(Unknown
Source)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[2026-05-17T13:39:14.911Z] [junit-timeout] at
java.lang.Thread.run([email protected]/Thread.java:840)
[2026-05-17T13:39:14.911Z] [junit-timeout]
[2026-05-17T13:39:14.911Z] [junit-timeout] Locked ownable synchronizers:
[2026-05-17T13:39:14.911Z] [junit-timeout] - <0x00000000c81936a8> (a
java.util.concurrent.locks.ReentrantLock$NonfairSync)
[2026-05-17T13:39:14.911Z] [junit-timeout]
[2026-05-17T13:39:14.924Z] [junit-timeout] "node3_GossipStage:1" #303 daemon
prio=5 os_prio=0 cpu=27.30ms elapsed=888.43s tid=0x00007fe9a8019d90 nid=0x342e
waiting on condition [0x00007fe951cee000]
[2026-05-17T13:39:14.924Z] [junit-timeout] java.lang.Thread.State: WAITING
(parking)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
jdk.internal.misc.Unsafe.park([email protected]/Native Method)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:341)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:322)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:300)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:338)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitThrowUncheckedOnInterrupt(Awaitable.java:131)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitThrowUncheckedOnInterrupt(Awaitable.java:235)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.tcm.log.LocalLog$Async.runOnce(LocalLog.java:717)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.tcm.log.LocalLog.runOnce(LocalLog.java:437)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.tcm.log.LocalLog.waitForHighestConsecutive(LocalLog.java:356)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.tcm.PeerLogFetcher.lambda$fetchLogEntriesAndWaitInternal$1(PeerLogFetcher.java:103)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.tcm.PeerLogFetcher$$Lambda$14351/0x00007fea02cfb390.apply(Unknown
Source)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.AbstractFuture.lambda$map$0(AbstractFuture.java:345)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.AbstractFuture$$Lambda$12637/0x00007fea029a8e18.run(Unknown
Source)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:140)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.ListenerList.safeExecute(ListenerList.java:193)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:184)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.ListenerList$RunnableWithExecutor.notifySelf(ListenerList.java:372)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:151)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$7194/0x00007fea01ca3890.accept(Unknown
Source)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.IntrusiveStack$$Lambda$7196/0x00007fea01ca3ac8.accept(Unknown
Source)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:244)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:237)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:227)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:151)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:116)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.AsyncFuture.appendListener(AsyncFuture.java:122)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.AbstractFuture.addListener(AbstractFuture.java:433)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.AbstractFuture.map(AbstractFuture.java:342)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.AsyncFuture.map(AsyncFuture.java:144)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.utils.concurrent.Future.map(Future.java:166)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.tcm.PeerLogFetcher.fetchLogEntriesAndWaitInternal(PeerLogFetcher.java:101)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.tcm.PeerLogFetcher.lambda$asyncFetchLog$0(PeerLogFetcher.java:78)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.tcm.PeerLogFetcher$$Lambda$14347/0x00007fea02cfac90.get(Unknown
Source)
[2026-05-17T13:39:14.924Z] [junit-timeout] at
org.apache.cassandra.tcm.EpochAwareDebounce.getAsync(EpochAwareDebounce.java:83)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.tcm.PeerLogFetcher.asyncFetchLog(PeerLogFetcher.java:78)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.tcm.ClusterMetadataService.fetchLogFromPeerAsync(ClusterMetadataService.java:761)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.gms.GossipVerbHandler.doVerb(GossipVerbHandler.java:32)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:92)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:104)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.net.InboundSink$$Lambda$10290/0x00007fea022217a0.accept(Unknown
Source)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.net.InboundSink$Filtered.accept(InboundSink.java:72)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.net.InboundSink$Filtered.accept(InboundSink.java:58)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.net.InboundSink.accept(InboundSink.java:124)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.net.InboundSink.accept(InboundSink.java:53)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:460)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:138)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[2026-05-17T13:39:14.925Z] [junit-timeout] at
java.lang.Thread.run([email protected]/Thread.java:840)
[2026-05-17T13:39:14.925Z] [junit-timeout]
[2026-05-17T13:39:14.925Z] [junit-timeout] Locked ownable synchronizers:
[2026-05-17T13:39:14.925Z] [junit-timeout] - <0x00000000d7f82ff8> (a
java.util.concurrent.ThreadPoolExecutor$Worker)
{code}
An analysis from Claude (after few iterations):
h2. Root Cause: Circular deadlock between TCM log processing and GossipStage on
node3
The test never reaches the decommission phase. The cluster hangs during startup
because
of a circular wait between node3's TCM (Transactional Cluster Metadata) log
processor
thread and its GossipStage thread.
h3. The deadlock cycle
{code:java}
node3_GlobalLogFollower
→ processPendingInternal()
→ notifyPostCommit()
→ LegacyStateListener.processChangesToLocalState()
→ Gossiper.mergeNodeToGossip()
→ Gossiper.runInGossipStageBlocking() ← BLOCKS waiting for GossipStage
↑
│ waits for
│
node3_GossipStage:1 │
→ GossipVerbHandler.doVerb() │
→ ClusterMetadataService.fetchLogFromPeerAsync() │
→ PeerLogFetcher.fetchLogEntriesAndWaitInternal() │
→ LocalLog.waitForHighestConsecutive() │
→ LocalLog.runOnce() ← BLOCKS waiting for log to advance
│
can't advance because
GlobalLogFollower is stuck ↑
{code}
h3. Why this deadlocks
1. *node3_GossipStage* receives a gossip message
({{{}GossipVerbHandler.doVerb{}}}), discovers the peer has a newer TCM epoch,
and calls {{{}fetchLogFromPeerAsync(){}}}. This triggers a synchronous wait
({{{}waitForHighestConsecutive{}}}) for the local log to catch up to the
fetched epoch. It *occupies the single GossipStage thread* while waiting.
2. *node3_GlobalLogFollower* is the thread that processes TCM log entries. When
processing a new entry, it fires post-commit listeners. The
{{LegacyStateListener}} calls
{{Gossiper.runInGossipStageBlocking()}} which submits work to the GossipStage
and blocks waiting for it to complete.
3. Since GossipStage is already occupied by step 1, GlobalLogFollower blocks.
Since
GlobalLogFollower is blocked, the local log never advances. Since the log never
advances,
GossipStage's wait never completes. *Deadlock.*
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]