Dmitry Konstantinov created CASSANDRA-21384:
-----------------------------------------------

             Summary: Deadlock between GlobalLogFollower and GossipStage
                 Key: CASSANDRA-21384
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21384
             Project: Apache Cassandra
          Issue Type: Bug
          Components: Transactional Cluster Metadata
            Reporter: Dmitry Konstantinov
         Attachments: stuck_dtest_log.txt.zip

Hi, I've noticed DecommissionAvoidReadTimeoutsTest test timeout: 
[https://pre-ci.cassandra.apache.org/job/cassandra/543/pipeline-overview/?selected-node=292]

Here there is a full log: [^stuck_dtest_log.txt.zip]

It looks like we have a deadlock here:
{code:java}
[2026-05-17T13:39:14.911Z] [junit-timeout] "node3_GlobalLogFollower" #183 
prio=5 os_prio=0 cpu=211.11ms elapsed=890.37s tid=0x00007fe9a423cb30 nid=0x33b6 
waiting on condition  [0x00007fe97819c000]
[2026-05-17T13:39:14.911Z] [junit-timeout]    java.lang.Thread.State: WAITING 
(parking)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
jdk.internal.misc.Unsafe.park([email protected]/Native Method)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:341)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:190)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:54)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:249)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.gms.Gossiper.runInGossipStageBlocking(Gossiper.java:544)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.gms.Gossiper.mergeNodeToGossip(Gossiper.java:2263)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.gms.Gossiper.mergeNodeToGossip(Gossiper.java:2248)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.tcm.listeners.LegacyStateListener.processChangesToLocalState(LegacyStateListener.java:160)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.tcm.listeners.LegacyStateListener.notifyPostCommit(LegacyStateListener.java:86)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.tcm.log.LocalLog.notifyPostCommit(LocalLog.java:618)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.tcm.log.LocalLog.processPendingInternal(LocalLog.java:532)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.tcm.log.LocalLog$Async$AsyncRunnable.run(LocalLog.java:800)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.concurrent.InfiniteLoopExecutor.loop(InfiniteLoopExecutor.java:119)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
org.apache.cassandra.concurrent.InfiniteLoopExecutor$$Lambda$4957/0x00007fea01311a10.run(Unknown
 Source)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[2026-05-17T13:39:14.911Z] [junit-timeout]      at 
java.lang.Thread.run([email protected]/Thread.java:840)
[2026-05-17T13:39:14.911Z] [junit-timeout] 
[2026-05-17T13:39:14.911Z] [junit-timeout]    Locked ownable synchronizers:
[2026-05-17T13:39:14.911Z] [junit-timeout]      - <0x00000000c81936a8> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
[2026-05-17T13:39:14.911Z] [junit-timeout] 




[2026-05-17T13:39:14.924Z] [junit-timeout] "node3_GossipStage:1" #303 daemon 
prio=5 os_prio=0 cpu=27.30ms elapsed=888.43s tid=0x00007fe9a8019d90 nid=0x342e 
waiting on condition  [0x00007fe951cee000]
[2026-05-17T13:39:14.924Z] [junit-timeout]    java.lang.Thread.State: WAITING 
(parking)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
jdk.internal.misc.Unsafe.park([email protected]/Native Method)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:341)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:322)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:300)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:338)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitThrowUncheckedOnInterrupt(Awaitable.java:131)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitThrowUncheckedOnInterrupt(Awaitable.java:235)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.tcm.log.LocalLog$Async.runOnce(LocalLog.java:717)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.tcm.log.LocalLog.runOnce(LocalLog.java:437)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.tcm.log.LocalLog.waitForHighestConsecutive(LocalLog.java:356)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.tcm.PeerLogFetcher.lambda$fetchLogEntriesAndWaitInternal$1(PeerLogFetcher.java:103)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.tcm.PeerLogFetcher$$Lambda$14351/0x00007fea02cfb390.apply(Unknown
 Source)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.AbstractFuture.lambda$map$0(AbstractFuture.java:345)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.AbstractFuture$$Lambda$12637/0x00007fea029a8e18.run(Unknown
 Source)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:140)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.ListenerList.safeExecute(ListenerList.java:193)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:184)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.ListenerList$RunnableWithExecutor.notifySelf(ListenerList.java:372)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:151)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$7194/0x00007fea01ca3890.accept(Unknown
 Source)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.IntrusiveStack$$Lambda$7196/0x00007fea01ca3ac8.accept(Unknown
 Source)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:244)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:237)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:227)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:151)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:116)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.AsyncFuture.appendListener(AsyncFuture.java:122)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.AbstractFuture.addListener(AbstractFuture.java:433)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.AbstractFuture.map(AbstractFuture.java:342)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.AsyncFuture.map(AsyncFuture.java:144)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.utils.concurrent.Future.map(Future.java:166)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.tcm.PeerLogFetcher.fetchLogEntriesAndWaitInternal(PeerLogFetcher.java:101)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.tcm.PeerLogFetcher.lambda$asyncFetchLog$0(PeerLogFetcher.java:78)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.tcm.PeerLogFetcher$$Lambda$14347/0x00007fea02cfac90.get(Unknown
 Source)
[2026-05-17T13:39:14.924Z] [junit-timeout]      at 
org.apache.cassandra.tcm.EpochAwareDebounce.getAsync(EpochAwareDebounce.java:83)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.tcm.PeerLogFetcher.asyncFetchLog(PeerLogFetcher.java:78)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.tcm.ClusterMetadataService.fetchLogFromPeerAsync(ClusterMetadataService.java:761)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.gms.GossipVerbHandler.doVerb(GossipVerbHandler.java:32)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:92)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:104)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.net.InboundSink$$Lambda$10290/0x00007fea022217a0.accept(Unknown
 Source)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.net.InboundSink$Filtered.accept(InboundSink.java:72)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.net.InboundSink$Filtered.accept(InboundSink.java:58)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.net.InboundSink.accept(InboundSink.java:124)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.net.InboundSink.accept(InboundSink.java:53)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:460)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:138)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[2026-05-17T13:39:14.925Z] [junit-timeout]      at 
java.lang.Thread.run([email protected]/Thread.java:840)
[2026-05-17T13:39:14.925Z] [junit-timeout] 
[2026-05-17T13:39:14.925Z] [junit-timeout]    Locked ownable synchronizers:
[2026-05-17T13:39:14.925Z] [junit-timeout]      - <0x00000000d7f82ff8> (a 
java.util.concurrent.ThreadPoolExecutor$Worker)
{code}
 

An analysis from Claude (after few iterations):
h2. Root Cause: Circular deadlock between TCM log processing and GossipStage on 
node3

The test never reaches the decommission phase. The cluster hangs during startup 
because
of a circular wait between node3's TCM (Transactional Cluster Metadata) log 
processor
thread and its GossipStage thread.
h3. The deadlock cycle
{code:java}
node3_GlobalLogFollower
  → processPendingInternal()
  → notifyPostCommit()
  → LegacyStateListener.processChangesToLocalState()
  → Gossiper.mergeNodeToGossip()
  → Gossiper.runInGossipStageBlocking()    ← BLOCKS waiting for GossipStage
                                                    ↑
                                                    │ waits for
                                                    │
node3_GossipStage:1                                 │
  → GossipVerbHandler.doVerb()                      │
  → ClusterMetadataService.fetchLogFromPeerAsync()  │
  → PeerLogFetcher.fetchLogEntriesAndWaitInternal() │
  → LocalLog.waitForHighestConsecutive()            │
  → LocalLog.runOnce()                     ← BLOCKS waiting for log to advance
                                                    │
                                            can't advance because
                                            GlobalLogFollower is stuck ↑
{code}
h3. Why this deadlocks

1. *node3_GossipStage* receives a gossip message 
({{{}GossipVerbHandler.doVerb{}}}), discovers the peer has a newer TCM epoch, 
and calls {{{}fetchLogFromPeerAsync(){}}}. This triggers a synchronous wait 
({{{}waitForHighestConsecutive{}}}) for the local log to catch up to the 
fetched epoch. It *occupies the single GossipStage thread* while waiting.

2. *node3_GlobalLogFollower* is the thread that processes TCM log entries. When 
processing a new entry, it fires post-commit listeners. The 
{{LegacyStateListener}} calls
{{Gossiper.runInGossipStageBlocking()}} which submits work to the GossipStage 
and blocks waiting for it to complete.

3. Since GossipStage is already occupied by step 1, GlobalLogFollower blocks. 
Since
GlobalLogFollower is blocked, the local log never advances. Since the log never 
advances,
GossipStage's wait never completes. *Deadlock.*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to