Jon Meredith created CASSANDRA-18443:
----------------------------------------
Summary: Deadlock updating sstable metadata if disk boundaries need reloading
Key: CASSANDRA-18443
URL: https://issues.apache.org/jira/browse/CASSANDRA-18443
Project: Cassandra
Issue Type: Improvement
Components: Local/Compaction, Local/Memtable, Local/SSTable
Reporter: Jon Meredith
Assignee: Jon Meredith
{{CompactionStrategyManager.handleNotification}} holds the read lock while
processing notifications. When handling metadata changed notifications, an
extra call is made to {{maybeReloadDiskBoundaries}}, which tries to acquire the
write lock on the same thread. A {{ReentrantReadWriteLock}} cannot be upgraded
from read to write, so the thread deadlocks against itself.
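The read-to-write upgrade problem can be reproduced in isolation. The following is a minimal sketch, not Cassandra code: the class and method names are hypothetical, and a timed {{tryLock}} stands in for the blocking {{lock()}} call so the demo returns instead of hanging.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not part of Cassandra.
final class LockUpgradeDemo {
    /** Returns true if the write lock could be taken while this thread holds the read lock. */
    static boolean tryUpgrade(long timeoutMs) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        lock.readLock().lock();
        try {
            // A plain writeLock().lock() here would park forever, because the
            // write lock waits for all read holds (including our own) to drain.
            boolean gotWrite = lock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (gotWrite) {
                lock.writeLock().unlock();
            }
            return gotWrite;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("upgrade succeeded: " + tryUpgrade(100)); // prints "upgrade succeeded: false"
    }
}
```

The timeout only exists to make the failure observable; with the untimed call seen in the stack trace, the thread never wakes up.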
Partial stack trace:
{code}
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x00000005cc000078> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire
at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock
at org.apache.cassandra.db.compaction.CompactionStrategyManager.maybeReloadDiskBoundaries(CompactionStrategyManager.java:495)
at org.apache.cassandra.db.compaction.CompactionStrategyManager.getCompactionStrategyFor(CompactionStrategyManager.java:343)
at org.apache.cassandra.db.compaction.CompactionStrategyManager.handleMetadataChangedNotification(CompactionStrategyManager.java:796)
at org.apache.cassandra.db.compaction.CompactionStrategyManager.handleNotification(CompactionStrategyManager.java:838)
at org.apache.cassandra.db.lifecycle.Tracker.notifySSTableMetadataChanged(Tracker.java:482)
at org.apache.cassandra.db.compaction.CompactionStrategyManager.handleNotification(CompactionStrategyManager.java:838)
{code}
Deadlocking with the read lock held blocks the SlabpoolCleaner while it is
notifying ColumnFamilyStore, so memtables cannot be flushed and recycled. Any
thread applying a mutation to the database (at least GossipStage and
MutationStage) then stalls, and the node is considered down by peers and/or
backs up with pending requests.
All the cases investigated were during single sstable upleveling by
{{org.apache.cassandra.db.compaction.SingleSSTableLCSTask}} added in
CASSANDRA-12526.
Other, less critical work was also affected: JMX calls to get estimated
remaining compaction tasks, the index summary manager redistributing summaries,
the StatusLogger trying to log dropped messages, and the ValidationManager.
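The report does not spell out why work that only needs the read lock also stalls. One plausible mechanism, offered here as an assumption rather than a finding from the ticket, is that a nonfair {{ReentrantReadWriteLock}} makes new readers wait when a writer is parked at the head of its queue. The sketch below uses hypothetical names and interrupts the stuck thread at the end purely so the demo can exit.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not part of Cassandra.
final class BlockedReadersDemo {
    /** Returns true if a second thread gets the read lock while another thread is stuck upgrading. */
    static boolean secondReaderAcquires(long timeoutMs) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // nonfair by default
        Thread stuck = new Thread(() -> {
            lock.readLock().lock();
            try {
                // Parks here: the write lock waits on this thread's own read hold.
                lock.writeLock().lockInterruptibly();
                lock.writeLock().unlock();
            } catch (InterruptedException e) {
                // Interrupted by the demo so it can finish; an untimed lock() would stay parked.
            } finally {
                lock.readLock().unlock();
            }
        });
        stuck.start();
        try {
            // Wait until the stuck thread is actually queued for the write lock.
            while (!lock.hasQueuedThread(stuck)) {
                Thread.sleep(1);
            }
            // The timed tryLock honours the queued writer, like a plain lock() would.
            boolean acquired = lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.readLock().unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            stuck.interrupt();
            try {
                stuck.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("second reader acquired: " + secondReaderAcquires(100));
    }
}
```

If this is what happened on the affected hosts, a single self-deadlocked thread would be enough to stall every later reader of the same lock, which would be consistent with the breadth of the symptoms listed above.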
The workaround is to reboot the affected host.
The fix is to just remove the redundant disk boundary reload check on that path.
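The shape of that fix can be illustrated with a small sketch. This is not the actual patch: the class, fields, and method bodies below are hypothetical stand-ins, assuming only what the description states, namely that the notification path already holds the read lock and should not trigger a reload that needs the write lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the locking pattern; not the real CompactionStrategyManager.
final class StrategyManagerSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile boolean boundariesStale = true;
    private volatile int reloads = 0;

    /** Takes the write lock, so it must never run while this thread holds the read lock. */
    void maybeReloadDiskBoundaries() {
        if (!boundariesStale) {
            return;
        }
        lock.writeLock().lock();
        try {
            if (boundariesStale) {
                reloads++;
                boundariesStale = false;
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Entry point for callers that hold no lock: reload first, then read. */
    String getStrategyFor(String sstable) {
        maybeReloadDiskBoundaries();
        lock.readLock().lock();
        try {
            return lookup(sstable);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Notification path: holds the read lock and uses the lookup with no reload check. */
    String handleMetadataChanged(String sstable) {
        lock.readLock().lock();
        try {
            return lookup(sstable);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Pure lookup; safe under the read lock because it never asks for the write lock.
    private String lookup(String sstable) {
        return "strategy-for-" + sstable;
    }

    int reloadCount() {
        return reloads;
    }

    public static void main(String[] args) {
        StrategyManagerSketch manager = new StrategyManagerSketch();
        System.out.println(manager.handleMetadataChanged("sstable-1")); // returns without blocking
        System.out.println("reloads: " + manager.reloadCount());
    }
}
```

In this sketch the reload still happens on the next call that enters without a lock, so skipping it on the notification path only defers it.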