[ https://issues.apache.org/jira/browse/CASSANDRA-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272641#comment-16272641 ]
Paulo Motta commented on CASSANDRA-13948:
-----------------------------------------

bq. This ticket is getting quite big and very hard to review

I tried to make things easier by splitting the work into separate commits, but I agree it has become complicated to review.

bq. Could we split out all the pre-existing bugs in other tickets and get them committed separately? Especially this as it involves tokenmetadata.

The problem is that some of the bugs, although pre-existing, only started showing up after this change, so they depend on it. I reorganized [this branch|https://github.com/pauloricardomg/cassandra/tree/3.11-13948] to keep only the changes essential to this ticket, created CASSANDRA-14079 and CASSANDRA-14081 for unrelated minor fixes, and will create two follow-up tickets that depend on this one.

This should be ready for review now. Please let me know if any of the changes are unclear and need a better explanation.

CI looked clean before the reorganization, but I will resubmit with just the essential changes to make sure we didn't miss anything:
* [3.11 patch|https://github.com/pauloricardomg/cassandra/tree/3.11-13948]
* [dtest|https://github.com/pauloricardomg/cassandra-dtest/tree/13948]

> Reload compaction strategies when JBOD disk boundary changes
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-13948
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13948
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>             Fix For: 3.11.x, 4.x
>
>         Attachments: debug.log, dtest13948.png, dtest2.png, threaddump-cleanup.txt, threaddump.txt, trace.log
>
>
> The thread dump below shows a race between an sstable replacement by the
> {{IndexSummaryRedistribution}} and
> {{AbstractCompactionStrategy.getNextBackgroundTask}}:
> {noformat}
> Thread 94580: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=836 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node, int) @bci=67, line=870 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) @bci=17, line=1199 (Compiled frame)
>  - java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock() @bci=5, line=943 (Compiled frame)
>  - org.apache.cassandra.db.compaction.CompactionStrategyManager.handleListChangedNotification(java.lang.Iterable, java.lang.Iterable) @bci=359, line=483 (Interpreted frame)
>  - org.apache.cassandra.db.compaction.CompactionStrategyManager.handleNotification(org.apache.cassandra.notifications.INotification, java.lang.Object) @bci=53, line=555 (Interpreted frame)
>  - org.apache.cassandra.db.lifecycle.Tracker.notifySSTablesChanged(java.util.Collection, java.util.Collection, org.apache.cassandra.db.compaction.OperationType, java.lang.Throwable) @bci=50, line=409 (Interpreted frame)
>  - org.apache.cassandra.db.lifecycle.LifecycleTransaction.doCommit(java.lang.Throwable) @bci=157, line=227 (Interpreted frame)
>  - org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.commit(java.lang.Throwable) @bci=61, line=116 (Compiled frame)
>  - org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.commit() @bci=2, line=200 (Interpreted frame)
>  - org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish() @bci=5, line=185 (Interpreted frame)
>  - org.apache.cassandra.io.sstable.IndexSummaryRedistribution.redistributeSummaries() @bci=559, line=130 (Interpreted frame)
>  - org.apache.cassandra.db.compaction.CompactionManager.runIndexSummaryRedistribution(org.apache.cassandra.io.sstable.IndexSummaryRedistribution) @bci=9, line=1420 (Interpreted frame)
>  - org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries(org.apache.cassandra.io.sstable.IndexSummaryRedistribution) @bci=4, line=250 (Interpreted frame)
>  - org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries() @bci=30, line=228 (Interpreted frame)
>  - org.apache.cassandra.io.sstable.IndexSummaryManager$1.runMayThrow() @bci=4, line=125 (Interpreted frame)
>  - org.apache.cassandra.utils.WrappedRunnable.run() @bci=1, line=28 (Interpreted frame)
>  - org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run() @bci=4, line=118 (Compiled frame)
>  - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
>  - java.util.concurrent.FutureTask.runAndReset() @bci=47, line=308 (Compiled frame)
>  - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask) @bci=1, line=180 (Compiled frame)
>  - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run() @bci=37, line=294 (Compiled frame)
>  - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
>  - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Interpreted frame)
>  - org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(java.lang.Runnable) @bci=1, line=81 (Interpreted frame)
>  - org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$8.run() @bci=4 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=748 (Compiled frame)
> {noformat}
> {noformat}
> Thread 94573: (state = IN_JAVA)
>  - java.util.HashMap$HashIterator.nextNode() @bci=95, line=1441 (Compiled frame; information may be imprecise)
>  - java.util.HashMap$KeyIterator.next() @bci=1, line=1461 (Compiled frame)
>  - org.apache.cassandra.db.lifecycle.View$3.apply(org.apache.cassandra.db.lifecycle.View) @bci=20, line=268 (Compiled frame)
>  - org.apache.cassandra.db.lifecycle.View$3.apply(java.lang.Object) @bci=5, line=265 (Compiled frame)
>  - org.apache.cassandra.db.lifecycle.Tracker.apply(com.google.common.base.Predicate, com.google.common.base.Function) @bci=13, line=133 (Compiled frame)
>  - org.apache.cassandra.db.lifecycle.Tracker.tryModify(java.lang.Iterable, org.apache.cassandra.db.compaction.OperationType) @bci=31, line=99 (Compiled frame)
>  - org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(int) @bci=84, line=139 (Compiled frame)
>  - org.apache.cassandra.db.compaction.CompactionStrategyManager.getNextBackgroundTask(int) @bci=105, line=119 (Interpreted frame)
>  - org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run() @bci=84, line=265 (Interpreted frame)
>  - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
>  - java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame)
>  - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
>  - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Interpreted frame)
>  - org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(java.lang.Runnable) @bci=1, line=81 (Interpreted frame)
>  - org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$8.run() @bci=4 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=748 (Compiled frame)
> {noformat}
> This particular node remained in this state forever, indicating that
> {{LeveledCompactionStrategy.getNextBackgroundTask}} was looping
> indefinitely.
> What happened is that the {{IndexSummaryRedistribution}} thread replaced
> the sstable references on the tracker, so
> {{AbstractCompactionStrategy.getNextBackgroundTask}} could not create the
> transaction with the old references. At the same time, the
> {{IndexSummaryRedistribution}} thread could not update the sstable
> references in the compaction strategy, because
> {{AbstractCompactionStrategy.getNextBackgroundTask}} was holding the
> {{CompactionStrategyManager}} lock.
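
The interaction described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not Cassandra code: the class `StaleReferenceRace`, its `Result` type and the `simulate` method are hypothetical names, the tracker and the strategy's view are reduced to two `AtomicReference`s, and the compaction thread is assumed to hold the read side of a `ReentrantReadWriteLock` (the thread dump only shows the other thread waiting on the write side). The retry loop is bounded so that the example terminates instead of spinning forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class StaleReferenceRace {
    /** Outcome of one simulated run (hypothetical type, for illustration only). */
    public static final class Result {
        public final int failedAttempts;
        public final boolean writerAcquiredWhileLooping;
        public final boolean succeedsAfterUpdate;

        Result(int failedAttempts, boolean writerAcquiredWhileLooping, boolean succeedsAfterUpdate) {
            this.failedAttempts = failedAttempts;
            this.writerAcquiredWhileLooping = writerAcquiredWhileLooping;
            this.succeedsAfterUpdate = succeedsAfterUpdate;
        }
    }

    public static Result simulate(int maxAttempts) throws InterruptedException {
        // Stand-in for the tracker's live sstable reference.
        AtomicReference<Object> trackerLive = new AtomicReference<>(new Object());
        // Stand-in for the reference cached by the compaction strategy.
        AtomicReference<Object> strategyView = new AtomicReference<>(trackerLive.get());
        // Stand-in for the CompactionStrategyManager lock.
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch replaced = new CountDownLatch(1);
        CountDownLatch writerTried = new CountDownLatch(1);
        AtomicInteger failed = new AtomicInteger();

        // Plays the role of getNextBackgroundTask: retries while holding the lock.
        Thread compaction = new Thread(() -> {
            lock.readLock().lock();
            try {
                lockHeld.countDown();
                replaced.await(); // the tracker reference is swapped while we hold the lock
                for (int i = 0; i < maxAttempts; i++) {
                    // A "transaction" can only start if our reference is still live.
                    if (strategyView.get() == trackerLive.get()) {
                        break;
                    }
                    failed.incrementAndGet();
                }
                writerTried.await(); // keep the lock until the writer has tried to get in
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        compaction.start();

        // Plays the role of IndexSummaryRedistribution: replace, then notify.
        lockHeld.await();
        Object replacement = new Object();
        trackerLive.set(replacement);
        replaced.countDown();
        // The notification needs the write lock, which the looping thread blocks.
        boolean acquired = lock.writeLock().tryLock();
        if (acquired) {
            lock.writeLock().unlock();
        }
        writerTried.countDown();
        compaction.join();

        // Once the lock is released, the strategy's view can finally be updated.
        lock.writeLock().lock();
        try {
            strategyView.set(replacement);
        } finally {
            lock.writeLock().unlock();
        }
        boolean succeeds = strategyView.get() == trackerLive.get();
        return new Result(failed.get(), acquired, succeeds);
    }

    public static void main(String[] args) throws InterruptedException {
        Result r = simulate(1000);
        System.out.println("failed attempts: " + r.failedAttempts);
        System.out.println("writer acquired lock while looping: " + r.writerAcquiredWhileLooping);
        System.out.println("succeeds after update: " + r.succeedsAfterUpdate);
    }
}
```

In this sketch every attempt made while the lock is held fails, and the writer cannot acquire the lock until the loop gives up. With an unbounded loop, as in the reported incident, neither thread would make progress.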