keith-turner opened a new issue #558: Saw deadlock when two tablets tried to close around the same time.
URL: https://github.com/apache/accumulo/issues/558

Saw a deadlock while running continuous ingest to test 1.9.2 RC1. I was looking into why a long hold time was happening. Luckily I got this stack trace before the agitator whacked the tserver.

What is going on is that two tablets both try to close around the same time. Each tablet minor compacts with its own tablet lock held (which only happens at close). At the end of the minor compaction, each tablet runs a check to see which WALs are referenced. This check attempts to acquire every tablet's lock, so each closing thread ends up waiting on the lock the other one holds. Locking needs to be avoided in this check.

I am not 100% sure, but this issue may occur in 1.9.1 with locking that was added in 84791ec78086474b4b69281c72aab5c3983831b0 for the `removeInUseLogs()` method. It may be that the only continuous ingest test that was done for 1.9.1 was with agitation. This is why it's important to test both with and without agitation: agitation hides bugs like this unless someone is closely watching the test. I got lucky when I found this.

```
Java stack information for the threads listed above:
===================================================
"Minor compacting !0;~<":
    at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
    - waiting to lock <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
    at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
    at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
    at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
    at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
    at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
    at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
    at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
    at org.apache.accumulo.tserver.tablet.Tablet.minorCompactNow(Tablet.java:1047)
    at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2388)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:64)
    at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:748)
"Minor compacting 2;42005;41804":
    at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
    - waiting to lock <0x0000000794a37d20> (a org.apache.accumulo.tserver.tablet.Tablet)
    at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
    at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
    at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
    at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
    at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
    at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
    at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
    at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1428)
    - locked <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
    at org.apache.accumulo.tserver.tablet.Tablet.split(Tablet.java:2291)
    - locked <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
    at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:2109)
    at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:2089)
    at org.apache.accumulo.tserver.TabletServer.access$2300(TabletServer.java:271)
    at org.apache.accumulo.tserver.TabletServer$SplitRunner.run(TabletServer.java:1978)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:748)
"Minor compacting 2;69c0cc;6980bb":
    at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
    - waiting to lock <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
    at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
    at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
    at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
    at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
    at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
    at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
    at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
    at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1428)
    - locked <0x0000000794a37d20> (a org.apache.accumulo.tserver.tablet.Tablet)
    at org.apache.accumulo.tserver.tablet.Tablet.close(Tablet.java:1318)
    at org.apache.accumulo.tserver.TabletServer$UnloadTabletHandler.run(TabletServer.java:2206)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
```
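
To make the "locking needs to be avoided in this check" point concrete, here is a minimal sketch of one way a WAL reference check could read each tablet's state without taking its lock. It is not the actual Accumulo code or the fix that was applied: `WalCheckSketch`, `SketchTablet`, `setReferencedLogs`, and `checkCompletesWhileLocksHeld` are hypothetical names, and publishing an immutable snapshot through a `volatile` field is an assumed technique chosen only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class WalCheckSketch {

  /** Hypothetical stand-in for a tablet; not the real Accumulo Tablet class. */
  static final class SketchTablet {
    // Immutable snapshot, swapped wholesale by writers; readers need no lock.
    private volatile Set<String> referencedLogs = Collections.emptySet();

    synchronized void setReferencedLogs(Set<String> logs) {
      referencedLogs = Collections.unmodifiableSet(new HashSet<>(logs));
    }

    // Deliberately NOT synchronized: reads the volatile snapshot only.
    void removeInUseLogs(Set<String> candidates) {
      candidates.removeAll(referencedLogs);
    }
  }

  /** Returns the WALs that no tablet references, without locking any tablet. */
  static Set<String> findUnreferencedWals(Set<String> allWals, List<SketchTablet> tablets) {
    Set<String> candidates = new HashSet<>(allWals);
    for (SketchTablet t : tablets) {
      t.removeInUseLogs(candidates);
    }
    return candidates;
  }

  // Mimics a close: hold our own tablet lock, then run the check over all tablets.
  private static Thread closer(SketchTablet own, List<SketchTablet> all, Set<String> wals,
      CountDownLatch bothLocked, CountDownLatch done) {
    Thread t = new Thread(() -> {
      synchronized (own) {
        bothLocked.countDown();
        try {
          bothLocked.await(); // wait until the other closer also holds its lock
          findUnreferencedWals(wals, all);
          done.countDown();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    t.setDaemon(true);
    return t;
  }

  /** True if both closers finish the check while each holds its own tablet lock. */
  static boolean checkCompletesWhileLocksHeld() {
    SketchTablet a = new SketchTablet();
    SketchTablet b = new SketchTablet();
    a.setReferencedLogs(new HashSet<>(Arrays.asList("wal1")));
    b.setReferencedLogs(new HashSet<>(Arrays.asList("wal2")));
    List<SketchTablet> all = Arrays.asList(a, b);
    Set<String> wals = new HashSet<>(Arrays.asList("wal1", "wal2", "wal3"));
    CountDownLatch bothLocked = new CountDownLatch(2);
    CountDownLatch done = new CountDownLatch(2);
    closer(a, all, wals, bothLocked, done).start();
    closer(b, all, wals, bothLocked, done).start();
    try {
      return done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("check completed: " + checkCompletesWhileLocksHeld());
  }
}
```

If `removeInUseLogs` in this sketch were declared `synchronized`, the two closer threads would block on each other's monitors in the same pattern as the trace above: each holds its own tablet lock and waits for the other's.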
