keith-turner opened a new issue #558: Saw deadlock when two tablets tried to 
close around the same time.
URL: https://github.com/apache/accumulo/issues/558
 
 
   Saw a deadlock while running continuous ingest to test 1.9.2 RC1.  I was 
looking into why a long hold time was happening.  Luckily I got this stack 
trace before the agitator whacked the tserver.  
   
   What is going on is that two tablets both try to close around the same 
time.  Each tablet minor compacts with its own tablet lock held (which only 
happens at close).  At the end of the minor compaction, each tablet runs a 
check to see which WALs are referenced.  This check attempts to get every 
tablet's lock, so each closing tablet ends up waiting for the lock the other 
one holds.  Locking needs to be avoided in this check.
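
One way to picture "avoid locking in this check" is the minimal sketch below. It is not Accumulo code and not necessarily the fix that was applied: `WalCheckSketch`, `SketchTablet`, `close`, and the WAL names are hypothetical, and only the method name `removeInUseLogs` is borrowed from the report. The assumption is that each tablet can publish its referenced WALs in a concurrent set, so another tablet can read them without taking that tablet's monitor.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;

class WalCheckSketch {

    /** Hypothetical stand-in for a tablet; this is not Accumulo's Tablet class. */
    static class SketchTablet {
        // Concurrent set, so other threads can read it without this tablet's monitor.
        private final Set<String> referencedWals = ConcurrentHashMap.newKeySet();

        void addWal(String wal) {
            referencedWals.add(wal);
        }

        // Deliberately NOT synchronized: it is called while another tablet's lock
        // is held, so taking this tablet's lock here could form a lock cycle.
        void removeInUseLogs(Set<String> candidates) {
            candidates.removeAll(referencedWals);
        }

        // Holds this tablet's monitor for the whole close, as described in the report.
        synchronized Set<String> close(List<SketchTablet> all, Set<String> allWals,
                                       CyclicBarrier bothLocked) throws Exception {
            bothLocked.await(); // force both tablets to hold their own lock at once
            Set<String> candidates = new HashSet<>(allWals);
            for (SketchTablet t : all) {
                t.removeInUseLogs(candidates); // cross-tablet check, no locking
            }
            return candidates; // WALs that no tablet references
        }
    }

    /** Closes two tablets at the same time and returns the unreferenced WALs. */
    static Set<String> closeBothConcurrently() {
        SketchTablet a = new SketchTablet();
        SketchTablet b = new SketchTablet();
        a.addWal("wal1");
        b.addWal("wal2");
        List<SketchTablet> all = Arrays.asList(a, b);
        Set<String> allWals = new HashSet<>(Arrays.asList("wal1", "wal2", "wal3"));
        CyclicBarrier bothLocked = new CyclicBarrier(2);
        Set<String> unreferenced = ConcurrentHashMap.newKeySet();

        Thread[] closers = new Thread[2];
        for (int i = 0; i < closers.length; i++) {
            SketchTablet tablet = all.get(i);
            closers[i] = new Thread(() -> {
                try {
                    unreferenced.addAll(tablet.close(all, allWals, bothLocked));
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            });
            closers[i].setDaemon(true);
            closers[i].start();
        }
        try {
            for (Thread closer : closers) {
                closer.join(5000);
                if (closer.isAlive()) {
                    throw new IllegalStateException("close did not finish");
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return unreferenced;
    }

    public static void main(String[] args) {
        System.out.println(closeBothConcurrently());
    }
}
```

If `removeInUseLogs` in this sketch were declared `synchronized`, the two closing threads would each hold their own tablet's monitor and block on the other's, which is the cycle shown in the stack trace below.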
   
   I am not 100% sure, but this issue may also occur in 1.9.1, with the 
locking that was added in 84791ec78086474b4b69281c72aab5c3983831b0 for the 
`removeInUseLogs()` method.  It may be that the only continuous ingest test 
that was done for 1.9.1 was with agitation.  This is why it's important to 
test both with and without agitation: agitation hides bugs like this unless 
someone is closely watching the test.  I got lucky when I found this.
   
   ```
   Java stack information for the threads listed above:
   ===================================================
   "Minor compacting !0;~<":
           at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
           - waiting to lock <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
           at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
           at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
           at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
           at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
           at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
           at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
           at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
           at org.apache.accumulo.tserver.tablet.Tablet.minorCompactNow(Tablet.java:1047)
           at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2388)
           at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
           at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:64)
           at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
           at java.lang.Thread.run(Thread.java:748)
   "Minor compacting 2;42005;41804":
           at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
           - waiting to lock <0x0000000794a37d20> (a org.apache.accumulo.tserver.tablet.Tablet)
           at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
           at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
           at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
           at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
           at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
           at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
           at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
           at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1428)
           - locked <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
           at org.apache.accumulo.tserver.tablet.Tablet.split(Tablet.java:2291)
           - locked <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
           at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:2109)
           at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:2089)
           at org.apache.accumulo.tserver.TabletServer.access$2300(TabletServer.java:271)
           at org.apache.accumulo.tserver.TabletServer$SplitRunner.run(TabletServer.java:1978)
           at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
           at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
           at java.lang.Thread.run(Thread.java:748)
   "Minor compacting 2;69c0cc;6980bb":
           at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
           - waiting to lock <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
           at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
           at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
           at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
           at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
           at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
           at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
           at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
           at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1428)
           - locked <0x0000000794a37d20> (a org.apache.accumulo.tserver.tablet.Tablet)
           at org.apache.accumulo.tserver.tablet.Tablet.close(Tablet.java:1318)
           at org.apache.accumulo.tserver.TabletServer$UnloadTabletHandler.run(TabletServer.java:2206)
           at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
           at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
           at java.lang.Thread.run(Thread.java:748)
   
   Found 1 deadlock.
   ```
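
Since a monitor deadlock like the one in this dump is easy to miss unless someone is watching, a test harness could also probe for it from inside the JVM. The sketch below is only an illustration under assumptions: `DeadlockProbeSketch`, `lockInOrder`, and the thread names are hypothetical, and the two plain `Object` locks stand in for two tablets. It builds the same two-lock cycle on purpose and then finds it with the standard `ThreadMXBean.findMonitorDeadlockedThreads()` call.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockProbeSketch {

    // Starts a daemon thread that locks `first`, waits until both threads hold
    // their first lock, then tries to lock `second`.
    static Thread lockInOrder(Object first, Object second,
                              CountDownLatch bothHeld, String name) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHeld.countDown();
                try {
                    bothHeld.await();
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Never reached once the cycle has formed.
                }
            }
        }, name);
        t.setDaemon(true); // lets the JVM exit even though the threads stay stuck
        t.start();
        return t;
    }

    /** Creates a two-lock cycle and returns how many threads the JVM reports as deadlocked. */
    static int countDeadlockedThreads() {
        Object tabletA = new Object(); // stands in for one tablet's monitor
        Object tabletB = new Object(); // stands in for the other tablet's monitor
        CountDownLatch bothHeld = new CountDownLatch(2);
        lockInOrder(tabletA, tabletB, bothHeld, "close-A");
        lockInOrder(tabletB, tabletA, bothHeld, "close-B");

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            // Poll for up to about five seconds; the call returns null when
            // no monitor deadlock exists yet.
            for (int i = 0; i < 100; i++) {
                long[] ids = mx.findMonitorDeadlockedThreads();
                if (ids != null) {
                    return ids.length;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

A periodic check along these lines could log or fail a long-running test as soon as a cycle appears, instead of depending on someone capturing a thread dump in time.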

