[ https://issues.apache.org/jira/browse/ACCUMULO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884498#comment-13884498 ]
Eric Newton commented on ACCUMULO-2269:
---------------------------------------
When a bulk load is run, the updates to the tablets check that the bulk load is
still in progress; if it is not, the files may have already been moved away.
Something was jamming up the tablet servers, and they were just processing old
requests.
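
A minimal sketch of the guard being described, with hypothetical helper names (the
real check lives in the tablet server's bulk import path and is backed by the fate
transaction's state; nothing below is actual Accumulo source):

{code:java}
import java.util.Arrays;
import java.util.List;

// Illustrative only: a tablet server drops a bulk-load request whose originating
// fate transaction is no longer in progress, since the files may already have
// been moved out of the bulk directory by cleanup.
public class BulkLoadGuardSketch {

  // Stand-in for the real arbitration check against fate/ZooKeeper state.
  static boolean bulkLoadStillInProgress(long fateTxId) {
    return false; // a stale request from a finished bulk load sees false here
  }

  static void assignFileToTablet(String file) {
    System.out.println("assigning " + file);
  }

  static void bulkImport(long fateTxId, List<String> files) {
    if (!bulkLoadStillInProgress(fateTxId)) {
      return; // old request; ignore it rather than reference moved files
    }
    for (String file : files) {
      assignFileToTablet(file);
    }
  }

  public static void main(String[] args) {
    // txid taken from the fate dump below, used here only as an example value
    bulkImport(0x1071086efdbed442L, Arrays.asList("I0000001.rf", "I0000002.rf"));
  }
}
{code}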
> Multiple hung fate operations during randomwalk with agitation
> --------------------------------------------------------------
>
> Key: ACCUMULO-2269
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2269
> Project: Accumulo
> Issue Type: Bug
> Components: fate, master
> Environment: 1.5.1-SNAPSHOT: 8981ba04
> Reporter: Josh Elser
> Priority: Critical
> Fix For: 1.5.1
>
>
> Was running LongClean randomwalk with agitation. Came back to the system with
> three tables "stuck" in DELETING on the monitor and a generally idle system.
> Upon investigation, multiple fate txns appear to be deadlocked, in addition
> to the DeleteTable operations.
> {noformat}
> txid: 7ca950aa8de76a17 status: IN_PROGRESS op: DeleteTable  locked: [W:2dc] locking: [] top: CleanUp
> txid: 1071086efdbed442 status: IN_PROGRESS op: BulkImport   locked: [R:2cr] locking: [] top: LoadFiles
> txid: 32b86cfe06c2ed5d status: IN_PROGRESS op: DeleteTable  locked: [W:2d9] locking: [] top: CleanUp
> txid: 358c065b6cb0516b status: IN_PROGRESS op: DeleteTable  locked: [W:2dw] locking: [] top: CleanUp
> txid: 26b738ee0b044a96 status: IN_PROGRESS op: BulkImport   locked: [R:2cr] locking: [] top: CopyFailed
> txid: 16edd31b3723dc5b status: IN_PROGRESS op: BulkImport   locked: [R:2cr] locking: [] top: CopyFailed
> txid: 63c587eb3df6c1b2 status: IN_PROGRESS op: CompactRange locked: [R:2cr] locking: [] top: CompactionDriver
> txid: 722d8e5488531735 status: IN_PROGRESS op: BulkImport   locked: [R:2cr] locking: [] top: CopyFailed
> {noformat}
> I started digging into the DeleteTable ops. Each txn still appears to be
> active and holds the table_lock for its respective table in ZK, but the
> /tables/<id>/ node and all of its children (state, conf, name, etc.) still exist.
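>
> One way to check that state by hand is a direct ZooKeeper read; this is only a
> sketch, and the table_locks path and instance id below are assumptions about
> the layout, not verified paths:
>
> {code:java}
> import org.apache.zookeeper.ZooKeeper;
>
> // Sketch of manually inspecting the ZK state described above. Paths are
> // placeholders (instance id, lock area); table id 2dc comes from the dump.
> public class ZkStateCheckSketch {
>   public static void main(String[] args) throws Exception {
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
>     String instance = "/accumulo/INSTANCE_ID"; // placeholder
>     // The DeleteTable txn should hold a lock node under the table lock area,
>     // while /tables/2dc and its children (name, state, conf) are still present.
>     System.out.println(zk.getChildren(instance + "/table_locks", false));
>     System.out.println(zk.getChildren(instance + "/tables/2dc", false));
>     zk.close();
>   }
> }
> {code}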
> Looking at some thread dumps, I see the default (4) repo runner threads; three
> of them are blocked on bulk imports:
> {noformat}
> "Repo runner 2" daemon prio=10 tid=0x000000000262b800 nid=0x1ae7 waiting on
> condition [0x00007f25168e7000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000705a05eb8> (a
> java.util.concurrent.FutureTask)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
> at java.util.concurrent.FutureTask.get(FutureTask.java:187)
> at
> org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:561)
> at
> org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:449)
> at
> org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:64)
> at
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
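>
> For context, the blocked frames above correspond to a fan-out-and-join pattern
> roughly like the following sketch (not the actual LoadFiles code): work is handed
> to an executor and the repo runner then blocks in Future.get(), so an unresponsive
> tablet server parks the runner indefinitely.
>
> {code:java}
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
>
> // Rough sketch of the pattern visible in the stack above, not Accumulo source.
> public class LoadFilesSketch {
>
>   // Placeholder for the RPC that asks a tablet server to bring a file online.
>   static void sendBulkLoadRpc(String file) throws InterruptedException {
>     Thread.sleep(10); // if the tserver never answers, this never returns
>   }
>
>   public static void main(String[] args) throws Exception {
>     ExecutorService threadPool = Executors.newFixedThreadPool(4);
>     List<String> files = Arrays.asList("I0000001.rf", "I0000002.rf", "I0000003.rf");
>
>     List<Future<Void>> results = new ArrayList<Future<Void>>();
>     for (final String file : files) {
>       results.add(threadPool.submit(new Callable<Void>() {
>         public Void call() throws Exception {
>           sendBulkLoadRpc(file);
>           return null;
>         }
>       }));
>     }
>
>     // This is the FutureTask.get() the "Repo runner" threads are parked in:
>     // the loop cannot advance until every load RPC comes back.
>     for (Future<Void> f : results) {
>       f.get();
>     }
>     threadPool.shutdown();
>   }
> }
> {code}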
> The 4th repo runner is stuck trying to reserve a new txn (not sure why it's
> blocked like this, though):
> {noformat}
> "Repo runner 1" daemon prio=10 tid=0x0000000002627800 nid=0x1ae6 in
> Object.wait() [0x00007f25169e8000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:503)
> at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1313)
> - locked <0x00000007014d9928> (a
> org.apache.zookeeper.ClientCnxn$Packet)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1149)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
> at
> org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:44)
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:67)
> at com.sun.proxy.$Proxy11.getData(Unknown Source)
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:160)
> at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:156)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:52)
> at
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
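>
> That reserve path ends in a plain synchronous ZooKeeper read, so a connection
> that never answers leaves the runner waiting on the request packet forever. A
> minimal sketch of the blocking call in question (the fate node path is a
> placeholder, not the real layout):
>
> {code:java}
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> // Sketch only: ZooStore.reserve() ultimately issues a synchronous getData(),
> // and ClientCnxn.submitRequest() blocks until the ZooKeeper server replies.
> public class ReserveReadSketch {
>   public static void main(String[] args) throws Exception {
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
>     // Placeholder path; the real fate store keeps one node per transaction id.
>     byte[] data = zk.getData("/accumulo/INSTANCE_ID/fate/tx_7ca950aa8de76a17",
>         false, new Stat());
>     System.out.println(new String(data)); // never reached if the reply never comes
>     zk.close();
>   }
> }
> {code}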
> There were no obvious errors on the monitor, and the master is still in this
> state.