[ https://issues.apache.org/jira/browse/ACCUMULO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884498#comment-13884498 ]

Eric Newton commented on ACCUMULO-2269:
---------------------------------------

When a bulk load is run, the updates to the tablets check that the bulk load is 
still in progress. If it is not, the files may have already been moved away. 
Something was jamming up the tablet servers, and they were just processing old 
requests.
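
As an illustration of that check, here is a minimal sketch only, with hypothetical type and method names (this is not the actual tablet server code): before a tablet applies a bulk-loaded file, it asks whatever tracks live bulk imports whether the transaction is still alive, and refuses the update otherwise.

{noformat}
import java.io.IOException;

/**
 * Minimal sketch of the check described above. The Arbitrator interface and
 * the method names are illustrative, not the real Accumulo API.
 */
class BulkLoadCheckSketch {

  /** Hypothetical view of whatever tracks live bulk import transactions. */
  interface Arbitrator {
    boolean transactionAlive(long tid);
  }

  private final Arbitrator arbitrator;

  BulkLoadCheckSketch(Arbitrator arbitrator) {
    this.arbitrator = arbitrator;
  }

  void applyBulkLoad(long tid, String file) throws IOException {
    if (!arbitrator.transactionAlive(tid)) {
      // The fate op already finished or failed; its files may be gone.
      throw new IOException("bulk import " + Long.toHexString(tid)
          + " is no longer in progress");
    }
    importFile(file); // stand-in for the real tablet import
  }

  private void importFile(String file) {
    // placeholder for the actual file import
  }
}
{noformat}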


> Multiple hung fate operations during randomwalk with agitation
> --------------------------------------------------------------
>
>                 Key: ACCUMULO-2269
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2269
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>         Environment: 1.5.1-SNAPSHOT: 8981ba04
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 1.5.1
>
>
> Was running LongClean randomwalk with agitation. Came back to the system with 
> three tables "stuck" in DELETING on the monitor and a generally idle system. 
> Upon investigation, multiple fate txns appear to be deadlocked, not just the 
> table deletions.
> {noformat}
> txid: 7ca950aa8de76a17  status: IN_PROGRESS         op: DeleteTable      locked: [W:2dc]         locking: []              top: CleanUp
> txid: 1071086efdbed442  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: LoadFiles
> txid: 32b86cfe06c2ed5d  status: IN_PROGRESS         op: DeleteTable      locked: [W:2d9]         locking: []              top: CleanUp
> txid: 358c065b6cb0516b  status: IN_PROGRESS         op: DeleteTable      locked: [W:2dw]         locking: []              top: CleanUp
> txid: 26b738ee0b044a96  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: CopyFailed
> txid: 16edd31b3723dc5b  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: CopyFailed
> txid: 63c587eb3df6c1b2  status: IN_PROGRESS         op: CompactRange     locked: [R:2cr]         locking: []              top: CompactionDriver
> txid: 722d8e5488531735  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: CopyFailed
> {noformat}
> I started digging into the DeleteTable ops. Each txn still appears to be 
> active and holds the table_lock for its respective table in ZK, but the 
> /tables/id/ node and all of its children (state, conf, name, etc.) still exist.
> Looking at some thread dumps, I have the default (4) repo runner threads. Three 
> of them are blocked on bulk imports (a simplified sketch of this wait pattern 
> follows the dump):
> {noformat}
> "Repo runner 2" daemon prio=10 tid=0x000000000262b800 nid=0x1ae7 waiting on 
> condition [0x00007f25168e7000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0000000705a05eb8> (a 
> java.util.concurrent.FutureTask)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:187)
>         at 
> org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:561)
>         at 
> org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:449)
>         at 
> org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:64)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
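> A simplified sketch of that wait pattern (not the actual LoadFiles code; the 
> pool size and task body are placeholders):
> {noformat}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
> 
> /**
>  * Simplified sketch: work is handed to a pool and each Future is awaited with
>  * get() and no timeout, so a single task that never completes keeps the
>  * calling repo runner parked in FutureTask.awaitDone() indefinitely.
>  */
> class LoadWaitSketch {
>   public static void main(String[] args) throws Exception {
>     ExecutorService pool = Executors.newFixedThreadPool(4);
>     List<Future<Void>> results = new ArrayList<Future<Void>>();
>     for (int i = 0; i < 4; i++) {
>       results.add(pool.submit(new Callable<Void>() {
>         public Void call() {
>           // stand-in for an RPC asking a tablet server to load files
>           return null;
>         }
>       }));
>     }
>     for (Future<Void> f : results) {
>       f.get(); // no timeout: one hung RPC blocks this thread forever
>     }
>     pool.shutdown();
>   }
> }
> {noformat}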
> The 4th repo runner is stuck trying to reserve a new txn (not sure why it's 
> blocked like this, though); a rough sketch of the blocking read follows the dump:
> {noformat}
> "Repo runner 1" daemon prio=10 tid=0x0000000002627800 nid=0x1ae6 in 
> Object.wait() [0x00007f25169e8000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1313)
>         - locked <0x00000007014d9928> (a 
> org.apache.zookeeper.ClientCnxn$Packet)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1149)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:44)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:67)
>         at com.sun.proxy.$Proxy11.getData(Unknown Source)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:160)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:156)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:52)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
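> A rough sketch of the blocking read underneath ZooStore.reserve() (the 
> connection string and znode path are illustrative placeholders, not taken from 
> this instance):
> {noformat}
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
> 
> /**
>  * Rough sketch only: ZooKeeper.getData() is a synchronous call, so the client
>  * thread waits on the request packet until a response arrives. That matches
>  * the Object.wait() inside ClientCnxn.submitRequest() in the dump above; if
>  * the client connection is wedged, the repo runner sits here.
>  */
> class ReserveReadSketch {
>   public static void main(String[] args) throws Exception {
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
>       public void process(WatchedEvent event) {}
>     });
>     // illustrative path for a fate transaction node
>     byte[] data = zk.getData("/accumulo/INSTANCE_ID/fate/tx_1071086efdbed442", false, null);
>     System.out.println(data == null ? 0 : data.length);
>     zk.close();
>   }
> }
> {noformat}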
> There were no obvious errors on the monitor, and the master is still in this 
> state.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
