[jira] [Commented] (ACCUMULO-2269) Multiple hung fate operations during randomwalk with agitation

Josh Elser (JIRA) Tue, 28 Jan 2014 11:32:10 -0800

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884469#comment-13884469
 ]


Josh Elser commented on ACCUMULO-2269:
--------------------------------------

Also, missed some obvious logs in the monitor:

{noformat}
error sending update to 192.168.56.183:9997: 
ConstraintViolationException(violationSummaries:[TConstraintViolationSummary(constrainClass:org.apache.accumulo.server.constraints.MetadataConstraints,
 violationCode:8, violationDescription:Bulk load transaction no longer running, 
numberOfViolatingMutations:1)])
{noformat}

Getting that repeatedly from 4 of the 5 tservers.

> Multiple hung fate operations during randomwalk with agitation
> --------------------------------------------------------------
>
>                 Key: ACCUMULO-2269
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2269
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>         Environment: 1.5.1-SNAPSHOT: 8981ba04
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 1.5.1
>
>
> Was running LongClean randomwalk with agitation. Came back to the system with 
> three tables "stuck" in DELETING on the monitor and a generally idle system. 
> Upon investigation, multiple fate txns appear to be deadlocked, in addition 
> to the delete tables.
> {noformat}
> txid: 7ca950aa8de76a17  status: IN_PROGRESS         op: DeleteTable      
> locked: [W:2dc]         locking: []              top: CleanUp
> txid: 1071086efdbed442  status: IN_PROGRESS         op: BulkImport       
> locked: [R:2cr]         locking: []              top: LoadFiles
> txid: 32b86cfe06c2ed5d  status: IN_PROGRESS         op: DeleteTable      
> locked: [W:2d9]         locking: []              top: CleanUp
> txid: 358c065b6cb0516b  status: IN_PROGRESS         op: DeleteTable      
> locked: [W:2dw]         locking: []              top: CleanUp
> txid: 26b738ee0b044a96  status: IN_PROGRESS         op: BulkImport       
> locked: [R:2cr]         locking: []              top: CopyFailed
> txid: 16edd31b3723dc5b  status: IN_PROGRESS         op: BulkImport       
> locked: [R:2cr]         locking: []              top: CopyFailed
> txid: 63c587eb3df6c1b2  status: IN_PROGRESS         op: CompactRange     
> locked: [R:2cr]         locking: []              top: CompactionDriver
> txid: 722d8e5488531735  status: IN_PROGRESS         op: BulkImport       
> locked: [R:2cr]         locking: []              top: CopyFailed
> {noformat}
> I started digging into the DeleteTable ops. Each txn still appears to be 
> active and holds the table_lock for their respective table in ZK, but the 
> /tables/id/ node and all of its children (state, conf, name, etc) still exist.
> Looking at some thread dumps, I have the default (4) repo runner threads. 3 
> of them are blocked on bulk imports
> {noformat}
> "Repo runner 2" daemon prio=10 tid=0x000000000262b800 nid=0x1ae7 waiting on 
> condition [0x00007f25168e7000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0000000705a05eb8> (a 
> java.util.concurrent.FutureTask)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:187)
>         at 
> org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:561)
>         at 
> org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:449)
>         at 
> org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:64)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
> The 4th repo runner is stuck trying to reserve a new txn (not sure why he's 
> locked like this though)
> {noformat}
> "Repo runner 1" daemon prio=10 tid=0x0000000002627800 nid=0x1ae6 in 
> Object.wait() [0x00007f25169e8000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1313)
>         - locked <0x00000007014d9928> (a 
> org.apache.zookeeper.ClientCnxn$Packet)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1149)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:44)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:67)
>         at com.sun.proxy.$Proxy11.getData(Unknown Source)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:160)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:156)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:52)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
> There were no obvious errors on the monitor, and the master is still 
> presently in this state.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (ACCUMULO-2269) Multiple hung fate operations during randomwalk with agitation

Reply via email to