[ 
https://issues.apache.org/jira/browse/ACCUMULO-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276542#comment-14276542
 ] 

Josh Elser commented on ACCUMULO-3471:
--------------------------------------

Looking at the code and jstack of a tserver, I think I just convinced myself 
the numbers were moving faster. Practically all of the threads are stuck trying 
to get the recovery lock instead of actually doing anything. Then, like Denis 
said, there's one assignment updating the metadata table.

{noformat}
"tablet assignment 1" daemon prio=10 tid=0x000000000269e800 nid=0x5bc8 in Object.wait() [0x00007f47b9a2c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.waitRTE(TabletServerBatchWriter.java:438)
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.close(TabletServerBatchWriter.java:340)
        - locked <0x0000000605fe16f8> (a org.apache.accumulo.core.client.impl.TabletServerBatchWriter)
        at org.apache.accumulo.core.client.impl.BatchWriterImpl.close(BatchWriterImpl.java:54)
        at org.apache.accumulo.server.master.state.MetaDataStateStore.setLocations(MetaDataStateStore.java:80)
        at org.apache.accumulo.server.master.state.TabletStateStore.setLocation(TabletStateStore.java:83)
        at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2143)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
        at org.apache.accumulo.core.trace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at org.apache.accumulo.core.trace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:745)

"tablet assignment 2" daemon prio=10 tid=0x0000000002118000 nid=0x5bca waiting on condition [0x00007f47b9c2e000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000061aff0998> (a java.util.concurrent.locks.ReentrantLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:229)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2201)
        at org.apache.accumulo.tserver.TabletServer.access$2600(TabletServer.java:246)
        at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2118)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
        at org.apache.accumulo.core.trace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at org.apache.accumulo.core.trace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
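
For anyone who wants to check the same thing on their own tserver, here is a rough sketch of how one might tally a saved jstack dump by frame rather than eyeballing it. It is not part of Accumulo: the class name JstackFrameCounter, the method countThreadsByFrame, and the shortened sample dump in main are all made up for illustration. It only assumes that each thread's stack starts with a line beginning with a double quote, as in the output above, and it matches frames by plain substring.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Accumulo: counts how many thread stacks
// in a jstack dump mention each of the given frame substrings.
public class JstackFrameCounter {

    public static Map<String, Integer> countThreadsByFrame(String dump, String... frames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String frame : frames) {
            counts.put(frame, 0);
        }
        StringBuilder block = new StringBuilder();
        for (String line : dump.split("\n")) {
            // A line starting with a double quote opens a new thread's stack.
            if (line.startsWith("\"")) {
                tally(block.toString(), counts);
                block.setLength(0);
            }
            block.append(line).append('\n');
        }
        tally(block.toString(), counts); // last thread in the dump
        return counts;
    }

    private static void tally(String block, Map<String, Integer> counts) {
        if (!block.startsWith("\"")) {
            return; // preamble or empty text before the first thread header
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (block.contains(entry.getKey())) {
                entry.setValue(entry.getValue() + 1);
            }
        }
    }

    public static void main(String[] args) {
        // Shortened, made-up dump in the same shape as the jstack output above.
        String dump = String.join("\n",
            "\"tablet assignment 1\" daemon prio=10",
            "        at org.apache.accumulo.server.master.state.MetaDataStateStore.setLocations(MetaDataStateStore.java:80)",
            "",
            "\"tablet assignment 2\" daemon prio=10",
            "        at org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2201)",
            "",
            "\"tablet assignment 3\" daemon prio=10",
            "        at org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2201)");
        System.out.println(countThreadsByFrame(dump,
            "TabletServer.acquireRecoveryMemory",
            "MetaDataStateStore.setLocations"));
    }
}
```

If most assignment threads land in the acquireRecoveryMemory bucket and only one in setLocations, that would match what the dump above shows. Substring matching is deliberately crude; it will miss a frame whose name was split across a wrapped line.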



> Adding a new tserver puts some tables offline for few minutes
> -------------------------------------------------------------
>
>                 Key: ACCUMULO-3471
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3471
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 12.04
>            Reporter: Denis Petrov
>             Fix For: 1.6.2, 1.7.0
>
>         Attachments: ACCUMULO-3471-balance-test.patch
>
>
> I run an Accumulo cluster with 15 tservers and about 6000 tablets on each 
> (the disks are quite slow: each node has 2*4Tb SATA).
> When a new tserver is added to the cluster, the rebalancing procedure starts.
> During this procedure some tablets are offline and unreachable for 5-10 
> minutes.
> This is visible at http://monitor:50095/tables and as timeouts on the client side.
> Rebalancing caused by killing a tserver converges much faster than 
> rebalancing caused by adding a tserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
