[ https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209985#comment-14209985 ]

Andrew Purtell commented on HBASE-12457:
----------------------------------------

I can see a TestRegionReplicas hang. The test is stuck waiting for an 
HTable thread pool to terminate:
{noformat}
"Thread-2297" prio=10 tid=0x00007feee0d1c800 nid=0x6173 waiting on condition 
[0x00007fee508c6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000078e04d4c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
        at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1490)
        at org.apache.hadoop.hbase.regionserver.TestRegionReplicas.afterClass(TestRegionReplicas.java:107)
        at org.apache.hadoop.hbase.regionserver.TestRegionReplicas.restartRegionServer(TestRegionReplicas.java:220)
        at org.apache.hadoop.hbase.regionserver.TestRegionReplicas.testVerifySecondaryAbilityToReadWithOnFiles(TestRegionReplicas.java:421)
{noformat}

A worker thread in the HTable thread pool is hung up trying to get table state:

{noformat}
"htable-pool53-t2" daemon prio=10 tid=0x00007feea454c000 nid=0x566e waiting on 
condition [0x00007feec0365000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1487)
        - locked <0x000000078cc03140> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1522)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1727)
        - locked <0x000000078cc03140> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getTableState(ConnectionManager.java:2504)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isTableDisabled(ConnectionManager.java:894)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1064)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:289)
        at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:135)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:294)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:275)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
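
For anyone who wants to reproduce the shape of this hang outside HBase, here is a minimal, self-contained sketch. It is not HBase code. It assumes, as the two dumps suggest, that the close path does a graceful shutdown followed by awaitTermination, and that the worker sits in a sleep-and-retry loop. The class and method names (PoolCloseHangSketch, retryingTask, closeGracefully, closeWithInterrupt) are made up for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolCloseHangSketch {

    /**
     * Hypothetical stand-in for a worker stuck in a retry loop: it sleeps
     * between attempts and only stops early if it is interrupted.
     */
    static Runnable retryingTask(long sleepMillis, int maxAttempts) {
        return () -> {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                try {
                    Thread.sleep(sleepMillis); // pause before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return;                             // give up on interrupt
                }
            }
        };
    }

    /**
     * Graceful close: shutdown() does not interrupt running workers, so
     * awaitTermination can only succeed once the worker finishes by itself.
     * Returns true if the pool terminated within waitMillis.
     */
    static boolean closeGracefully(ExecutorService pool, long waitMillis)
            throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
    }

    /**
     * Interrupting close: shutdownNow() interrupts running workers, which
     * breaks the sleep in retryingTask and lets the pool terminate.
     */
    static boolean closeWithInterrupt(ExecutorService pool, long waitMillis)
            throws InterruptedException {
        pool.shutdownNow();
        return pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.submit(retryingTask(100, 1000)); // roughly 100 seconds of retries
        System.out.println("graceful close terminated: "
                + closeGracefully(pool, 200));
        System.out.println("interrupting close terminated: "
                + closeWithInterrupt(pool, 5000));
    }
}
```

Run as is, the graceful close should report false, because the worker is still sleeping in its retry loop, and the interrupting close should report true. This only illustrates the mechanism visible in the dumps. It does not say which behavior HTable.close ought to have.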

Not sure how this relates to any compaction changes. At first glance it doesn't 
seem to.


> Regions in transition for a long time when CLOSE interleaves with a slow 
> compaction
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-12457
>                 URL: https://issues.apache.org/jira/browse/HBASE-12457
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.7
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 2.0.0, 0.98.8, 0.99.2
>
>         Attachments: 12457-combined-0.98-v2.txt, 12457-combined-0.98.txt, 
> 12457-combined-trunk.txt, 12457-minifix.txt, 12457.interrupt-v2.txt, 
> 12457.interrupt.txt, HBASE-12457.patch, HBASE-12457_addendum.patch
>
>
> Under heavy load we have observed regions remaining in transition for 20 
> minutes when the master requests a close while a slow compaction is running.
> The pattern is always something like this:
> # RS starts a compaction
> # HM requests the region to be closed on this RS
> # Compaction is not aborted for another 20 minutes
> # The region is in transition and not usable.
> In every case I tracked down so far the time between the requested CLOSE and 
> abort of the compaction is almost exactly 20 minutes, which is suspicious.
> Of course part of the issue is having compactions that take over 20 minutes, 
> but maybe we can do better here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
