[ https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209985#comment-14209985 ]
Andrew Purtell commented on HBASE-12457: ---------------------------------------- I can see a TestRegionReplicas hang. We are getting hung up on waiting for a HTable thread pool to terminate: {noformat} "Thread-2297" prio=10 tid=0x00007feee0d1c800 nid=0x6173 waiting on condition [0x00007fee508c6000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000000078e04d4c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468) at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1490) at org.apache.hadoop.hbase.regionserver.TestRegionReplicas.afterClass(TestRegionReplicas.java:107) at org.apache.hadoop.hbase.regionserver.TestRegionReplicas.restartRegionServer(TestRegionReplicas.java:220) at org.apache.hadoop.hbase.regionserver.TestRegionReplicas.testVerifySecondaryAbilityToReadWithOnFiles(TestRegionReplicas.java:421) {noformat} A worker thread in the HTable thread pool is hung up trying to get table state: {noformat} "htable-pool53-t2" daemon prio=10 tid=0x00007feea454c000 nid=0x566e waiting on condition [0x00007feec0365000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1487) - locked <0x000000078cc03140> (a java.lang.Object) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1522) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1727) - locked <0x000000078cc03140> (a java.lang.Object) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getTableState(ConnectionManager.java:2504) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isTableDisabled(ConnectionManager.java:894) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1064) at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:289) at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:135) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:294) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:275) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} Not sure how this relates to any compaction changes. At first glance it doesn't seem to. > Regions in transition for a long time when CLOSE interleaves with a slow > compaction > ----------------------------------------------------------------------------------- > > Key: HBASE-12457 > URL: https://issues.apache.org/jira/browse/HBASE-12457 > Project: HBase > Issue Type: Bug > Affects Versions: 0.98.7 > Reporter: Lars Hofhansl > Assignee: Lars Hofhansl > Fix For: 2.0.0, 0.98.8, 0.99.2 > > Attachments: 12457-combined-0.98-v2.txt, 12457-combined-0.98.txt, > 12457-combined-trunk.txt, 12457-minifix.txt, 12457.interrupt-v2.txt, > 12457.interrupt.txt, HBASE-12457.patch, HBASE-12457_addendum.patch > > > Under heave load we have observed regions remaining in transition for 20 > minutes when the master requests a close while a slow compaction is running. > The pattern is always something like this: > # RS starts a compaction > # HM request the region to be closed on this RS > # Compaction is not aborted for another 20 minutes > # The region is in transition and not usable. > In every case I tracked down so far the time between the requested CLOSE and > abort of the compaction is almost exactly 20 minutes, which is suspicious. > Of course part of the issue is having compactions that take over 20 minutes, > but maybe we can do better here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)