[ 
https://issues.apache.org/jira/browse/HBASE-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361854#comment-16361854
 ] 

stack commented on HBASE-19965:
-------------------------------

It looks like the test can go on longer than our ten minute timeout. Here is 
complaint from last two failures in nightlies.

-------------------------------------------------------------------------------
Test set: org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi
-------------------------------------------------------------------------------
Tests run: 25, Failures: 0, Errors: 2, Skipped: 2, Time elapsed: 572.508 s <<< 
FAILURE! - in org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi  Time elapsed: 14.642 s  
<<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
  at 
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi.testMergeRegions(TestAsyncRegionAdminApi.java:363)

org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi  Time elapsed: 14.642 s  
<<< ERROR!
java.lang.Exception: Appears to be stuck in thread Socket Reader #1 for port 
35917



-------------------------------------------------------------------------------
Test set: org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi
-------------------------------------------------------------------------------
Tests run: 24, Failures: 0, Errors: 2, Skipped: 2, Time elapsed: 574.354 s <<< 
FAILURE! - in org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi  Time elapsed: 7.735 s  
<<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
  at 
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi.testSplitSwitch(TestAsyncRegionAdminApi.java:285)

org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi  Time elapsed: 7.772 s  
<<< ERROR!
java.lang.Exception: Appears to be stuck in thread Socket Reader #1 for port 
43069





See how there are 20 odd tests run. See how last one was running for 14 seconds 
yet we timed out after 600 seconds. Here are start times for the latest run:

{code}
  1 2018-02-13 02:58:52,380 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testCompactRegionServer[0] Thread=302, 
OpenFileDescriptor=1612, MaxFileDescriptor=1048576, SystemLoadAverage=1886, 
ProcessCount=20, AvailableMemoryMB=9729
  2 2018-02-13 02:59:21,149 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testGetOnlineRegions[0] Thread=324, 
OpenFileDescriptor=1625, MaxFileDescriptor=1048576, SystemLoadAverage=1887, 
ProcessCount=17, AvailableMemoryMB=9766
  3 2018-02-13 02:59:32,669 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testCompactMob[0] Thread=367, 
OpenFileDescriptor=1616, MaxFileDescriptor=1048576, SystemLoadAverage=1814, 
ProcessCount=17, AvailableMemoryMB=8805
  4 2018-02-13 02:59:50,492 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testSplitTable[0] Thread=373, 
OpenFileDescriptor=1630, MaxFileDescriptor=1048576, SystemLoadAverage=1763, 
ProcessCount=17, AvailableMemoryMB=8683
  5 2018-02-13 03:01:18,069 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testAssignRegionAndUnassignRegion[0] Thread=380, 
OpenFileDescriptor=1581, MaxFileDescriptor=1048576, SystemLoadAverage=1761, 
ProcessCount=17, AvailableMemoryMB=9252
  6 2018-02-13 03:01:28,706 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testCompact[0] Thread=386, 
OpenFileDescriptor=1586, MaxFileDescriptor=1048576, SystemLoadAverage=1804, 
ProcessCount=17, AvailableMemoryMB=10421
  7 2018-02-13 03:02:36,046 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testGetRegionByStateOfTable[0] Thread=360, 
OpenFileDescriptor=1573, MaxFileDescriptor=1048576, SystemLoadAverage=1778, 
ProcessCount=17, AvailableMemoryMB=10053
  8 2018-02-13 03:02:51,851 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testFlushTableAndRegion[0] Thread=387, 
OpenFileDescriptor=1581, MaxFileDescriptor=1048576, SystemLoadAverage=1773, 
ProcessCount=20, AvailableMemoryMB=9379
  9 2018-02-13 03:03:09,421 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testSplitSwitch[0] Thread=405, 
OpenFileDescriptor=1590, MaxFileDescriptor=1048576, SystemLoadAverage=1809, 
ProcessCount=17, AvailableMemoryMB=8787
 10 2018-02-13 03:03:33,784 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testMergeRegions[0] Thread=413, 
OpenFileDescriptor=1595, MaxFileDescriptor=1048576, SystemLoadAverage=1795, 
ProcessCount=17, AvailableMemoryMB=9152
 11 2018-02-13 03:04:01,365 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testGetRegionLocation[0] Thread=414, 
OpenFileDescriptor=1579, MaxFileDescriptor=1048576, SystemLoadAverage=1941, 
ProcessCount=17, AvailableMemoryMB=9509
 12 2018-02-13 03:04:16,832 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testMoveRegion[0] Thread=516, 
OpenFileDescriptor=1584, MaxFileDescriptor=1048576, SystemLoadAverage=1914, 
ProcessCount=17, AvailableMemoryMB=9343
 13 2018-02-13 03:04:31,078 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testCompactRegionServer[1] Thread=533, 
OpenFileDescriptor=1587, MaxFileDescriptor=1048576, SystemLoadAverage=1859, 
ProcessCount=18, AvailableMemoryMB=9435
 14 2018-02-13 03:04:53,780 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testGetOnlineRegions[1] Thread=516, 
OpenFileDescriptor=1615, MaxFileDescriptor=1048576, SystemLoadAverage=1963, 
ProcessCount=17, AvailableMemoryMB=8486
 15 2018-02-13 03:05:02,060 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testCompactMob[1] Thread=530, 
OpenFileDescriptor=1611, MaxFileDescriptor=1048576, SystemLoadAverage=1904, 
ProcessCount=17, AvailableMemoryMB=7932
 16 2018-02-13 03:05:15,861 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testSplitTable[1] Thread=440, 
OpenFileDescriptor=1627, MaxFileDescriptor=1048576, SystemLoadAverage=2889, 
ProcessCount=17, AvailableMemoryMB=8820
 17 2018-02-13 03:06:28,128 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testAssignRegionAndUnassignRegion[1] Thread=415, 
OpenFileDescriptor=1573, MaxFileDescriptor=1048576, SystemLoadAverage=2198, 
ProcessCount=17, AvailableMemoryMB=8435
 18 2018-02-13 03:06:38,684 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testCompact[1] Thread=415, 
OpenFileDescriptor=1584, MaxFileDescriptor=1048576, SystemLoadAverage=2224, 
ProcessCount=17, AvailableMemoryMB=9374
 19 2018-02-13 03:07:49,370 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testGetRegionByStateOfTable[1] Thread=387, 
OpenFileDescriptor=1578, MaxFileDescriptor=1048576, SystemLoadAverage=2212, 
ProcessCount=17, AvailableMemoryMB=9501
 20 2018-02-13 03:08:04,891 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testFlushTableAndRegion[1] Thread=407, 
OpenFileDescriptor=1578, MaxFileDescriptor=1048576, SystemLoadAverage=2373, 
ProcessCount=17, AvailableMemoryMB=7516
 21 2018-02-13 03:08:18,921 INFO  [Time-limited test] 
hbase.ResourceChecker(148): before: 
client.TestAsyncRegionAdminApi#testSplitSwitch[1] Thread=429, 
OpenFileDescriptor=1590, MaxFileDescriptor=1048576, SystemLoadAverage=2329, 
ProcessCount=17, AvailableMemoryMB=8346
{code}


Nothing taking particularly long...but in aggregate we can see it add up to 
close to ten minutes.

Failures of this test are always in the last two test runs... testMergeRegions 
or the one before testSplitSwitch.

Of the note the thread dump on timeout seems to be working... Looks like this:


{code}
2018-02-13 03:08:26,355 INFO  [PEWorker-9] assignment.RegionStateStore(201): 
pid=349 updating hbase:meta 
row=testSplitSwitch,,1518491298922.12b2541384b1b1dfd77b118e09a9633c., 
regionState=CLOSED
2018-02-13 03:08:26,355 DEBUG [RS_CLOSE_REGION-regionserver/49484dcb1f1c:0-1] 
handler.CloseRegionHandler(125): Closed 
testSplitSwitch,,1518491298922.12b2541384b1b1dfd77b118e09a9633c.
====> TEST TIMED OUT. PRINTING THREAD DUMP. <====

Timestamp: 2018-02-13 03:08:26,635

"IPC Server handler 3 on 43069" daemon prio=5 tid=53 timed_waiting
java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at 
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at 
org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:136)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2185)
"RS-EventLoopGroup-4-24" daemon prio=10 tid=824 runnable
...
{code}

Let me break up this test.

> Fix flaky TestAsyncRegionAdminApi
> ---------------------------------
>
>                 Key: HBASE-19965
>                 URL: https://issues.apache.org/jira/browse/HBASE-19965
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Guanghao Zhang
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0-beta-2
>
>
> See 
> [https://builds.apache.org/job/HBase%20Nightly/job/branch-2/284/testReport/junit/org.apache.hadoop.hbase.client/TestAsyncRegionAdminApi/testMergeRegions_0_/]
>  
> java.lang.AssertionError: expected:<2> but was:<3> at 
> org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi.testMergeRegions(TestAsyncRegionAdminApi.java:359)
>  
> Merge regions not work. The table still have 3 regions after the 
> MergeRegionsProcedure finished.
> The master start balance region 9e2773ba1efba79a2defa276e9a26ed4. But because 
> the MergeRegionsProcedure pid=138 start work first, so the balance need wait 
> for the lock. But after merge regions finished, the MoveRegionProcedure 
> pid=139 start work and assign 9e2773ba1efba79a2defa276e9a26ed4 to a new 
> region server. This is not right. The MoveRegionProcedure should skip to 
> assign a region which was marked as offline. Or we should clear the merged 
> regions' procedure when MergeRegionsProcedure finished.
>  
> Logs:
> 2018-02-08 16:24:44,608 INFO [master/cd4730e3eae2:0.Chore.1] 
> master.HMaster(1454): balance 
> hri=testMergeRegions,,1518107079782.9e2773ba1efba79a2defa276e9a26ed4., 
> source=cd4730e3eae2,39077,1518106776411, 
> destination=cd4730e3eae2,40578,1518106776318
> 2018-02-08 16:24:44,608 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=4,queue=0,port=37885] 
> procedure2.ProcedureExecutor(868): Stored pid=138, 
> state=RUNNABLE:MERGE_TABLE_REGIONS_PREPARE; MergeTableRegionsProcedure 
> table=testMergeRegions, regions=[9e2773ba1efba79a2defa276e9a26ed4, 
> 8f8fd5cd032313e1aadb83e31e1b7479], forcibly=false
> ......
> 2018-02-08 16:24:50,111 INFO [PEWorker-13] 
> procedure2.ProcedureExecutor(1249): Finished pid=138, state=SUCCESS; 
> MergeTableRegionsProcedure table=testMergeRegions, 
> regions=[9e2773ba1efba79a2defa276e9a26ed4, 8f8fd5cd032313e1aadb83e31e1b7479], 
> forcibly=false in 5.5710sec
> 2018-02-08 16:24:50,113 INFO [PEWorker-13] 
> procedure.MasterProcedureScheduler(813): pid=139, 
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
> hri=testMergeRegions,,1518107079782.9e2773ba1efba79a2defa276e9a26ed4., 
> source=cd4730e3eae2,39077,1518106776411, 
> destination=cd4730e3eae2,40578,1518106776318 testMergeRegions 
> testMergeRegions,,1518107079782.9e2773ba1efba79a2defa276e9a26ed4.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to