[
https://issues.apache.org/jira/browse/HBASE-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361854#comment-16361854
]
stack commented on HBASE-19965:
-------------------------------
It looks like the test can go on longer than our ten minute timeout. Here is
complaint from last two failures in nightlies.
-------------------------------------------------------------------------------
Test set: org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi
-------------------------------------------------------------------------------
Tests run: 25, Failures: 0, Errors: 2, Skipped: 2, Time elapsed: 572.508 s <<<
FAILURE! - in org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi Time elapsed: 14.642 s
<<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
at
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi.testMergeRegions(TestAsyncRegionAdminApi.java:363)
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi Time elapsed: 14.642 s
<<< ERROR!
java.lang.Exception: Appears to be stuck in thread Socket Reader #1 for port
35917
-------------------------------------------------------------------------------
Test set: org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi
-------------------------------------------------------------------------------
Tests run: 24, Failures: 0, Errors: 2, Skipped: 2, Time elapsed: 574.354 s <<<
FAILURE! - in org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi Time elapsed: 7.735 s
<<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
at
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi.testSplitSwitch(TestAsyncRegionAdminApi.java:285)
org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi Time elapsed: 7.772 s
<<< ERROR!
java.lang.Exception: Appears to be stuck in thread Socket Reader #1 for port
43069
See how there are 20 odd tests run. See how last one was running for 14 seconds
yet we timed out after 600 seconds. Here are start times for the latest run:
{code}
1 2018-02-13 02:58:52,380 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testCompactRegionServer[0] Thread=302,
OpenFileDescriptor=1612, MaxFileDescriptor=1048576, SystemLoadAverage=1886,
ProcessCount=20, AvailableMemoryMB=9729
2 2018-02-13 02:59:21,149 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testGetOnlineRegions[0] Thread=324,
OpenFileDescriptor=1625, MaxFileDescriptor=1048576, SystemLoadAverage=1887,
ProcessCount=17, AvailableMemoryMB=9766
3 2018-02-13 02:59:32,669 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testCompactMob[0] Thread=367,
OpenFileDescriptor=1616, MaxFileDescriptor=1048576, SystemLoadAverage=1814,
ProcessCount=17, AvailableMemoryMB=8805
4 2018-02-13 02:59:50,492 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testSplitTable[0] Thread=373,
OpenFileDescriptor=1630, MaxFileDescriptor=1048576, SystemLoadAverage=1763,
ProcessCount=17, AvailableMemoryMB=8683
5 2018-02-13 03:01:18,069 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testAssignRegionAndUnassignRegion[0] Thread=380,
OpenFileDescriptor=1581, MaxFileDescriptor=1048576, SystemLoadAverage=1761,
ProcessCount=17, AvailableMemoryMB=9252
6 2018-02-13 03:01:28,706 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testCompact[0] Thread=386,
OpenFileDescriptor=1586, MaxFileDescriptor=1048576, SystemLoadAverage=1804,
ProcessCount=17, AvailableMemoryMB=10421
7 2018-02-13 03:02:36,046 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testGetRegionByStateOfTable[0] Thread=360,
OpenFileDescriptor=1573, MaxFileDescriptor=1048576, SystemLoadAverage=1778,
ProcessCount=17, AvailableMemoryMB=10053
8 2018-02-13 03:02:51,851 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testFlushTableAndRegion[0] Thread=387,
OpenFileDescriptor=1581, MaxFileDescriptor=1048576, SystemLoadAverage=1773,
ProcessCount=20, AvailableMemoryMB=9379
9 2018-02-13 03:03:09,421 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testSplitSwitch[0] Thread=405,
OpenFileDescriptor=1590, MaxFileDescriptor=1048576, SystemLoadAverage=1809,
ProcessCount=17, AvailableMemoryMB=8787
10 2018-02-13 03:03:33,784 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testMergeRegions[0] Thread=413,
OpenFileDescriptor=1595, MaxFileDescriptor=1048576, SystemLoadAverage=1795,
ProcessCount=17, AvailableMemoryMB=9152
11 2018-02-13 03:04:01,365 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testGetRegionLocation[0] Thread=414,
OpenFileDescriptor=1579, MaxFileDescriptor=1048576, SystemLoadAverage=1941,
ProcessCount=17, AvailableMemoryMB=9509
12 2018-02-13 03:04:16,832 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testMoveRegion[0] Thread=516,
OpenFileDescriptor=1584, MaxFileDescriptor=1048576, SystemLoadAverage=1914,
ProcessCount=17, AvailableMemoryMB=9343
13 2018-02-13 03:04:31,078 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testCompactRegionServer[1] Thread=533,
OpenFileDescriptor=1587, MaxFileDescriptor=1048576, SystemLoadAverage=1859,
ProcessCount=18, AvailableMemoryMB=9435
14 2018-02-13 03:04:53,780 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testGetOnlineRegions[1] Thread=516,
OpenFileDescriptor=1615, MaxFileDescriptor=1048576, SystemLoadAverage=1963,
ProcessCount=17, AvailableMemoryMB=8486
15 2018-02-13 03:05:02,060 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testCompactMob[1] Thread=530,
OpenFileDescriptor=1611, MaxFileDescriptor=1048576, SystemLoadAverage=1904,
ProcessCount=17, AvailableMemoryMB=7932
16 2018-02-13 03:05:15,861 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testSplitTable[1] Thread=440,
OpenFileDescriptor=1627, MaxFileDescriptor=1048576, SystemLoadAverage=2889,
ProcessCount=17, AvailableMemoryMB=8820
17 2018-02-13 03:06:28,128 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testAssignRegionAndUnassignRegion[1] Thread=415,
OpenFileDescriptor=1573, MaxFileDescriptor=1048576, SystemLoadAverage=2198,
ProcessCount=17, AvailableMemoryMB=8435
18 2018-02-13 03:06:38,684 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testCompact[1] Thread=415,
OpenFileDescriptor=1584, MaxFileDescriptor=1048576, SystemLoadAverage=2224,
ProcessCount=17, AvailableMemoryMB=9374
19 2018-02-13 03:07:49,370 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testGetRegionByStateOfTable[1] Thread=387,
OpenFileDescriptor=1578, MaxFileDescriptor=1048576, SystemLoadAverage=2212,
ProcessCount=17, AvailableMemoryMB=9501
20 2018-02-13 03:08:04,891 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testFlushTableAndRegion[1] Thread=407,
OpenFileDescriptor=1578, MaxFileDescriptor=1048576, SystemLoadAverage=2373,
ProcessCount=17, AvailableMemoryMB=7516
21 2018-02-13 03:08:18,921 INFO [Time-limited test]
hbase.ResourceChecker(148): before:
client.TestAsyncRegionAdminApi#testSplitSwitch[1] Thread=429,
OpenFileDescriptor=1590, MaxFileDescriptor=1048576, SystemLoadAverage=2329,
ProcessCount=17, AvailableMemoryMB=8346
{code}
Nothing taking particularly long...but in aggregate we can see it add up to
close to ten minutes.
Failures of this test are always in the last two test runs... testMergeRegions
or the one before testSplitSwitch.
Of the note the thread dump on timeout seems to be working... Looks like this:
{code}
2018-02-13 03:08:26,355 INFO [PEWorker-9] assignment.RegionStateStore(201):
pid=349 updating hbase:meta
row=testSplitSwitch,,1518491298922.12b2541384b1b1dfd77b118e09a9633c.,
regionState=CLOSED
2018-02-13 03:08:26,355 DEBUG [RS_CLOSE_REGION-regionserver/49484dcb1f1c:0-1]
handler.CloseRegionHandler(125): Closed
testSplitSwitch,,1518491298922.12b2541384b1b1dfd77b118e09a9633c.
====> TEST TIMED OUT. PRINTING THREAD DUMP. <====
Timestamp: 2018-02-13 03:08:26,635
"IPC Server handler 3 on 43069" daemon prio=5 tid=53 timed_waiting
java.lang.Thread.State: TIMED_WAITING
at sun.misc.Unsafe.park(Native Method)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at
org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:136)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2185)
"RS-EventLoopGroup-4-24" daemon prio=10 tid=824 runnable
...
{code}
Let me break up this test.
> Fix flaky TestAsyncRegionAdminApi
> ---------------------------------
>
> Key: HBASE-19965
> URL: https://issues.apache.org/jira/browse/HBASE-19965
> Project: HBase
> Issue Type: Sub-task
> Reporter: Guanghao Zhang
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.0-beta-2
>
>
> See
> [https://builds.apache.org/job/HBase%20Nightly/job/branch-2/284/testReport/junit/org.apache.hadoop.hbase.client/TestAsyncRegionAdminApi/testMergeRegions_0_/]
>
> java.lang.AssertionError: expected:<2> but was:<3> at
> org.apache.hadoop.hbase.client.TestAsyncRegionAdminApi.testMergeRegions(TestAsyncRegionAdminApi.java:359)
>
> Merge regions not work. The table still have 3 regions after the
> MergeRegionsProcedure finished.
> The master start balance region 9e2773ba1efba79a2defa276e9a26ed4. But because
> the MergeRegionsProcedure pid=138 start work first, so the balance need wait
> for the lock. But after merge regions finished, the MoveRegionProcedure
> pid=139 start work and assign 9e2773ba1efba79a2defa276e9a26ed4 to a new
> region server. This is not right. The MoveRegionProcedure should skip to
> assign a region which was marked as offline. Or we should clear the merged
> regions' procedure when MergeRegionsProcedure finished.
>
> Logs:
> 2018-02-08 16:24:44,608 INFO [master/cd4730e3eae2:0.Chore.1]
> master.HMaster(1454): balance
> hri=testMergeRegions,,1518107079782.9e2773ba1efba79a2defa276e9a26ed4.,
> source=cd4730e3eae2,39077,1518106776411,
> destination=cd4730e3eae2,40578,1518106776318
> 2018-02-08 16:24:44,608 DEBUG
> [RpcServer.default.FPBQ.Fifo.handler=4,queue=0,port=37885]
> procedure2.ProcedureExecutor(868): Stored pid=138,
> state=RUNNABLE:MERGE_TABLE_REGIONS_PREPARE; MergeTableRegionsProcedure
> table=testMergeRegions, regions=[9e2773ba1efba79a2defa276e9a26ed4,
> 8f8fd5cd032313e1aadb83e31e1b7479], forcibly=false
> ......
> 2018-02-08 16:24:50,111 INFO [PEWorker-13]
> procedure2.ProcedureExecutor(1249): Finished pid=138, state=SUCCESS;
> MergeTableRegionsProcedure table=testMergeRegions,
> regions=[9e2773ba1efba79a2defa276e9a26ed4, 8f8fd5cd032313e1aadb83e31e1b7479],
> forcibly=false in 5.5710sec
> 2018-02-08 16:24:50,113 INFO [PEWorker-13]
> procedure.MasterProcedureScheduler(813): pid=139,
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
> hri=testMergeRegions,,1518107079782.9e2773ba1efba79a2defa276e9a26ed4.,
> source=cd4730e3eae2,39077,1518106776411,
> destination=cd4730e3eae2,40578,1518106776318 testMergeRegions
> testMergeRegions,,1518107079782.9e2773ba1efba79a2defa276e9a26ed4.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)