[ 
https://issues.apache.org/jira/browse/HBASE-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631867#comment-16631867
 ] 

Duo Zhang commented on HBASE-21187:
-----------------------------------

This is build 993
{noformat}
07:41:03 up 81 days, 20 min,  0 users,  load average: 0.92, 0.51, 0.66
{noformat}
We passed.

994
{noformat}
10:09:37 up 81 days,  2:28,  0 users,  load average: 14.13, 12.51, 14.56
{noformat}
Lots of tests failed.

995
{noformat}
10:51:42 up 81 days,  3:15,  0 users,  load average: 3.84, 4.95, 7.31
{noformat}
Only TestRSGroups failed.

996
{noformat}
11:16:46 up 36 days, 13:35,  0 users,  load average: 13.16, 11.57, 11.42
{noformat}
Lots of tests failed.

997
{noformat}
11:46:21 up 81 days,  4:25,  0 users,  load average: 2.34, 3.17, 6.23
{noformat}
Only TestCompactingToCellFlatMapMemStore failed, and not because of a 
timeout: it was just an assertion error, so this one is truly flaky...

So I think the problem is that TRSP involves one more procedure, since we use 
a sub procedure to schedule the remote procedure in order to simplify the 
logic. On an already loaded machine with lots of regions to assign/unassign, 
this is slower because of the extra context switches, and that leads to the 
timeouts...
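
The correlation between load average and failures in the builds above can be checked mechanically. The following is only a rough sketch: the helper names (parse_load_average, is_overloaded) and the cutoff of 10.0 on the 1-minute load average are assumptions chosen for illustration, not something the Jenkins jobs actually use. The uptime lines are copied from the five builds quoted above.

```python
import re

# Matches the three load averages at the end of an `uptime` line.
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load_average(uptime_line):
    """Return the (1, 5, 15)-minute load averages as a tuple of floats."""
    match = LOAD_RE.search(uptime_line)
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def is_overloaded(uptime_line, threshold=10.0):
    """Flag a node as overloaded based on its 1-minute load average.

    The default threshold is an assumed cutoff, picked only because it
    separates the builds quoted in this comment.
    """
    one_minute, _, _ = parse_load_average(uptime_line)
    return one_minute >= threshold


# uptime output of builds 993 to 997, as quoted in this comment.
BUILDS = {
    993: "07:41:03 up 81 days, 20 min,  0 users,  load average: 0.92, 0.51, 0.66",
    994: "10:09:37 up 81 days,  2:28,  0 users,  load average: 14.13, 12.51, 14.56",
    995: "10:51:42 up 81 days,  3:15,  0 users,  load average: 3.84, 4.95, 7.31",
    996: "11:16:46 up 36 days, 13:35,  0 users,  load average: 13.16, 11.57, 11.42",
    997: "11:46:21 up 81 days,  4:25,  0 users,  load average: 2.34, 3.17, 6.23",
}

overloaded = sorted(build for build, line in BUILDS.items() if is_overloaded(line))
print(overloaded)  # [994, 996]
```

With this cutoff, the flagged builds are exactly the two where lots of tests failed.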


> The HBase UTs are extremely slow on some jenkins node
> -----------------------------------------------------
>
>                 Key: HBASE-21187
>                 URL: https://issues.apache.org/jira/browse/HBASE-21187
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Duo Zhang
>            Priority: Major
>
> Looking at the flaky dashboard for the master branch, the top several UTs are 
> likely to fail at the same time. One thing the failed flaky-test jobs have in 
> common is that the execution time is more than one hour, while successful 
> executions usually take only about half an hour.
> I have compared the output of 
> TestRestoreSnapshotFromClientWithRegionReplicas: for a successful run, the 
> DisableTableProcedure finishes within one second, and for the failed run, 
> it can take more than half a minute.
> Not sure what the real problem is, but it seems that for the failed runs, 
> there are time holes in the output, i.e., there is no log output for 
> several seconds. Like this:
> {noformat}
> 2018-09-11 21:08:08,152 INFO  [PEWorker-4] 
> procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, 
> hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 
> 12.9380sec
> 2018-09-11 21:08:15,590 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=33663] 
> master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> No log output for about 7 seconds.
> And for a successful run, at the same place:
> {noformat}
> 2018-09-12 07:47:32,488 INFO  [PEWorker-7] 
> procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, 
> hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 
> 1.2220sec
> 2018-09-12 07:47:32,881 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=59079] 
> master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> There is no such hole.
> Maybe there is a big GC?
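
The "time holes" described in the issue can be located by comparing the timestamps of consecutive log lines. This is a minimal sketch, assuming the log4j-style timestamp prefix seen in the excerpts above; the function name find_time_holes and the 5-second minimum gap are hypothetical choices for illustration. Wrapped continuation lines, which do not start with a timestamp, are skipped.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted logs, e.g. "2018-09-11 21:08:08,152".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_time_holes(lines, min_gap_seconds=5.0):
    """Return (previous, current, gap_seconds) for each gap >= min_gap_seconds.

    min_gap_seconds is an assumed threshold, not a value from the issue.
    """
    holes = []
    previous = None
    for line in lines:
        match = TS_RE.match(line)
        if match is None:
            continue  # continuation of a wrapped log line
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap >= min_gap_seconds:
                holes.append((previous, current, gap))
        previous = current
    return holes


# Log excerpts from the issue description, with the original line wrapping.
FAILED_RUN = [
    "2018-09-11 21:08:08,152 INFO  [PEWorker-4] ",
    "procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, ",
    "hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in ",
    "12.9380sec",
    "2018-09-11 21:08:15,590 DEBUG ",
    "[RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=33663] ",
    "master.MasterRpcServices(1174): Checking to see if procedure is done pid=490",
]
SUCCESSFUL_RUN = [
    "2018-09-12 07:47:32,488 INFO  [PEWorker-7] ",
    "procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, ",
    "hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in ",
    "1.2220sec",
    "2018-09-12 07:47:32,881 DEBUG ",
    "[RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=59079] ",
    "master.MasterRpcServices(1174): Checking to see if procedure is done pid=490",
]

for start, end, gap in find_time_holes(FAILED_RUN):
    print("no log output for %.3f seconds after %s" % (gap, start))
```

Run against a full test log, a scan like this would show whether the holes cluster around particular procedures or are spread evenly, which might help tell a GC pause apart from general CPU starvation.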



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
