[ https://issues.apache.org/jira/browse/HBASE-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612154#comment-16612154 ]

Sean Busbey edited comment on HBASE-21187 at 9/12/18 1:55 PM:
--------------------------------------------------------------

How does the machine info compare for the two? Maybe we have a thrashing 
neighbor on the node?

I believe our Yetus version is new enough that it should have the [Process 
Reaper|http://yetus.apache.org/documentation/0.7.0/precommit-advanced/#process-reaper]
 functionality. IIRC, it came with some underlying functionality to monitor 
processes (e.g. the "process+thread count" in current reports). We could make 
a plugin that uses the same machinery to, e.g., measure CPU or memory use as 
precommit runs.
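
As a rough illustration of the kind of data such a plugin could collect (this 
is only a sketch, not the actual Yetus plugin API; the class name, sampling 
interval, and output format are made up), something like the following could 
periodically record system load and JVM memory use while the build runs:

{noformat}
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical sampler, not part of Yetus: logs coarse resource usage every
// 30 seconds so slow runs can be correlated with load spikes on the node.
public class ResourceSampler {
  public static void main(String[] args) throws InterruptedException {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    Runtime rt = Runtime.getRuntime();
    while (true) {
      double load = os.getSystemLoadAverage(); // 1-minute load avg, -1 if unavailable
      long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
      System.out.printf("%tT load=%.2f cpus=%d heapUsedMB=%d%n",
          System.currentTimeMillis(), load, os.getAvailableProcessors(), usedMb);
      Thread.sleep(30_000L);
    }
  }
}
{noformat}

Comparing that output between a fast node and a slow node would tell us 
whether we are really looking at a noisy-neighbor problem.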



> The HBase UTs are extremely slow on some jenkins node
> -----------------------------------------------------
>
>                 Key: HBASE-21187
>                 URL: https://issues.apache.org/jira/browse/HBASE-21187
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Duo Zhang
>            Priority: Major
>
> Looking at the flaky dashboard for the master branch, the top several UTs 
> tend to fail at the same time. One thing the failed flaky-tests jobs have in 
> common is that their execution time is more than one hour, while successful 
> executions usually take only about half an hour.
> I have compared the output for 
> TestRestoreSnapshotFromClientWithRegionReplicas: in a successful run, the 
> DisableTableProcedure can finish within one second, while in a failed run it 
> can take more than half a minute.
> I am not sure what the real problem is, but for the failed runs there seem 
> to be time holes in the output, i.e., stretches of several seconds with no 
> log output. Like this:
> {noformat}
> 2018-09-11 21:08:08,152 INFO  [PEWorker-4] 
> procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, 
> hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 
> 12.9380sec
> 2018-09-11 21:08:15,590 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=33663] 
> master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> There is no log output for about 7 seconds.
> For a successful run, the same place in the log looks like this:
> {noformat}
> 2018-09-12 07:47:32,488 INFO  [PEWorker-7] 
> procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, 
> hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 
> 1.2220sec
> 2018-09-12 07:47:32,881 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=59079] 
> master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> There is no such hole.
> Maybe there is a big GC pause?
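
If the GC-pause theory needs checking, one way to confirm such stalls 
independently of the test logs is a watchdog thread that sleeps for a short, 
fixed interval and reports whenever it wakes up much later than expected. The 
class below is only a sketch of that technique under assumed names and 
thresholds; it is not something HBase or Yetus ships:

{noformat}
// Minimal pause detector: a large gap between the expected and actual wake-up
// time indicates a stop-the-world GC, swapping, or a starved/overloaded host.
public class PauseDetector implements Runnable {
  private static final long INTERVAL_MS = 100;
  private static final long REPORT_THRESHOLD_MS = 1_000;

  @Override
  public void run() {
    long last = System.nanoTime();
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Thread.sleep(INTERVAL_MS);
      } catch (InterruptedException e) {
        return;
      }
      long now = System.nanoTime();
      long stallMs = (now - last) / 1_000_000 - INTERVAL_MS;
      if (stallMs > REPORT_THRESHOLD_MS) {
        System.err.println("Detected a pause of about " + stallMs + " ms");
      }
      last = now;
    }
  }

  public static void main(String[] args) throws Exception {
    Thread t = new Thread(new PauseDetector(), "pause-detector");
    t.setDaemon(true);
    t.start();
    Thread.sleep(60_000); // keep the JVM alive for the demo
  }
}
{noformat}

Running something like this alongside the tests would show whether the 
7-second holes line up with JVM pauses or with the host itself being 
overloaded.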



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
