[ 
https://issues.apache.org/jira/browse/HBASE-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612168#comment-16612168
 ] 

Sean Busbey commented on HBASE-21187:
-------------------------------------

{quote}
I believe our Yetus version is new enough that it should have the Process 
Reaper functionality. IIRC, it came with some underlying functionality to 
monitor processes (e.g. the "process+thread count" in current reports). We 
could make a plugin that uses the same thing to e.g. measure CPU or memory use 
as precommit goes.
{quote}

For this to help, we'd have to convert our flaky job to rely on Yetus. That 
might be useful anyway, since it would mean we could easily run in Docker with 
all the protection against wild tests that we already have there. It shouldn't 
be too hard, given that our personality already supports a filter for "only run 
these tests" and Yetus 0.8.0 adds an "only run tests named in this file" CLI 
option for Maven builds.

> The HBase UTs are extremely slow on some jenkins node
> -----------------------------------------------------
>
>                 Key: HBASE-21187
>                 URL: https://issues.apache.org/jira/browse/HBASE-21187
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Duo Zhang
>            Priority: Major
>
> Looking at the flaky dashboard for the master branch, the top several UTs 
> are likely to fail at the same time. One thing the failed flaky-test jobs 
> have in common is that the execution time is more than one hour, whereas 
> successful executions usually take only about half an hour.
> I also compared the output of 
> TestRestoreSnapshotFromClientWithRegionReplicas: in a successful run, the 
> DisableTableProcedure finishes within one second, while in a failed run it 
> can take more than half a minute.
> I'm not sure what the real problem is, but the failed runs seem to have time 
> holes in the output, i.e., stretches of several seconds with no log output. 
> For example:
> {noformat}
> 2018-09-11 21:08:08,152 INFO  [PEWorker-4] 
> procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, 
> hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 
> 12.9380sec
> 2018-09-11 21:08:15,590 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=33663] 
> master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> There is no log output for about 7 seconds.
> And for a successful run, at the same place:
> {noformat}
> 2018-09-12 07:47:32,488 INFO  [PEWorker-7] 
> procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, 
> hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 
> 1.2220sec
> 2018-09-12 07:47:32,881 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=59079] 
> master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> There is no such hole.
> Maybe there is a big GC pause?
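
The "time holes" in the quoted description could be located mechanically 
rather than by eye. What follows is only a rough sketch, not anything that 
exists in HBase or Yetus: the function name find_gaps and the 5-second default 
threshold are made up for illustration, and it assumes every log entry starts 
with a timestamp in the "yyyy-MM-dd HH:mm:ss,SSS" form shown in the excerpts.

```python
import re
from datetime import datetime

# Leading timestamp of a log entry, e.g. "2018-09-11 21:08:08,152".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def find_gaps(lines, threshold_seconds=5.0):
    """Yield (previous, current, gap_seconds) for each silent stretch.

    find_gaps and threshold_seconds are hypothetical names for this sketch.
    """
    previous = None
    for line in lines:
        # Tolerate mail-style "> " quoting in front of the timestamp.
        match = TIMESTAMP.match(line.lstrip("> "))
        if not match:
            continue  # continuation line or stack trace, no timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        current = current.replace(microsecond=int(match.group(2)) * 1000)
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap >= threshold_seconds:
                yield previous, current, gap
        previous = current
```

Feeding it the lines of a test output file (for example via open() on the 
surefire output) would list each pause together with the timestamps on either 
side, which could then be compared against GC logs from the same run.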



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
