[
https://issues.apache.org/jira/browse/HBASE-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612154#comment-16612154
]
Sean Busbey edited comment on HBASE-21187 at 9/12/18 1:55 PM:
--------------------------------------------------------------
How does the machine info compare for the two? Maybe we have a thrashing
neighbor on the node?
I believe our Yetus version is new enough that it should have the [Process
Reaper|http://yetus.apache.org/documentation/0.7.0/precommit-advanced/#process-reaper]
functionality. IIRC, it came with some underlying functionality to monitor
processes (e.g. the "process+thread count" in current reports). We could make
a plugin that uses the same mechanism to measure, for example, CPU or memory
use as precommit runs.
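To make the idea concrete, here is a minimal sketch of the kind of sampling such a plugin could do. It is not a Yetus plugin and uses no Yetus APIs; the function name `sample_load` and its parameters are hypothetical, and it assumes a Unix host where `os.getloadavg()` is available.

```python
import os
import time


def sample_load(samples, interval_seconds, read_load=os.getloadavg, sleep=time.sleep):
    """Collect (elapsed_seconds, one_minute_load) pairs at a fixed interval.

    read_load and sleep are injectable so the sampler can be exercised
    without waiting or depending on the real host load.
    """
    readings = []
    start = time.monotonic()
    for i in range(samples):
        load1, _, _ = read_load()  # 1-, 5- and 15-minute load averages
        readings.append((time.monotonic() - start, load1))
        if i < samples - 1:
            sleep(interval_seconds)
    return readings


if __name__ == "__main__":
    # Take three samples one second apart and print them.
    for elapsed, load1 in sample_load(3, 1.0):
        print(f"t+{elapsed:.1f}s load1={load1:.2f}")
```

Comparing such samples between a slow node and a fast one would show whether the slow runs coincide with a heavily loaded host.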
> The HBase UTs are extremely slow on some jenkins node
> -----------------------------------------------------
>
> Key: HBASE-21187
> URL: https://issues.apache.org/jira/browse/HBASE-21187
> Project: HBase
> Issue Type: Bug
> Components: test
> Reporter: Duo Zhang
> Priority: Major
>
> Looking at the flaky dashboard for the master branch, the top several UTs are
> likely to fail at the same time. One thing the failed flaky-test jobs have in
> common is that the execution time is more than one hour, while successful
> executions usually take only about half an hour.
> I have also compared the output of
> TestRestoreSnapshotFromClientWithRegionReplicas: in a successful run, the
> DisableTableProcedure finishes within one second, while in a failed run
> it can take more than half a minute.
> I am not sure what the real problem is, but in the failed runs there seem to
> be time holes in the output, i.e., periods of several seconds with no log
> output. For example:
> {noformat}
> 2018-09-11 21:08:08,152 INFO [PEWorker-4]
> procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS,
> hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in
> 12.9380sec
> 2018-09-11 21:08:15,590 DEBUG
> [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=33663]
> master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> There is no log output for about 7 seconds.
> And for a successful run, at the same place:
> {noformat}
> 2018-09-12 07:47:32,488 INFO [PEWorker-7]
> procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS,
> hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in
> 1.2220sec
> 2018-09-12 07:47:32,881 DEBUG
> [RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=59079]
> master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> There is no such hole.
> Maybe there is a big GC pause?
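The time holes described in the issue can also be found mechanically rather than by eye. The following is a rough sketch, assuming log lines start with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp as in the excerpts above; the function name `find_log_gaps` and the 5-second default threshold are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp at the start of a log line, e.g. "2018-09-11 21:08:08,152 ".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def find_log_gaps(lines, threshold_seconds=5.0):
    """Return (previous_ts, current_ts, gap_seconds) for each gap >= threshold."""
    gaps = []
    prev = None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # wrapped message or stack trace line without a timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap >= threshold_seconds:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps


if __name__ == "__main__":
    failed_run = [
        "2018-09-11 21:08:08,152 INFO [PEWorker-4] procedure2.ProcedureExecutor(1500): Finished pid=490",
        "2018-09-11 21:08:15,590 DEBUG [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=33663] "
        "master.MasterRpcServices(1174): Checking to see if procedure is done pid=490",
    ]
    for prev, ts, gap in find_log_gaps(failed_run):
        print(f"{gap:.3f}s without output between {prev} and {ts}")
```

Running this over the full test output of a failed and a successful run would show how often, and where, the holes occur. That could then be lined up against GC logs or host load.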