[jira] [Comment Edited] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

Appy (JIRA) Thu, 25 Jan 2018 18:56:21 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340483#comment-16340483
 ]


Appy edited comment on HBASE-19803 at 1/26/18 2:55 AM:
-------------------------------------------------------

I think I have cracked it. {color:red}Edit{color}: adding more details.
Basically, this is what happens:
- Some test goes bad (which one? That is what we were trying to figure out) 
in a way that kills its JVM, so our category-based timeout is useless here.
- Since the JVM died without reporting to surefire, the plugin's main process 
keeps waiting.
- After 900 sec (forkedProcessTimeoutInSeconds), the plugin issues 'shutdown' 
to *all* forked JVMs (that is why we see the *.dump files), killing every 
test running at that time. (IDK why all JVMs; it's just what I am observing.)
** Side note: if findHangingTests.py reports X hanging tests and the surefire 
forkCount is Y, then the following message should appear *at least* 
ceiling[X/Y] times in each *.dump file.
{noformat}
# Created on 2018-01-25T12:14:47.114
Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook.
{noformat}
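
To make the side note concrete, here is a rough sketch (not part of findHangingTests.py) that pulls the shutdown timestamps out of a dump file's text and computes the ceiling[X/Y] lower bound. It assumes every dump entry looks like the sample above, i.e. a "# Created on <timestamp>" line directly followed by the message; the function names are made up for illustration.

```python
import math
import re
from datetime import datetime

# Message text copied from the dump sample above.
SHUTDOWN_MSG = ("Killing self fork JVM. Received SHUTDOWN command "
                "from Maven shutdown hook.")
# Assumption: each dump entry starts with a "# Created on <timestamp>" line.
CREATED_RE = re.compile(r"^# Created on (\S+)")


def shutdown_timestamps(dump_text):
    """Return the 'Created on' timestamp that precedes each SHUTDOWN message."""
    stamps = []
    last_created = None
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        match = CREATED_RE.match(line)
        if match:
            last_created = datetime.strptime(match.group(1),
                                             "%Y-%m-%dT%H:%M:%S.%f")
        elif line == SHUTDOWN_MSG and last_created is not None:
            stamps.append(last_created)
    return stamps


def min_expected_shutdowns(hanging_tests, fork_count):
    """Lower bound ceiling[X/Y] on the number of SHUTDOWN messages."""
    return math.ceil(hanging_tests / fork_count)
```

For example, with 5 reported hanging tests and a forkCount of 2, min_expected_shutdowns(5, 2) gives 3.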

With that figured out, it was easy to find the culprit tests.
Look for the timestamps of the "Killing self fork JVM..." messages in the dump 
files and find the tests which started *exactly 900 sec before* each one. Any 
other hanging test (as reported by our script) whose start timestamp falls 
between those two times was just caught in the crossfire.
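
A minimal sketch of that classification, assuming the test start times have already been collected into a dict (how to get them out of the build logs is not shown here). The function name and the small tolerance window are my own additions; the tolerance is only there because logged timestamps rarely line up to the millisecond.

```python
from datetime import datetime, timedelta

# forkedProcessTimeoutInSeconds, as mentioned above.
FORK_TIMEOUT = timedelta(seconds=900)


def classify_hanging_tests(kill_time, test_starts,
                           tolerance=timedelta(seconds=2)):
    """Split hanging tests into culprits and tests caught in the crossfire.

    kill_time:   timestamp of one "Killing self fork JVM..." message
    test_starts: dict of test name -> start time (hypothetical input)
    tolerance:   slack around "exactly 900 sec before" (an assumption)
    """
    window_start = kill_time - FORK_TIMEOUT
    culprits, crossfire = [], []
    for name, started in sorted(test_starts.items()):
        if abs(started - window_start) <= tolerance:
            # Started right when the 900 sec clock began: likely culprit.
            culprits.append(name)
        elif window_start < started < kill_time:
            # Started later but was still running at shutdown time.
            crossfire.append(name)
    return culprits, crossfire
```

Tests that started well before the 900 sec window end up in neither list, since this shutdown does not explain them.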

Applying this method to run #207 
(https://builds.apache.org/job/HBase%20Nightly/job/master/207/) reveals 
these three culprit tests:
- security.token.TestTokenAuthentication
- master.balancer.TestStochasticLoadBalancer
- regionserver.TestRegionServerReportForDuty


was (Author: appy):
I think i have cracked it, it's basically this:
- Some test goes bad (which one? - basically what we were trying to figure out)
- After 900 sec (forkedProcessTimeoutInSeconds), surefire plugin issues 
'shutdown' to *all* forked JVMs (that's what we see *.dump files), basically 
killing every test running at that time.
** Side note: if findHangingTests.py reports X number of hanging tests, and we 
have surefire forkcount as Y, then there should be *at least* ceiling[X/Y] 
count of the following message in each *.dump file.
{noformat}
# Created on 2018-01-25T12:14:47.114
Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook.
{noformat}

With that figured, it was easy to find culprit tests.
Look for timestamp of "Killed self fork..." messages in dump file and find the 
tests which started *exactly 900 sec before it* Any hanging test (as reported 
by our script) with start timestamp between these two times was just caught in 
cross fire.

Applying the method to #207 run 
(https://builds.apache.org/job/HBase%20Nightly/job/master/207/) will reveal 
these three culprit tests:
- security.token.TestTokenAuthentication
- master.balancer.TestStochasticLoadBalancer
- regionserver.TestRegionServerReportForDuty

> False positive for the HBASE-Find-Flaky-Tests job
> -------------------------------------------------
>
>                 Key: HBASE-19803
>                 URL: https://issues.apache.org/jira/browse/HBASE-19803
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Major
>         Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minute.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the JVM may crash during the mvn test run, after which we kill 
> all the running tests and may mark some of them as hung, which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
