[
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340483#comment-16340483
]
Appy edited comment on HBASE-19803 at 1/26/18 2:55 AM:
-------------------------------------------------------
I think I have cracked it. It's basically this:
{color:red}Edit{color}: Adding more details.
- Some test goes bad (which one? That is basically what we were trying to figure out)
in a way that the JVM dies (hence our category-based timeout is useless here).
- Since the JVM died without reporting to surefire, the plugin's main process
keeps waiting.
- After 900 sec (forkedProcessTimeoutInSeconds), the plugin issues 'shutdown'
to *all* forked JVMs (that's why we see *.dump files), basically killing every
test running at that time. (I don't know why it kills all JVMs; it's just what I am observing.)
** Side note: if findHangingTests.py reports X hanging tests, and the
surefire forkCount is Y, then there should be *at least* ceiling[X/Y]
occurrences of the following message in each *.dump file:
{noformat}
# Created on 2018-01-25T12:14:47.114
Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook.
{noformat}
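The side note above can be checked mechanically. This is only a rough sketch: the function names are hypothetical (they are not part of findHangingTests.py), and it assumes the dump file content has already been read into a string.

```python
import math

# Marker text taken from the *.dump excerpt above.
SHUTDOWN_MARKER = "Killing self fork JVM. Received SHUTDOWN command"


def min_expected_shutdowns(hanging_tests: int, fork_count: int) -> int:
    """Lower bound ceiling[X/Y] from the side note (X hanging tests, Y forks)."""
    return math.ceil(hanging_tests / fork_count)


def count_shutdowns(dump_text: str) -> int:
    """Count SHUTDOWN messages in the text of one *.dump file."""
    return dump_text.count(SHUTDOWN_MARKER)


if __name__ == "__main__":
    sample = (
        "# Created on 2018-01-25T12:14:47.114\n"
        "Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook.\n"
    )
    # Hypothetical numbers: 7 hanging tests reported, forkCount of 3.
    print(min_expected_shutdowns(7, 3), count_shutdowns(sample))
```

If the count in a dump file is below the bound, the reasoning above probably does not explain that run.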
With that figured out, it was easy to find the culprit tests.
Look for the timestamps of the "Killing self fork..." messages in the dump files and find the
tests which started *exactly 900 sec before* each of them. Any hanging test (as reported
by our script) with a start timestamp between these two times was just caught in
the crossfire.
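A minimal sketch of that matching step, under stated assumptions: test start times are supplied as a dict (how they are extracted from the build output is not covered here), the helper names are hypothetical, and a small tolerance is allowed around the 900 sec mark because logged timestamps rarely line up to the millisecond.

```python
from datetime import datetime, timedelta

# forkedProcessTimeoutInSeconds, as mentioned above.
FORK_TIMEOUT = timedelta(seconds=900)


def parse_created_on(line: str) -> datetime:
    """Parse a dump header such as '# Created on 2018-01-25T12:14:47.114'."""
    stamp = line.split("Created on", 1)[1].strip()
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")


def classify(kill_times, test_starts, tolerance=timedelta(seconds=5)):
    """Split hanging tests into culprits and bystanders.

    kill_times:  datetimes of the 'Killing self fork JVM' messages.
    test_starts: dict of test name -> start datetime (assumed input).
    tolerance:   slack around the 900 sec mark (an assumption, not from surefire).
    """
    culprits, bystanders = [], []
    for name, start in sorted(test_starts.items()):
        offsets = [kill - start for kill in kill_times]
        if any(abs(off - FORK_TIMEOUT) <= tolerance for off in offsets):
            culprits.append(name)      # started ~900 sec before a kill
        elif any(timedelta(0) < off < FORK_TIMEOUT for off in offsets):
            bystanders.append(name)    # started inside the window: crossfire
    return culprits, bystanders


if __name__ == "__main__":
    kill = parse_created_on("# Created on 2018-01-25T12:14:47.114")
    starts = {  # hypothetical test names and start times
        "TestA": kill - timedelta(seconds=900),
        "TestB": kill - timedelta(seconds=300),
    }
    print(classify([kill], starts))
```

Tests that started well before the window are put in neither list, since this reasoning says nothing about them.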
Applying the method to run #207
(https://builds.apache.org/job/HBase%20Nightly/job/master/207/) reveals
these three culprit tests:
- security.token.TestTokenAuthentication
- master.balancer.TestStochasticLoadBalancer
- regionserver.TestRegionServerReportForDuty
> False positive for the HBASE-Find-Flaky-Tests job
> -------------------------------------------------
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
> Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream,
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely killed in the middle of the run, within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minute.
> The test is declared as LargeTests, so the time limit should be 10 minutes. It
> seems that the JVM may crash during the mvn test run; we then kill
> all the running tests and may mark some of them as hung, which leads
> to the false positive.