[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340483#comment-16340483 ]
Appy commented on HBASE-19803:
------------------------------

I think I have cracked it. It's basically this:
- Some test goes bad (which one? That is basically what we were trying to figure out).
- After 900 sec (forkedProcessTimeoutInSeconds), the surefire plugin issues 'shutdown' to *all* forked JVMs (that's what we see in the *.dump files), basically killing every test running at that time.
** Side note: if findHangingTests.py reports X hanging tests and the surefire forkCount is Y, then there should be *at least* ceiling[X/Y] occurrences of the following message in each *.dump file.
{noformat}
# Created on 2018-01-25T12:14:47.114
Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook.
{noformat}

With that figured out, it was easy to find the culprit tests. Look for the timestamps of the "Killing self fork JVM..." messages in the dump files and find the tests which started *exactly 900 sec before* them. Any hanging test (as reported by our script) with a start timestamp between these two times was just caught in the crossfire.

Applying the method to run #207 (https://builds.apache.org/job/HBase%20Nightly/job/master/207/) reveals these three culprit tests:
- security.token.TestTokenAuthentication
- master.balancer.TestStochasticLoadBalancer
- regionserver.TestRegionServerReportForDuty


> False positive for the HBASE-Find-Flaky-Tests job
> -------------------------------------------------
>
>                 Key: HBASE-19803
>                 URL: https://issues.apache.org/jira/browse/HBASE-19803
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Major
>         Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream,
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minute.
> The test is declared as LargeTests so the time limit should be 10 minutes. It
> seems that the JVM may crash during the mvn test run and then we will kill
> all the running tests and then we may mark some of them as hang which leads
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
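
The method from the comment above (find each "Killing self fork JVM" timestamp, go back 900 sec, and separate tests that started at that moment from those that started later) could be sketched roughly as follows. This is only an illustration, not what findHangingTests.py does. The function names, the `test_starts` mapping of test name to start time, and the 5-second `tolerance` are all hypothetical. How the start times are collected is not covered by the comment, so they are assumed to be available already.

```python
import math
import re
from datetime import datetime, timedelta

# forkedProcessTimeoutInSeconds, as quoted in the comment.
FORK_TIMEOUT = timedelta(seconds=900)

# Matches the "# Created on <timestamp>" header seen in the dump excerpt.
CREATED_RE = re.compile(r"# Created on (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})")
KILL_MARKER = "Killing self fork JVM"


def kill_times(dump_text):
    """Return the timestamp of every 'Killing self fork JVM' entry in a dump."""
    times = []
    last_created = None
    for line in dump_text.splitlines():
        match = CREATED_RE.search(line)
        if match:
            last_created = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if KILL_MARKER in line and last_created is not None:
            times.append(last_created)
            last_created = None
    return times


def classify(test_starts, kill_time, tolerance=timedelta(seconds=5)):
    """Split tests into culprits and those caught in the crossfire.

    test_starts is a hypothetical {test name: start datetime} mapping.
    tolerance is an assumption: the comment says "exactly 900 sec", but
    logged timestamps rarely line up to the millisecond.
    """
    window_start = kill_time - FORK_TIMEOUT
    culprits, collateral = [], []
    for name, started in sorted(test_starts.items()):
        if abs(started - window_start) <= tolerance:
            culprits.append(name)
        elif window_start < started < kill_time:
            collateral.append(name)
    return culprits, collateral


def min_kill_messages(hanging_tests, fork_count):
    """Lower bound from the side note: ceiling[X/Y]."""
    return math.ceil(hanging_tests / fork_count)
```

For example, `classify(starts, kill_times(dump)[0])` would return the tests that started about 900 sec before the first shutdown message, followed by the tests that started inside the window and were merely killed along with them.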