[
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976582#comment-16976582
]
Paul Millar commented on SUREFIRE-1719:
---------------------------------------
Hi Tibor,
Just a quick update.
I've been able to reproduce the problem with Surefire v3.0.0-M4.
Running the same command (mvn -am -pl modules/common clean package), the
unit tests failed with output similar to that reported above. I attempted
building with v3.0.0-M4 ten times and all ten attempts failed. This appears to
be a worse failure rate than described above, although, if this is a race
condition, external factors may be having an impact.
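
For anyone who wants to repeat the ten-build check, here is a minimal Python sketch. It only assumes that mvn is on the PATH and that a non-zero exit status means the build failed; the function names (run_build, count_failures) are made up for illustration and are not part of Maven or Surefire.

```python
import subprocess
from typing import Callable, Sequence

# The command quoted in this comment; adjust the module path for other projects.
MVN_COMMAND = ["mvn", "-am", "-pl", "modules/common", "clean", "package"]


def run_build(command: Sequence[str] = MVN_COMMAND) -> bool:
    """Run one build and report whether Maven exited with status 0."""
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode == 0


def count_failures(attempts: int, build: Callable[[], bool] = run_build) -> int:
    """Run `build` `attempts` times and return how many of the runs failed."""
    return sum(1 for _ in range(attempts) if not build())
```

Calling count_failures(10) from the project root would run the ten builds back to back and return the number of failed attempts. The build argument exists so that the counting logic can be tried out without invoking Maven.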
This is building on a 24-core machine. As before, updating the pom.xml to set
forkCount from "1C" (== 24) to "6" results in all ten build attempts passing.
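
To make the "1C" (== 24) arithmetic explicit, here is a small Python sketch of how a forkCount value might be turned into a number of JVMs. The multiplication by the core count is what Surefire documents for the "C" suffix; the truncation and the minimum of one fork are my assumption about the rounding, and resolve_fork_count is a hypothetical helper, not a Surefire API.

```python
import os
from typing import Optional


def resolve_fork_count(value: str, cores: Optional[int] = None) -> int:
    """Turn a forkCount setting such as "6" or "1C" into a number of JVMs.

    Assumption: a trailing "C" multiplies the number by the core count,
    the product is truncated, and any positive product yields at least 1.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    text = value.strip()
    if text.upper().endswith("C"):
        scaled = float(text[:-1]) * cores
        return max(int(scaled), 1) if scaled > 0 else 0
    # A plain number is used as-is, independent of the core count.
    return int(text)
```

Under these assumptions, resolve_fork_count("1C", 24) gives 24 forks on the build machine, while resolve_fork_count("6", 24) gives 6 regardless of the core count.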
Despite the error message mentioning *.dump and *.dumpstream files, I see no
such files in the target/surefire-reports directory. I also checked /tmp and
/var/tmp: no such files are there, either.
I'm not sure which options I should add to avoid GC logging: I believe this is
off by default. If you have a concrete suggestion, I'd be happy to try it out.
I'm using OpenJDK "v1.8.0_232" JRE "1.8.0_232-b09", OpenJDK 64-bit server
(build 25.232-b09, mixed mode).
One additional observation. The "modules/common" module has 35 test classes.
For each of the 35 classes, I see a Maven log line showing that it is running
that class (lines like "[INFO] Running org.dcache...."). However, there are
*fewer* logged test result lines (lines like "[INFO] Tests run: x, Failures: x,
Errors: x, ...."). So, Maven/Surefire is not logging a result line for some
test classes, despite (apparently) starting that JVM.
The specific "missing" results change from run to run. The precise number of
missing test class results also changes: sometimes only one class is missing,
sometimes it's two classes.
For each test class with a test result line, there is a corresponding .txt and
.xml file in the results/surefire-reports directory. However, the test classes
that have no test result line also have no corresponding .txt or .xml file.
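
The comparison of "Running" lines against "Tests run" lines can be automated. The Python sketch below lists the classes that were started but never got a result line. It assumes that each per-class result line ends in " - in <class name>", which is how I understand recent Surefire versions to log it; the regular expressions and the function name are illustrative and may need adjusting to the actual log.

```python
import re
from typing import Iterable, List

# Assumed log formats:
#   [INFO] Running <class>
#   [INFO] Tests run: x, Failures: x, Errors: x, ... - in <class>
RUNNING = re.compile(r"^\[INFO\] Running (\S+)\s*$")
RESULT = re.compile(r"^\[(?:INFO|WARNING|ERROR)\] Tests run: .* - in (\S+)\s*$")


def classes_without_result(log_lines: Iterable[str]) -> List[str]:
    """Return the sorted test classes that were started but logged no result."""
    started = set()
    finished = set()
    for line in log_lines:
        match = RUNNING.match(line)
        if match:
            started.add(match.group(1))
            continue
        match = RESULT.match(line)
        if match:
            finished.add(match.group(1))
    return sorted(started - finished)
```

With the build output saved to a file, passing the open file object to classes_without_result would name the classes whose result line is missing. The module-wide summary line has no " - in " suffix, so it is ignored.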
HTH,
Paul.
> Race condition results in "VM crash or System.exit called?" failure
> -------------------------------------------------------------------
>
> Key: SUREFIRE-1719
> URL: https://issues.apache.org/jira/browse/SUREFIRE-1719
> Project: Maven Surefire
> Issue Type: Bug
> Components: Maven Surefire Plugin
> Affects Versions: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M2,
> 3.0.0-M1, 3.0.0-M3
> Reporter: Paul Millar
> Priority: Major
> Attachments: build-error-debug.out, build.out, pom.xml
>
>
> After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3,
> unit tests started to fail with the message "ExecutionException The forked VM
> terminated without properly saying goodbye. VM crash or System.exit called?"
> For reference, the command I am using to verify this problem is "mvn -am -pl
> modules/common clean package" and the surefire configuration is:
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-surefire-plugin</artifactId>
>   <configuration>
>     <includes>
>       <include>**/*Test.class</include>
>       <include>**/*Tests.class</include>
>     </includes>
>     <!-- dCache uses the singleton anti-pattern in way
>          too many places. That unfortunately means we have
>          to accept the overhead of forking each test run. -->
>     <forkCount>1C</forkCount>
>     <reuseForks>false</reuseForks>
>   </configuration>
> </plugin>
> [The complete pom.xml is attached.]
> This problem is not always present. On our build machine, I've seen the
> problem appear 6 out of 10 times when running the above mvn command. Nothing
> obvious seems to influence whether the build will succeed or fail.
> [I've attached the complete output from running the above mvn command, both
> the normal output and including the -e -X options.]
> The problem seems to appear only on machines with a "large" number of cores.
> Our build machine has 24 cores, and I've seen a report of a similar problem
> when building dCache on a 48-core machine. On the other hand, I have been
> unable to reproduce the problem on my desktop machine (8 cores) or on my
> laptop (4 cores).
> What seems to matter is the number of actually running JVM instances.
> I have not been able to reproduce the problem by increasing the forkCount on
> a machine with a small number of cores. However, I've noticed that, on an 8
> core machine, increasing the forkCount does not actually result in that many
> more JVM instances running.
> Similarly, experience shows that reducing the number of concurrent JVM
> instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood
> of a problem below 10% (0 failures with 10 builds) on our build machine. On
> this machine, the default configuration would try to run 24 JVM instances
> concurrently (forkCount of "1C" on a 24 core machine).
> The problem appears to have been introduced in surefire v2.20. When building
> with surefire v2.19.1, the above mvn command is always successful on our
> build machine. Building with surefire v2.20 results in intermittent failures
> (~60% failure rate).
> Using git bisection (and with the criterion for "good" as zero failures in 10
> build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342
> Acknowledge normal exit of JVM and drain shared memory between processes" is
> the first commit where surefire has this intermittent failure behaviour.
> From a casual scan through the patch, my guess is that the BYE_ACK support it
> introduces is somehow racy (for example, reading or updating a member field
> outside of a monitor) and problems are triggered if there are a large number
> of JVMs exiting concurrently. So, with an increased number of concurrent JVMs
> there is an increased risk of a thread losing the race, and so triggering
> this error.
> Such a problem would be consistent with observed behaviour. However, I don't
> have any strong evidence that this is what is happening.