[
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976598#comment-16976598
]
Tibor Digana commented on SUREFIRE-1719:
----------------------------------------
[~paulmillar]
It's strange that you do not have the dump files. I do not have access to your
system, so the best approach would be to deploy Surefire with a separate
SNAPSHOT version to your Nexus repo and let the CI run the build. Comment out
the parts of the code in ForkedBooter and ForkStarter that you suspect. I
could do the same if I had access to your system. This is the way to confirm
the bug. And then we have to investigate deeper.
Make sure that your system, especially the file system, is okay and the machine
does not run out of resources.
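Because the failure is intermittent, a single green CI run with the suspect code commented out confirms little. The sketch below estimates how many consecutive green builds are needed before chance becomes an unlikely explanation. It assumes the roughly 60% per-build failure rate reported in the issue description quoted below, and it assumes builds fail independently of each other. FlakyBuildOdds and its methods are hypothetical helpers for illustration, not part of Surefire or Maven.

```java
// Hypothetical helper for illustration only; not part of Surefire or Maven.
final class FlakyBuildOdds {

    // Probability that `builds` independent builds all pass by chance
    // when each build fails with probability `failureRate`.
    static double allPassByChance(double failureRate, int builds) {
        return Math.pow(1.0 - failureRate, builds);
    }

    // Smallest number of consecutive green builds for which the chance
    // of seeing them with the bug still present is at most `maxChance`.
    static int buildsNeeded(double failureRate, double maxChance) {
        if (failureRate <= 0.0 || failureRate > 1.0 || maxChance <= 0.0) {
            throw new IllegalArgumentException("failureRate must be in (0, 1] and maxChance > 0");
        }
        int builds = 0;
        double chance = 1.0;
        while (chance > maxChance) {
            chance *= 1.0 - failureRate;
            builds++;
        }
        return builds;
    }

    public static void main(String[] args) {
        // Assumed input: the ~60% failure rate from the issue description.
        System.out.println("10 green builds by chance: " + allPassByChance(0.6, 10));
        System.out.println("builds needed for <= 1% chance: " + buildsNeeded(0.6, 0.01));
    }
}
```

Under these assumptions, ten green builds in a row would be very unlikely if the bug were still present, so a criterion of ten clean runs per variant is a reasonable one. A much lower failure rate would need many more runs to reach the same confidence.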
> Race condition results in "VM crash or System.exit called?" failure
> -------------------------------------------------------------------
>
> Key: SUREFIRE-1719
> URL: https://issues.apache.org/jira/browse/SUREFIRE-1719
> Project: Maven Surefire
> Issue Type: Bug
> Components: Maven Surefire Plugin
> Affects Versions: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M2,
> 3.0.0-M1, 3.0.0-M3
> Reporter: Paul Millar
> Priority: Major
> Attachments: build-error-debug.out, build.out, pom.xml
>
>
> After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3,
> unit tests started to fail with the message "ExecutionException The forked VM
> terminated without properly saying goodbye. VM crash or System.exit called?"
> For reference, the command I am using to verify this problem is "mvn -am -pl
> modules/common clean package" and the surefire configuration is:
> {{<plugin>}}
> {{ <groupId>org.apache.maven.plugins</groupId>}}
> {{ <artifactId>maven-surefire-plugin</artifactId>}}
> {{ <configuration>}}
> {{ <includes>}}
> {{ <include>**/*Test.class</include>}}
> {{ <include>**/*Tests.class</include>}}
> {{ </includes>}}
> {{ <!-- dCache uses the singleton anti-pattern in way}}
> {{ too many places. That unfortunately means we have}}
> {{ to accept the overhead of forking each test run. -->}}
> {{ <forkCount>1C</forkCount>}}
> {{ <reuseForks>false</reuseForks>}}
> {{ </configuration>}}
> {{ </plugin>}}
> [The complete pom.xml is attached.]
> This problem is not always present. On our build machine, I've seen the
> problem appear 6 out of 10 times when running the above mvn command. There is
> (apparently) little that seems to influence whether the build will succeed or
> fail.
> [I've attached the complete output from running the above mvn command, both
> the normal output and including the -e -X options.]
> The problem seems to appear only on machines with a "large" number of cores.
> Our build machine has 24 cores, and I've seen a report of a similar problem
> when building dCache on a 48 core machine. On the other hand, I have been
> unable to reproduce the problem on my desktop machine (8 cores) or on my
> laptop (4 cores).
> What seems to matter is the number of actually running JVM instances.
> I have not been able to reproduce the problem by increasing the forkCount on
> a machine with a small number of cores. However, I've noticed that, on an 8
> core machine, increasing the forkCount does not actually result in that many
> more JVM instances running.
> Similarly, experience shows that reducing the number of concurrent JVM
> instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood
> of a problem below 10% (0 failures with 10 builds) on our build machine. On
> this machine, the default configuration would try to run 24 JVM instances
> concurrently (forkCount of "1C" on a 24 core machine).
> The problem appears to have been introduced in surefire v2.20. When building
> with surefire v2.19.1, the above mvn command is always successful on our
> build machine. Building with surefire v2.20 results in intermittent failures
> (~60% failure rate).
> Using git bisection (and with the criterion for "good" as zero failures in 10
> build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342
> Acknowledge normal exit of JVM and drain shared memory between processes" is
> the first commit where surefire has this intermittent failure behaviour.
> From a casual scan through the patch, my guess is that the BYE_ACK support it
> introduces is somehow racy (for example, reading or updating a field-member
> outside of a monitor) and problems are triggered if there are a large number
> of JVMs exiting concurrently. So, with an increased number of concurrent JVMs
> there is an increased risk of a thread losing the race, and so triggering
> this error.
> Such a problem would be consistent with observed behaviour. However, I don't
> have any strong evidence that this is what is happening.
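The kind of race guessed at in the description (a field read or updated outside a monitor) can be illustrated with a small sketch of the guarded alternative. Everything here is hypothetical: ByeAckLatch, acknowledge and awaitAck are invented names, and nothing below is taken from ForkedBooter, ForkStarter or commit da7ff6aa2. It only shows the general pattern in which every access to the flag happens while holding the same monitor.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; these names are not Surefire classes.
final class ByeAckLatch {

    // Guarded by "this": every read and write happens inside the monitor.
    private boolean acked;

    // Called by the thread that receives the acknowledgement.
    synchronized void acknowledge() {
        acked = true;
        notifyAll();
    }

    // Called by the exiting thread. Returns true if the acknowledgement
    // arrived before the timeout, false on timeout or interruption.
    synchronized boolean awaitAck(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            // Loop to tolerate spurious wakeups; the flag is re-checked each time.
            while (!acked) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false;
                }
                TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Runs one acknowledger thread against one waiter and reports the result.
    static boolean demo() {
        ByeAckLatch latch = new ByeAckLatch();
        Thread receiver = new Thread(latch::acknowledge);
        receiver.start();
        boolean observed = latch.awaitAck(5_000);
        try {
            receiver.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("ack observed: " + demo());
    }
}
```

If the flag were instead a plain field written by one thread and polled by another without synchronization, the waiting thread would have no guarantee of ever seeing the update. That is the general shape of the problem suspected above, although whether the patch actually contains such an access is not established here.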