[
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073728#comment-17073728
]
Tigran Mkrtchyan commented on SUREFIRE-1719:
--------------------------------------------
As M5 is not released yet, can we just use thread-based parallelism instead of
forking, or set forkCount=1, as a workaround?
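A minimal sketch of that workaround, assuming the plugin block from the report
below (untested): with forkCount=1 and reuseForks=false, each test class still
gets its own fresh JVM, but only one fork runs at a time, which avoids many
JVMs exiting concurrently.
{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Workaround sketch: a single fork at a time; a fresh JVM per
         test class is preserved by reuseForks=false. -->
    <forkCount>1</forkCount>
    <reuseForks>false</reuseForks>
  </configuration>
</plugin>
{code}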
> Race condition results in "VM crash or System.exit called?" failure
> -------------------------------------------------------------------
>
> Key: SUREFIRE-1719
> URL: https://issues.apache.org/jira/browse/SUREFIRE-1719
> Project: Maven Surefire
> Issue Type: Bug
> Components: Maven Surefire Plugin
> Affects Versions: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M1,
> 3.0.0-M2, 3.0.0-M3
> Reporter: Paul Millar
> Assignee: Tibor Digana
> Priority: Major
> Fix For: 3.0.0-M5
>
> Attachments: build-error-debug.out, build.out, pom.xml
>
>
> After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3,
> unit tests started to fail with the message "ExecutionException The forked VM
> terminated without properly saying goodbye. VM crash or System.exit called?"
> For reference, the command I am using to verify this problem is "mvn -am -pl
> modules/common clean package" and the surefire configuration is:
> {code:xml}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-surefire-plugin</artifactId>
>   <configuration>
>     <includes>
>       <include>**/*Test.class</include>
>       <include>**/*Tests.class</include>
>     </includes>
>     <!-- dCache uses the singleton anti-pattern in way
>          too many places. That unfortunately means we have
>          to accept the overhead of forking each test run. -->
>     <forkCount>1C</forkCount>
>     <reuseForks>false</reuseForks>
>   </configuration>
> </plugin>
> {code}
> [The complete pom.xml is attached.]
> This problem is not always present. On our build machine, I've seen the
> problem appear 6 out of 10 times when running the above mvn command.
> Nothing obvious seems to influence whether the build will succeed or fail.
> [I've attached the complete output from running the above mvn command, both
> the normal output and including the -e -X options.]
> The problem seems to appear only on machines with a "large" number of cores.
> Our build machine has 24 cores, and I've seen a report of a similar problem
> when building dCache on a 48-core machine. On the other hand, I have been
> unable to reproduce the problem on my desktop machine (8 cores) or on my
> laptop (4 cores).
> What seems to matter is the number of actually running JVM instances.
> I have not been able to reproduce the problem by increasing the forkCount on
> a machine with a small number of cores. However, I've noticed that, on an
> 8-core machine, increasing the forkCount does not actually result in that
> many more JVM instances running.
> Similarly, experience shows that reducing the number of concurrent JVM
> instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood
> of failure below 10% (0 failures in 10 builds) on our build machine. On this
> machine, the default configuration would try to run 24 JVM instances
> concurrently (a forkCount of "1C" on a 24-core machine).
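> For illustration, that mitigation as a configuration sketch (values taken
> from this description, not from the attached pom.xml):
> {code:xml}
> <configuration>
>   <!-- A fixed fork count instead of "1C" (one fork per core, i.e. 24
>        here) keeps fewer JVMs exiting concurrently. -->
>   <forkCount>6</forkCount>
>   <reuseForks>false</reuseForks>
> </configuration>
> {code}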
> The problem appears to have been introduced in surefire v2.20. When building
> with surefire v2.19.1, the above mvn command is always successful on our
> build machine. Building with surefire v2.20 results in intermittent failures
> (~60% failure rate).
> Using git bisection (and with the criterion for "good" as zero failures in 10
> build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342
> Acknowledge normal exit of JVM and drain shared memory between processes" is
> the first commit where surefire has this intermittent failure behaviour.
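> For reproducibility, the bisection can be scripted with git's built-in
> driver; a sketch, assuming a hypothetical test.sh that builds surefire,
> installs it locally, then runs the dCache build 10 times and exits non-zero
> on the first failure (tag names illustrative):
> {code:sh}
> git bisect start
> git bisect bad surefire-2.20        # first version with intermittent failures
> git bisect good surefire-2.19.1     # always passes
> # git marks each tested commit good/bad from test.sh's exit status.
> git bisect run ./test.sh
> {code}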
> From a casual scan through the patch, my guess is that the BYE_ACK support it
> introduces is somehow racy (for example, reading or updating a member field
> outside of a monitor) and that problems are triggered when a large number of
> JVMs exit concurrently. So, with an increased number of concurrent JVMs there
> is an increased risk of a thread losing the race, and so triggering this
> error.
> Such a problem would be consistent with observed behaviour. However, I don't
> have any strong evidence that this is what is happening.
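> To make the suspected failure mode concrete, here is a minimal, purely
> hypothetical sketch (not Surefire's actual code) of an exit handshake whose
> acknowledgement flag is read and written outside any monitor:
> {code:java}
> // Hypothetical sketch of the suspected pattern, NOT Surefire's code:
> // the BYE_ACK flag is accessed without synchronization, so the thread
> // reaping the forked JVM may never observe that the ACK arrived.
> public class ByeAckRace {
>     private boolean byeAckReceived;  // neither volatile nor monitor-guarded
>
>     // Called by the thread draining the forked JVM's event stream.
>     void onByeAck() {
>         byeAckReceived = true;       // write outside a monitor
>     }
>
>     // Called by the thread waiting for the forked JVM to exit cleanly.
>     boolean awaitAck(long timeoutMillis) throws InterruptedException {
>         long deadline = System.currentTimeMillis() + timeoutMillis;
>         while (System.currentTimeMillis() < deadline) {
>             if (byeAckReceived) {    // read outside a monitor: the JMM does
>                 return true;         // not guarantee the write is visible
>             }
>             Thread.sleep(10);
>         }
>         // A missed ACK would then be misreported as
>         // "VM crash or System.exit called?"
>         return false;
>     }
> }
> {code}
> In a sketch like this, declaring the flag volatile, or guarding both accesses
> with the same monitor, would remove the race; more concurrent JVMs mean more
> handshakes in flight, which would match the observed correlation with core
> count.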