[
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975676#comment-16975676
]
Tibor Digana commented on SUREFIRE-1719:
----------------------------------------
[~paulmillar]
Please configure your CI system so that you can access the build files on the
CI file system and inspect the dump files located in
{{target/surefire-reports}}. These streams are often corrupted by native logs
printed by the JVM GC to std/out and std/in. Try to make the GC quiet for one
run. My suspicion is that the GC corrupted the std/out stream and the BYE event
was lost. This happened at the end, after all tests had completed: the GC
printed its messages just as the forked JVM sent BYE, and the fork then waited
30 seconds for the acknowledgement and killed itself when it did not arrive.
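To make the sequence above concrete, here is a minimal Java sketch of a fork-side goodbye handshake with a timeout. It is not Surefire's implementation: the class {{ByeHandshake}} and its methods are hypothetical names chosen for illustration, and the sending of BYE is reduced to a {{Runnable}}.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical fork-side handshake; an illustration, not Surefire's actual code. */
final class ByeHandshake {
    // Released once the parent's acknowledgement has been read.
    private final CountDownLatch ackReceived = new CountDownLatch(1);

    /** Called by whichever thread reads commands coming from the parent process. */
    void onAckFromParent() {
        ackReceived.countDown();
    }

    /**
     * Sends the goodbye event, then waits for the acknowledgement.
     * Returns true if the acknowledgement arrived within the timeout.
     */
    boolean sayGoodbye(Runnable sendBye, long timeout, TimeUnit unit) {
        sendBye.run();
        try {
            return ackReceived.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        // Parent answers immediately: the fork can exit normally.
        ByeHandshake acknowledged = new ByeHandshake();
        boolean ok = acknowledged.sayGoodbye(acknowledged::onAckFromParent, 30, TimeUnit.SECONDS);
        System.out.println("ack received: " + ok);

        // The BYE event is lost (for example in a corrupted stream), so no ack arrives.
        ByeHandshake lost = new ByeHandshake();
        boolean arrived = lost.sayGoodbye(() -> { }, 100, TimeUnit.MILLISECONDS);
        System.out.println("ack received: " + arrived);
    }
}
```

In the second case the caller sees {{false}} after the timeout, which corresponds to the situation where the fork gives up waiting and terminates itself.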
We will introduce a TCP connector instead of process pipes in 3.0.0-M5.
Meanwhile, I would like you to confirm this problem with the corrupted stream,
which can be seen in the dump file produced by Surefire.
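If it helps on a CI machine, a small Java sketch like the one below could list candidate dump files after a build. It rests on an assumption: that the dump files can be recognised by a {{.dump}} or {{.dumpstream}} suffix. Please check the actual file names in your own {{target/surefire-reports}} directory. The class name {{DumpFileFinder}} is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that lists dump files in a Surefire reports directory. */
final class DumpFileFinder {

    /** Assumption: dump files end in ".dump" or ".dumpstream". */
    private static boolean looksLikeDump(Path file) {
        String name = file.getFileName().toString();
        return name.endsWith(".dump") || name.endsWith(".dumpstream");
    }

    /** Returns the matching regular files sorted by path, or an empty list if the directory is missing. */
    static List<Path> findDumpFiles(Path reportsDir) {
        if (!Files.isDirectory(reportsDir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> entries = Files.list(reportsDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(DumpFileFinder::looksLikeDump)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Default to the directory named in the comment above; pass another path as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "target/surefire-reports");
        for (Path dump : findDumpFiles(dir)) {
            System.out.println(dump + " (" + Files.size(dump) + " bytes)");
        }
    }
}
```

In a multi-module build each module has its own {{target}} directory, so the sketch would need to be run once per module.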
> Race condition results in "VM crash or System.exit called?" failure
> -------------------------------------------------------------------
>
> Key: SUREFIRE-1719
> URL: https://issues.apache.org/jira/browse/SUREFIRE-1719
> Project: Maven Surefire
> Issue Type: Bug
> Components: Maven Surefire Plugin
> Affects Versions: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M2,
> 3.0.0-M1, 3.0.0-M3
> Reporter: Paul Millar
> Priority: Major
> Attachments: build-error-debug.out, build.out, pom.xml
>
>
> After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3,
> unit tests started to fail with the message "ExecutionException The forked VM
> terminated without properly saying goodbye. VM crash or System.exit called?"
> For reference, the command I am using to verify this problem is "mvn -am -pl
> modules/common clean package" and the surefire configuration is:
> {{<plugin>}}
> {{  <groupId>org.apache.maven.plugins</groupId>}}
> {{  <artifactId>maven-surefire-plugin</artifactId>}}
> {{  <configuration>}}
> {{    <includes>}}
> {{      <include>**/*Test.class</include>}}
> {{      <include>**/*Tests.class</include>}}
> {{    </includes>}}
> {{    <!-- dCache uses the singleton anti-pattern in way}}
> {{         too many places. That unfortunately means we have}}
> {{         to accept the overhead of forking each test run. -->}}
> {{    <forkCount>1C</forkCount>}}
> {{    <reuseForks>false</reuseForks>}}
> {{  </configuration>}}
> {{</plugin>}}
> [The complete pom.xml is attached.]
> This problem is not always present. On our build machine, I've seen the
> problem appear 6 out of 10 times when running the above mvn command. There is
> (apparently) little that seems to influence whether the build will succeed or
> fail.
> [I've attached the complete output from running the above mvn command, both
> the normal output and including the -e -X options.]
> The problem seems to appear only on machines with a "large" number of cores.
> Our build machine has 24 cores, and I've seen a report of a similar problem
> when building dCache on a 48-core machine. On the other hand, I have been
> unable to reproduce the problem on my desktop machine (8 cores) or on my
> laptop (4 cores).
> What seems to matter is the number of actually running JVM instances.
> I have not been able to reproduce the problem by increasing the forkCount on
> a machine with a small number of cores. However, I've noticed that, on an 8
> core machine, increasing the forkCount does not actually result in that many
> more JVM instances running.
> Similarly, experience shows that reducing the number of concurrent JVM
> instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood
> of a problem below 10% (0 failures with 10 builds) on our build machine. On
> this machine, the default configuration would try to run 24 JVM instances
> concurrently (forkCount of "1C" on a 24 core machine).
> The problem appears to have been introduced in surefire v2.20. When building
> with surefire v2.19.1, the above mvn command is always successful on our
> build machine. Building with surefire v2.20 results in intermittent failures
> (~60% failure rate).
> Using git bisection (and with the criterion for "good" as zero failures in 10
> build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342
> Acknowledge normal exit of JVM and drain shared memory between processes" is
> the first commit where surefire has this intermittent failure behaviour.
> From a casual scan through the patch, my guess is that the BYE_ACK support it
> introduces is somehow racy (for example, reading or updating a field-member
> outside of a monitor) and problems are triggered if there are a large number
> of JVMs exiting concurrently. So, with an increased number of concurrent JVMs
> there is an increased risk of a thread losing the race, and so triggering
> this error.
> Such a problem would be consistent with observed behaviour. However, I don't
> have any strong evidence that this is what is happening.
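For reference, the figure of 24 concurrent JVMs in the report follows from how a {{forkCount}} value ending in "C" is interpreted: the number is multiplied by the count of available CPU cores. A minimal Java sketch of that arithmetic is given below. The class and method names are hypothetical, and truncating the product to an integer is an assumption rather than confirmed plugin behaviour.

```java
/** Hypothetical helper illustrating how a forkCount value maps to a number of JVMs. */
final class ForkCount {

    /**
     * Resolves a forkCount setting such as "6", "1C" or "2.5C".
     * A trailing "C" means "multiply by the number of available cores".
     * Assumption: fractional results are truncated towards zero.
     */
    static int effectiveForkCount(String forkCount, int availableCores) {
        String value = forkCount.trim();
        if (value.endsWith("C")) {
            double multiplier = Double.parseDouble(value.substring(0, value.length() - 1));
            return (int) (multiplier * availableCores);
        }
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("1C on this machine: " + effectiveForkCount("1C", cores));
        System.out.println("1C on 24 cores:     " + effectiveForkCount("1C", 24));
        System.out.println("6 on 24 cores:      " + effectiveForkCount("6", 24));
    }
}
```

With {{reuseForks}} set to false, each test class gets a fresh JVM, so this number is an upper bound on the forks alive at one time rather than the total number of JVMs started during the build.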