[ 
https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976582#comment-16976582
 ] 

Paul Millar commented on SUREFIRE-1719:
---------------------------------------

Hi Tibor,

Just a quick update.

I've been able to reproduce the problem with Surefire v3.0.0-M4.

Running the same command (mvn -am -pl modules/common clean package), the 
unit-tests failed with output similar to that reported above.  I attempted 
building with v3.0.0-M4 ten times and all ten attempts failed.  This is 
seemingly a worse failure rate than described above; although, if this is a 
race-condition, external factors may be having an impact.

This is building on a 24-core machine.  As before, updating the pom.xml to set 
forkCount from "1C" (== 24) to "6" results in all ten build attempts passing.

Despite the error message mentioning *.dump and *.dumpstream files, I see no 
such files in the target/surefire-reports directory.  I also checked /tmp and 
/var/tmp: no such files are there, either.

I'm not sure which options I should add to avoid GC logging: I believe this is 
off by default.  If you have a concrete suggestion, I'd be happy to try it out.  
I'm using OpenJDK "v1.8.0_232" JRE "1.8.0_232-b09", OpenJDK 64-bit server 
(build 25.232-b09, mixed mode).

One additional observation.  The "modules/common" module has 35 test classes.  
For each of the 35 classes, Maven logs that it is running that class 
(lines like "[INFO] Running org.dcache....").  However, there are *fewer* 
logged test result lines (lines like "[INFO] Tests run: x, Failures: x, Errors: 
x, ....").  So, maven/surefire is not logging a result line for some test 
classes, despite (apparently) starting that JVM. 

The specific "missing" results change from run to run.  The precise number of 
missing test class results also changes: sometimes only one class is missing, 
sometimes two.

For each test class with a test result line, there is a corresponding .txt and 
.xml file in the target/surefire-reports directory.  However, the test classes 
that have no test result line also have no corresponding .txt or .xml file.
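
The comparison of "Running" lines against "Tests run" lines can be automated.  
Below is a rough sketch in Python, not the method used for the observations 
above.  It assumes that each per-class result line ends with " - in 
<class name>", which is how recent Surefire versions appear to print it; the 
function name missing_results and the two patterns are my own.

```python
import re
from typing import Iterable, Set

# Assumed log formats; adjust the patterns if your Surefire output differs.
RUNNING = re.compile(r"^\[INFO\] Running (\S+)\s*$")
RESULT = re.compile(r"^\[(?:INFO|WARNING|ERROR)\] Tests run: .* - in (\S+)\s*$")


def missing_results(lines: Iterable[str]) -> Set[str]:
    """Return the classes that were started but never logged a result line."""
    started: Set[str] = set()
    reported: Set[str] = set()
    for line in lines:
        match = RUNNING.match(line)
        if match:
            started.add(match.group(1))
            continue
        match = RESULT.match(line)
        if match:
            reported.add(match.group(1))
    return started - reported
```

Feeding it the lines of a saved build log (for example, the result of 
open("build.out").readlines()) returns the set of test classes without a 
result line.  The overall summary line at the end of the build has no " - in " 
suffix, so it is ignored.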

HTH,
Paul.

> Race condition results in "VM crash or System.exit called?" failure
> -------------------------------------------------------------------
>
>                 Key: SUREFIRE-1719
>                 URL: https://issues.apache.org/jira/browse/SUREFIRE-1719
>             Project: Maven Surefire
>          Issue Type: Bug
>          Components: Maven Surefire Plugin
>    Affects Versions: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M2, 
> 3.0.0-M1, 3.0.0-M3
>            Reporter: Paul Millar
>            Priority: Major
>         Attachments: build-error-debug.out, build.out, pom.xml
>
>
> After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3, 
> unit tests started to fail with the message "ExecutionException The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?"
> For reference, the command I am using to verify this problem is "mvn -am -pl 
> modules/common clean package" and the surefire configuration is:
> {{<plugin>}}
> {{  <groupId>org.apache.maven.plugins</groupId>}}
> {{  <artifactId>maven-surefire-plugin</artifactId>}}
> {{  <configuration>}}
> {{    <includes>}}
> {{      <include>**/*Test.class</include>}}
> {{      <include>**/*Tests.class</include>}}
> {{    </includes>}}
> {{    <!-- dCache uses the singleton anti-pattern in way}}
> {{    too many places. That unfortunately means we have}}
> {{    to accept the overhead of forking each test run. -->}}
> {{    <forkCount>1C</forkCount>}}
> {{    <reuseForks>false</reuseForks>}}
> {{  </configuration>}}
> {{ </plugin>}}
> [The complete pom.xml is attached.]
> This problem is not always present. On our build machine, I've seen the 
> problem appear 6 out of 10 times when running the above mvn command. There is 
> (apparently) little that seems to influence whether the build will succeed or 
> fail.
> [I've attached the complete output from running the above mvn command, both 
> the normal output and including the -e -X options.]
> The problem seems to appear only on machines with a "large" number of cores. 
> Our build machine has 24 cores, and I've seen a report of a similar problem 
> when building dCache on a 48-core machine. On the other hand, I have been 
> unable to reproduce the problem on my desktop machine (8 cores) or on my 
> laptop (4 cores).
> What seems to matter is the number of actually running JVM instances.
> I have not been able to reproduce the problem by increasing the forkCount on 
> a machine with a small number of cores. However, I've noticed that, on an 8 
> core machine, increasing the forkCount does not actually result in that many 
> more JVM instances running.
> Similarly, experience shows that reducing the number of concurrent JVM 
> instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood 
> of a problem below 10% (0 failures with 10 builds) on our build machine.  On 
> this machine, the default configuration would try to run 24 JVM instances 
> concurrently (forkCount of "1C" on a 24 core machine).
> The problem appears to have been introduced in surefire v2.20. When building 
> with surefire v2.19.1, the above mvn command is always successful on our 
> build machine.  Building with surefire v2.20 results in intermittent failures 
> (~60% failure rate).
> Using git bisection (and with the criterion for "good" as zero failures in 10 
> build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342 
> Acknowledge normal exit of JVM and drain shared memory between processes" is 
> the first commit where surefire has this intermittent failure behaviour.
> From a casual scan through the patch, my guess is that the BYE_ACK support it 
> introduces is somehow racy (for example, reading or updating a field-member 
> outside of a monitor) and problems are triggered if there are a large number 
> of JVMs exiting concurrently.  So, with an increased number of concurrent JVMs 
> there is an increased risk of a thread losing the race, and so triggering 
> this error.
> Such a problem would be consistent with observed behaviour.  However, I don't 
> have any strong evidence that this is what is happening.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
