Hi Uwe,

I've downloaded lucene-5.0-2013-03-05_15-37-06.zip from https://builds.apache.org/job/Lucene-Artifacts-trunk/2212/artifact/lucene/dist/

I don't have ant on my workstation, so do you have a java command line to run the test(s) that generate the error?

Thanks,

JohnC

On 3/6/2013 3:16 AM, Uwe Schindler wrote:
Hi,
I think this is a VM bug and the thread dumps that Uwe produced are enough
to start tracking down the root cause.
I hope it is enough! If I can help with more details, tell me what I should do to track 
this down. Unfortunately, we have no isolated test case (like a small java class that 
triggers this bug) - you have to run the test cases of this Lucene module. It only 
happens there, not in any other Lucene test suite. It may be caused by a lot of GC 
activity in this "UIMA" module or a specific test.

On 3/6/13 8:52 AM, David Holmes wrote:
If the VM is completely unresponsive then it suggests we are at a
safepoint.
Yes, we are hanging during a stop-the-world GC, so we are at a safepoint.

The GC threads are not "hung" in os::park, they are parked - waiting
to be notified of something.
It looks like the reference processing thread is stuck in a loop where it does
wait(). So, the VM is hanging even if that stack trace also ends up in
os::park().

The thing is to find out why they are not being woken up.
Actually, in this case we should probably not even be calling wait...

Can the gdb log be posted somewhere? I don't know if the attachment
made it to the original posting on hotspot-gc but it's no longer
available on hotspot-dev.
I received the attachment with the original email. I've attached it to
the bug report that I created: 8009536. You can find it there if you
want to. But I think we have a fairly good idea of what change caused
the hang.
If it helps: Unfortunately, we had some problems with recent JDK builds, 
because javac and javadoc tools were not working correctly, failing to build 
our source code. Since b78 this was fixed. Until this was fixed, we used build 
b65 (which was the last one working) and the G1GC hangs did not appear on this 
version. So the hang must have been introduced by a change after b65 and up to b78.

Uwe

Bengt

Thanks,
David

On 6/03/2013 4:07 PM, Krystal Mok wrote:
Hi Uwe,

If you can attach gdb to it, then jstack -m and jstack -F should also
work; that'll get you the Java stack trace.
(But it probably doesn't matter in this case, because the hang is
probably a bug in the VM.)
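
For reference, the commands mentioned here could be collected as below. This is only a minimal sketch: the helper name dump_cmds and the PID are placeholders, the function prints the commands instead of running them, and the gdb invocation is one common way to get native backtraces for all threads, not necessarily how dump.txt was produced.

```shell
#!/bin/sh
# Hypothetical helper: print diagnostic commands for a hung JVM with the given PID.
dump_cmds() {
    pid="$1"
    # jstack -m: mixed mode (Java and native frames); jstack -F: force a dump
    # when the JVM does not respond to the normal request.
    printf 'jstack -m %s\n' "$pid"
    printf 'jstack -F %s\n' "$pid"
    # gdb in batch mode: attach, print a backtrace of every thread, then exit.
    printf 'gdb -p %s -batch -ex "thread apply all bt"\n' "$pid"
}

# 12345 is a placeholder; use the PID printed by the test framework.
dump_cmds 12345
```

The printed lines can be copied into a terminal one at a time, so a command that itself hangs does not block the others.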

- Kris

On Wed, Mar 6, 2013 at 5:48 AM, Uwe Schindler
<uschind...@apache.org>
wrote:
Hi,

For a few months we have been extensively testing various preview builds
of JDK 8 for compatibility with Apache Lucene and Solr, so we can
find any bugs early and prevent the problems we had with the release
of Java 7 two years ago. Currently we have a Linux (Ubuntu 64bit)
Jenkins machine that has various JDKs (JDK 6, JDK 7, JDK 8 snapshot,
IBM J9, older JRockit) installed, choosing a different one with
different hotspot and garbage collector settings on every run of the
test suite (which takes approx. 30-45 minutes).

JDK 8 b79 works very well on Linux so far. We found some strange
behavior in early versions (maybe compiler errors), but no longer at
the moment. There is one configuration that constantly and
reproducibly hangs in one module that is tested: The configuration
uses JDK 8 b79 (same for b78), 32 bit, and G1GC (server or client
does not matter). The JVM running the tests hangs and becomes unresponsive
(jstack and kill -3 have no effect or cannot connect, a standard kill
does not stop it, only kill -9 actually kills it). It can be reproduced
in this Lucene module 100% of the time (it always hangs).

I was able to connect with GDB to the JVM and get a stack trace on
all threads (see attachment, dump.txt). As you can see, all threads of
G1GC seem to hang in a syscall (os::park(), a conditional wait in the
pthread library). Unfortunately that’s all I can give you. A Java
stack trace is not possible because the JVM reacts to neither kill -3
nor jstack. With all other garbage collectors it passes the test
without hangs in a few seconds, with 32 bit G1GC it can stand still
for hours. The 64 bit JVM passes with G1GC, so only the 32 bit
variant is affected. Client or Server VM makes no difference.

To reproduce:
- Use a 32 bit JDK 8 b78 or b79 (tested on Linux 64 bit, but this
should not matter)
- Download Lucene Source code (e.g. the snapshot version we were
testing with:
https://builds.apache.org/job/Lucene-Artifacts-trunk/2212/artifact/lucene/dist/)
- change to directory lucene/analysis/uima and run:
          ant -Dargs="-server -XX:+UseG1GC" -Dtests.multiplier=3 -Dtests.jvms=1 test
After a while the test framework prints "stalled" messages (because
the child VM actually running the test no longer responds). The PID
is also printed. Try to get a stack trace or kill it, no response.
Only kill -9 helps. Choosing another garbage collector in the above
command line makes the test finish after a few seconds, e.g.
-Dargs="-server -XX:+UseConcMarkSweepGC"
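
The two ant invocations above can be sketched as a small shell helper. This is only a sketch: the function name repro_cmd is made up for illustration, and it prints the command line from this report (with the GC flag swapped in) rather than running it, so it can be checked before being executed from lucene/analysis/uima.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lucene): print the ant command line from
# this report, with the garbage collector flag as an optional argument.
repro_cmd() {
    gc_flag="$1"
    # Default to the G1GC flag that triggers the hang described above.
    if [ -z "$gc_flag" ]; then
        gc_flag="-XX:+UseG1GC"
    fi
    printf 'ant -Dargs="-server %s" -Dtests.multiplier=3 -Dtests.jvms=1 test\n' "$gc_flag"
}

# Print both variants: G1GC (reported to hang on 32 bit JDK 8 b78/b79)
# and CMS (reported to finish after a few seconds).
repro_cmd
repro_cmd "-XX:+UseConcMarkSweepGC"
```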

I posted this bug report directly to the mailing list, because with
earlier bug reports, there seems to be a problem with bugs.sun.com -
there is no response from any reviewer after several weeks - and we
were able to help find and fix javadoc and javac compiler bugs
early. So I hope you can help with this bug, too.

Uwe

-----
Uwe Schindler
uschind...@apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/



