Hi Stuart
Thank you for your review and suggestion.
Yes, since this failure mode is very hard to be reproduced. I guess it's
make sense to do some hack. And I also noticed in
ActivationLibrary.rmidRunning. It does try to look up ActivationSystem
but doesn't check if it's null. So I add the logic to make sure we will
look up the non-null ActivationSystem. Also I did some cleanup if you
don't mind. Add a waitFor(long timeout, TimeUnit unit) for JavaVM. Which
we can have a better waitFor control.
I appreciate you can review the code again.
http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.01/
Thank you
Tristan
On 01/25/2014 10:20 AM, Stuart Marks wrote:
On 1/23/14 10:34 PM, Tristan Yan wrote:
Hi All
Could you review the bug fix for JDK-8032050.
http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.00/
Description:
This rare happened failure caused because when RMID starts. It don't
guarantee
sun.rmi.server.Activation.startActivation finishes.
Fix by adding a iterative getSystem with a 5 seconds timeout.
Hi Tristan,
Adding a timing/retry loop into this test isn't the correct approach
for fixing this test.
The timing loop isn't necessary because there is already a timing loop
in RMID.start() in the RMI test library. (There's another timing loop
in ActivationLibrary.rmidRunning() which should probably be removed.)
So the intent of this library call is that it start rmid and wait for
it to become ready. That logic doesn't need to be added to the test.
In the bug report JDK-8032050 you had mentioned that the
NullPointerException was suspicious. You're right! I took a look and
it seemed like it was related to JDK-8023541, and I added a note to
this effect to the bug report. The problem here is that rmid can come
up and transiently return null instead of the stub of the activation
system. That's what JDK-8023541 covers. I think that rmid itself needs
to be fixed, though modifying the timing loop in the RMI test library
to wait for rmid to come up *and* for the lookup to return non-null is
an easy way to fix the problem. (Or at least cover it up.)
The next step in the analysis is to determine, given that
ActivationLibrary.getSystem can sometimes return null, whether this
has actually caused this test failure. This is pretty easy to
determine; just hack in a line "system = null" in the right place and
run the test. I've done this, and the test times out and the output
log is pretty much identical to the one in the bug report. (I
recommend you try this yourself.) So I think it's fairly safe to say
that the problem in JDK-8023541 has caused the failure listed in
JDK-8032050.
I can see a couple ways to proceed here. One way is just to close this
out as a duplicate of JDK-8023541 since that bug caused this failure.
Another is that this test could use some cleaning up. This bug
certainly covers a failure, but the messages emitted are confusing and
in some cases completely wrong. For example, the "rmid has shutdown"
message at line 180 is incorrect, because in this case rmid is still
running and the wait() call has timed out. Most of the code here can
be replaced with calls to various bits of the RMI test library. There
are a bunch of other things in this test that could be cleaned up as
well.
It's up to you how you'd like to proceed.
s'marks