On 1/23/14 10:34 PM, Tristan Yan wrote:
Hi All
Could you review the bug fix for JDK-8032050.

http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.00/

Description:
This rare happened failure caused because when RMID starts. It don't guarantee
sun.rmi.server.Activation.startActivation finishes.
Fix by adding a iterative getSystem with a 5 seconds timeout.

Hi Tristan,

Adding a timing/retry loop into this test isn't the correct approach for fixing this test.

The timing loop isn't necessary because there is already a timing loop in RMID.start() in the RMI test library. (There's another timing loop in ActivationLibrary.rmidRunning() which should probably be removed.) So the intent of this library call is that it start rmid and wait for it to become ready. That logic doesn't need to be added to the test.

In the bug report JDK-8032050 you had mentioned that the NullPointerException was suspicious. You're right! I took a look and it seemed like it was related to JDK-8023541, and I added a note to this effect to the bug report. The problem here is that rmid can come up and transiently return null instead of the stub of the activation system. That's what JDK-8023541 covers. I think that rmid itself needs to be fixed, though modifying the timing loop in the RMI test library to wait for rmid to come up *and* for the lookup to return non-null is an easy way to fix the problem. (Or at least cover it up.)

The next step in the analysis is to determine, given that ActivationLibrary.getSystem can sometimes return null, whether this has actually caused this test failure. This is pretty easy to determine; just hack in a line "system = null" in the right place and run the test. I've done this, and the test times out and the output log is pretty much identical to the one in the bug report. (I recommend you try this yourself.) So I think it's fairly safe to say that the problem in JDK-8023541 has caused the failure listed in JDK-8032050.

I can see a couple ways to proceed here. One way is just to close this out as a duplicate of JDK-8023541 since that bug caused this failure.

Another is that this test could use some cleaning up. This bug certainly covers a failure, but the messages emitted are confusing and in some cases completely wrong. For example, the "rmid has shutdown" message at line 180 is incorrect, because in this case rmid is still running and the wait() call has timed out. Most of the code here can be replaced with calls to various bits of the RMI test library. There are a bunch of other things in this test that could be cleaned up as well.

It's up to you how you'd like to proceed.

s'marks

Reply via email to