ShutdownGracefully.java fails intermittently

Tristan Yan Fri, 14 Feb 2014 21:29:54 -0800

Thank you Stuart
This is nice. I thought two variables were weird but I didn’t figure out the 
right solution.  Also I wasn’t so sure why we print out so many messages now 
it’s clear to me.
http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.05/
I’m sorry I have to ask you review this again.
regards
Tristan


On Feb 15, 2014, at 9:47 AM, Stuart Marks <stuart.ma...@oracle.com> wrote:

> Hi Tristan,
> 
> OK, we're getting close. Just a couple things about ShutdownGracefully.java.
> 
> It's a bit clumsy to have both a boolean and a message string to keep track 
> of the state of the test, but by itself this isn't such a big deal.
> 
> A bigger deal is the lack of exception chaining. If we catch an unexpected 
> exception at line 162, its class and message are printed out, but its stack 
> trace isn't. If this were to occur it might be fiendishly difficult to debug.
> 
> The TimeoutException isn't chained up either, but this isn't so bad, since 
> there's only one place the timeout can occur. Still, it's good to chain 
> exceptions where possible.
> 
> There's another failure case that doesn't have an exception at all, which is 
> when the registration request we're expecting to fail actually succeeds. This 
> doesn't have an exception, but we should consider creating one.
> 
> Here's an approach to getting rid of the boolean+string and also chaining up 
> exceptions. The key insight is that an exception can be created in a 
> different place from where it's thrown.
> 
> Instead of the boolean+stream, have a variable of type Exception that's 
> initialized to null. Anyplace where we catch an exception that indicates 
> failure, fill in this variable. The idea is that if this exception variable 
> is non-null, a failure has occurred. The exception being non-null replaces 
> the boolean, and the exception message replaces the failure string. In 
> addition, the exception has a stack trace, which is essential for debugging.
> 
> For the case where we don't get the expected exception (i.e., registration 
> succeeds unexpectedly), simply set the exception variable to a newly created 
> exception, but don't throw it yet. For example,
> 
>    exception = new RuntimeException(
>        "The registration request succeeded unexpectedly");
> 
> At the place where we catch an unexpected exception, wrap the caught 
> exception in a new RuntimeException with a message like "caught unexpected 
> exception". The reason to do this is so we can add an additional message. The 
> stack trace will also be a bit easier to follow.
> 
> For the timeout, just assign the TimeoutException to the exception variable.
> 
> Also, at each location where the exception variable is assigned to, print out 
> a message at that point. It will at least show the state of the test to be a 
> failure. The reason is that, if rmid.destroy() were to throw an exception 
> from the finally-block, our carefully recorded exception state will be thrown 
> away. (An alternative would be to put this into its own try-block, and add 
> any exceptions caught from it to the suppressed exception list of the 
> exception variable, but it's not clear to me that this is worth it.)
> 
> At the end of the test, if the exception variable is non-null, call 
> TestLibrary.bomb() with it to fail the test.
> 
> Finally, one typo: "prcoess".
> 
> Thanks for updating this webrev again.
> 
> s'marks
> 
> On 2/13/14 12:25 AM, Tristan Yan wrote:
>> Thank you Stuart
>> I have fixed comment in JavaVM.java. Dealing with different cases in
>> ShutdownGracefully.java, two variables were added. One is a flag indicate 
>> test
>> passed or not. Other variable keeps the error message when test failed. I put
>> TestLibrary.bomb in the bottom of the main method which only shows test fail
>> message.
>> Could you review it again
>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.04/
>> Tristan
>> 
>> On 02/13/2014 05:29 AM, Stuart Marks wrote:
>>> Hi Tristan,
>>> 
>>> JavaVM.waitFor looks mostly fine. The indentation of the start of the
>>> waitFor(timeout) javadoc comment is a bit off, though; please fix.
>>> 
>>> There are still some adjustments that need to be made in
>>> ShutdownGracefully.java. Both have to do with the case where the last
>>> registration succeeds unexpectedly -- this is the one that we expect to 
>>> fail.
>>> 
>>> First, if this registration has succeeded unexpectedly, that means rmid is
>>> still running. If that occurs, the call to rmid.waitFor(timeout) will
>>> inevitably time out. It may be worth calling rmid.destroy() directly at this
>>> point.
>>> 
>>> Second, still in the succeeded-unexpectedly case, at line 154
>>> TestLibrary.bomb() is called. This throws an exception, but it's caught by 
>>> the
>>> catch-block at lines 157-158, which calls TestLibrary.bomb() again, saying
>>> "unexpected exception". Except that this is kind of expected, since it was
>>> thrown from an earlier call to TestLibrary.bomb(). This is quite confusing.
>>> 
>>> There are several cases that need to be handled.
>>> 
>>> 1. Normal case. Registration fails as expected, rmid has terminated
>>> gracefully. Test passes.
>>> 
>>> 2. Rmid is still running and has processed the registration request
>>> successfully. Need to kill rmid and fail the test.
>>> 
>>> 3. Rmid is still running but is in some bad state where the registration
>>> request failed. Need to kill rmid and fail the test.
>>> 
>>> 4. Some other unexpected failure. This is what the catch and finally blocks 
>>> at
>>> lines 157-161 are for.
>>> 
>>> These four cases need to be handled independently. Ideally they should be
>>> separated from the cleanup code. As noted above, you don't want to throw an
>>> exception from the try-block, because it will get caught by your own catch
>>> block. Similarly, it's tempting to return from the midst of the try-block in
>>> the success case, but this still runs the finally-block. This can be quite
>>> confusing.
>>> 
>>> A typical technique for dealing with this kind of issue is to record results
>>> of operations from within the try block, and then analyze the results 
>>> outside,
>>> throwing a test failure (TestLibrary.bomb) at that point, where it won't be
>>> caught by the test's own catch block.
>>> 
>>> Editoral:
>>> - line 153, there should be a space between 'if' and the opening paren
>>> - line 156, typo, "gracefuuly"
>>> 
>>> Finally, it would be helpful if you could get webrev to generate the actual
>>> changeset instead of the plain patch, per my other review email.
>>> 
>>> Thanks,
>>> 
>>> s'marks
>>> 
>>> 
>>> On 2/11/14 9:39 PM, Tristan Yan wrote:
>>>> Thank you for your thorough mail. This is very educational. I took you 
>>>> advice
>>>> and generated a new webrev for this.
>>>> 
>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.03/
>>>> I appreciate you can review this again.
>>>> Regards
>>>> Tristan
>>>> 
>>>> 
>>>> On Feb 11, 2014, at 8:32 AM, Stuart Marks <stuart.ma...@oracle.com
>>>> <mailto:stuart.ma...@oracle.com>> wrote:
>>>> 
>>>>> Hi Tristan,
>>>>> 
>>>>> Sorry about my recurring delays.
>>>>> 
>>>>> Several comments on these changes.
>>>>> 
>>>>> JavaVM.java --
>>>>> 
>>>>> The waitFor(timeout) method is mostly ok. The new thread started at line 
>>>>> 208
>>>>> and following seems unnecessary, though. This code is reached when a 
>>>>> timeout
>>>>> occurs, so throwing TimeoutException is the only thing necessary in this 
>>>>> case.
>>>>> Should the code to start the new thread be removed?
>>>>> 
>>>>> There should be a similar check for vm == null as in the waitFor() [no 
>>>>> args]
>>>>> method.
>>>>> 
>>>>> ShutdownGracefully.java --
>>>>> 
>>>>> The condition that's checked after calling rmid.waitFor(SHUTDOWN_TIMEOUT) 
>>>>> is
>>>>> incorrect. It's testing the exit status against zero. Offhand, when and if
>>>>> rmid exits, it might exit with a nonzero exit status. If rmid has exited 
>>>>> at
>>>>> this point, then the test should succeed.
>>>>> 
>>>>> Instead of testing against zero, the code should catch TimeoutException, 
>>>>> which
>>>>> means that rmid is still running. It's probably reasonable to catch
>>>>> TimeoutException, print a message, and then let the finally-block destroy 
>>>>> the
>>>>> rmid. Calling TestLibrary.bomb() from within the try-block is confusing, 
>>>>> since
>>>>> that method throws an exception, which is then caught by the catch-block, 
>>>>> when
>>>>> then calls TestLibrary.bomb() again.
>>>>> 
>>>>> We should also make sure to test the success case properly. If 
>>>>> rmid.waitFor()
>>>>> returns in a timely fashion without throwing TimeoutException, it doesn't
>>>>> matter what the exit status is. (It might be worth printing it out.) At 
>>>>> that
>>>>> point we know that rmid *has* exited gracefully, so we need to set rmid to
>>>>> null so that the finally-block doesn't attempt to destroy rmid 
>>>>> redundantly.
>>>>> Some additional messages about rmid having exited and the test passing are
>>>>> also warranted for this case.
>>>>> 
>>>>> Some additional cleanup can be done here as well, over and above the 
>>>>> changes
>>>>> you've proposed. (This stuff is left over from earlier RMI test messes.) 
>>>>> In
>>>>> order to shut down an active object, the code here spawns a new thread 
>>>>> that
>>>>> sleeps for a while and then deactivates this object. This isn't 
>>>>> necessary. (It
>>>>> might have been necessary in the past.) It's sufficient simply to unexport
>>>>> this object and then deactivate it, directly within the shutdown() 
>>>>> method. See
>>>>> 
>>>>> test/java/rmi/activation/ActivationSystem/unregisterGroup/UnregisterGroup.java
>>>>> 
>>>>> for an example of this. In addition, the run() method can be removed, and 
>>>>> the
>>>>> "implements Runnable" declaration can also be removed from the
>>>>> ShutdownGracefully test class.
>>>>> 
>>>>> Finally, revisiting some code farther up in the test, the try-block at 
>>>>> lines
>>>>> 135-140 issues a registration request that the test expects to fail. If it
>>>>> succeeds, the message at line 139 isn't very clear; it should say that the
>>>>> registration request succeeded unexpectedly. This should cause the test to
>>>>> fail. We still probably want to go through the waitFor(timeout) path and
>>>>> eventual rmid cleanup, but a flag should be set here to ensure that the 
>>>>> test
>>>>> indeed fails if the registration succeeds unexpectedly, and the messages
>>>>> should clearly indicate that this is going on.
>>>>> 
>>>>> A good way to test this last case is to change rmid's security manager to 
>>>>> the
>>>>> normal security manager java.lang.SecurityManager instead of
>>>>> TestSecurityManager.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> s'marks
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 2/10/14 2:59 AM, Tristan Yan wrote:
>>>>>> Hi Stuart
>>>>>> Could you help to review this.
>>>>>> Thank you
>>>>>> Tristan
>>>>>> 
>>>>>> On Jan 31, 2014, at 4:36 PM, Tristan Yan <tristan....@oracle.com
>>>>>> <mailto:tristan....@oracle.com>
>>>>>> <mailto:tristan....@oracle.com>> wrote:
>>>>>> 
>>>>>>> Thank you for fixing JDK-8023541. Then I leave ActivationLibrary.java 
>>>>>>> for
>>>>>>> now.
>>>>>>> I still did some clean up following your suggestion.
>>>>>>> 1. I changed waitFor(long timeout) method, this method is going to use 
>>>>>>> code
>>>>>>> like Process.waitFor(timeout, unit). This can be backported to JDK7. 
>>>>>>> Also
>>>>>>> exitValue is kept as a return value. For making sure there is no Pipe 
>>>>>>> leak, a
>>>>>>> cleanup thread will start when timeout happens.
>>>>>>> 2. Change in ShutdownGracefully is a little tricky. I think we should 
>>>>>>> just
>>>>>>> destroy JVM once exception is thrown. So I move the wait logic into try 
>>>>>>> block
>>>>>>> instead keep them in finally block.
>>>>>>> Can you receive it again.
>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.02/
>>>>>>> Thank you
>>>>>>> Tristan
>>>>>>> 
>>>>>>> On 01/29/2014 03:16 PM, Stuart Marks wrote:
>>>>>>>> Hi Tristan,
>>>>>>>> 
>>>>>>>> I don't want to put the workaround into ActivationLibrary.rmidRunning()
>>>>>>>> for a
>>>>>>>> null return from the lookup, because this is only a workaround for an 
>>>>>>>> actual
>>>>>>>> bug in rmid initialization. See the review I just posted for 
>>>>>>>> JDK-8023541.
>>>>>>>> 
>>>>>>>> Adding JavaVM.waitFor(timeout) is something that would be useful in 
>>>>>>>> general,
>>>>>>>> but it needs to be handled carefully. It uses the new
>>>>>>>> Process.waitFor(timeout, unit) which is new in Java SE 8; this makes
>>>>>>>> backporting to 7 more complicated. Not clear whether we'll do so, but I
>>>>>>>> don't
>>>>>>>> want to forclose the opportunity without discussion. It's also not 
>>>>>>>> clear how
>>>>>>>> one can get the vm's exit status after JavaVM.waitFor() has returned 
>>>>>>>> true.
>>>>>>>> With the Process API it's possible simply to call waitFor() or 
>>>>>>>> exitValue().
>>>>>>>> With JavaVM, a new API needs to be created, or the rule has to be
>>>>>>>> established
>>>>>>>> that one must call JavaVM.waitFor() to collect the exit status as well 
>>>>>>>> as to
>>>>>>>> close the pipes from the subprocess. If JavaVM.waitFor(timeout, unit) 
>>>>>>>> is
>>>>>>>> called without subsequently calling JavaVM.waitFor(), the pipes are 
>>>>>>>> leaked.
>>>>>>>> 
>>>>>>>> In ShutdownGracefully.java, the finally-block needs to check to see if 
>>>>>>>> rmid
>>>>>>>> is still running, and if it is, to shut it down. Simply calling
>>>>>>>> waitFor(timeout, unit) isn't sufficient, because if the rmid process is
>>>>>>>> still
>>>>>>>> running, it will be left running.
>>>>>>>> 
>>>>>>>> The straightforward approach would be to call
>>>>>>>> ActivationLibrary.rmidRunning()
>>>>>>>> to test if it's still running. Unfortunately this isn't quite right, 
>>>>>>>> because
>>>>>>>> rmidRunning() has a timeout loop in it -- which should probably be 
>>>>>>>> removed.
>>>>>>>> (I think there's a bug for this.) Another approach would be simply to 
>>>>>>>> call
>>>>>>>> rmid.destroy(). This calls rmid's shutdown() method first, which is
>>>>>>>> reasonable, but I'm not sure it kills the process if that fails. In any
>>>>>>>> case,
>>>>>>>> this already has a timeout loop waiting for the process to die, so
>>>>>>>> ShutdownGracefully.java needn't use a new waitFor(timeout, unit) call.
>>>>>>>> 
>>>>>>>> Removing the commented-out code that starts with "no longer needed" is 
>>>>>>>> good,
>>>>>>>> and removing the ShutdownDetectThread is also good, since that's
>>>>>>>> unnecessary.
>>>>>>>> 
>>>>>>>> There are some more cleanups I have in mind here but I'd like to see a
>>>>>>>> revised webrev before proceeding.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> s'marks
>>>>>>>> 
>>>>>>>> On 1/25/14 8:57 PM, Tristan Yan wrote:
>>>>>>>>> Hi Stuart
>>>>>>>>> Thank you for your review and suggestion.
>>>>>>>>> Yes, since this failure mode is very hard to be reproduced. I guess 
>>>>>>>>> it's
>>>>>>>>> make sense  to do some hack. And I also noticed in
>>>>>>>>> ActivationLibrary.rmidRunning. It does try to look up 
>>>>>>>>> ActivationSystem but
>>>>>>>>> doesn't check if it's null. So I add the logic to make sure we will 
>>>>>>>>> look up
>>>>>>>>> the non-null ActivationSystem. Also I did some cleanup if you don't 
>>>>>>>>> mind.
>>>>>>>>> Add a waitFor(long timeout, TimeUnit unit) for JavaVM. Which we can 
>>>>>>>>> have a
>>>>>>>>> better waitFor control.
>>>>>>>>> I appreciate you can review the code again.
>>>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.01/
>>>>>>>>> Thank you
>>>>>>>>> Tristan
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 01/25/2014 10:20 AM, Stuart Marks wrote:
>>>>>>>>>> On 1/23/14 10:34 PM, Tristan Yan wrote:
>>>>>>>>>>> Hi All
>>>>>>>>>>> Could you review the bug fix for JDK-8032050.
>>>>>>>>>>> 
>>>>>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.00/
>>>>>>>>>>> 
>>>>>>>>>>> Description:
>>>>>>>>>>> This rare happened failure caused because when RMID starts. It don't
>>>>>>>>>>> guarantee
>>>>>>>>>>> sun.rmi.server.Activation.startActivation finishes.
>>>>>>>>>>> Fix by adding a iterative getSystem with a 5 seconds timeout.
>>>>>>>>>> 
>>>>>>>>>> Hi Tristan,
>>>>>>>>>> 
>>>>>>>>>> Adding a timing/retry loop into this test isn't the correct approach 
>>>>>>>>>> for
>>>>>>>>>> fixing this test.
>>>>>>>>>> 
>>>>>>>>>> The timing loop isn't necessary because there is already a timing 
>>>>>>>>>> loop in
>>>>>>>>>> RMID.start() in the RMI test library. (There's another timing loop in
>>>>>>>>>> ActivationLibrary.rmidRunning() which should probably be removed.) 
>>>>>>>>>> So the
>>>>>>>>>> intent of this library call is that it start rmid and wait for it to
>>>>>>>>>> become
>>>>>>>>>> ready. That logic doesn't need to be added to the test.
>>>>>>>>>> 
>>>>>>>>>> In the bug report JDK-8032050 you had mentioned that the
>>>>>>>>>> NullPointerException was suspicious. You're right! I took a look and 
>>>>>>>>>> it
>>>>>>>>>> seemed like it was related to JDK-8023541, and I added a note to this
>>>>>>>>>> effect to the bug report. The problem here is that rmid can come up 
>>>>>>>>>> and
>>>>>>>>>> transiently return null instead of the stub of the activation system.
>>>>>>>>>> That's what JDK-8023541 covers. I think that rmid itself needs to be
>>>>>>>>>> fixed,
>>>>>>>>>> though modifying the timing loop in the RMI test library to wait for 
>>>>>>>>>> rmid
>>>>>>>>>> to come up *and* for the lookup to return non-null is an easy way to 
>>>>>>>>>> fix
>>>>>>>>>> the problem. (Or at least cover it up.)
>>>>>>>>>> 
>>>>>>>>>> The next step in the analysis is to determine, given that
>>>>>>>>>> ActivationLibrary.getSystem can sometimes return null, whether this 
>>>>>>>>>> has
>>>>>>>>>> actually caused this test failure. This is pretty easy to determine; 
>>>>>>>>>> just
>>>>>>>>>> hack in a line "system = null" in the right place and run the test. 
>>>>>>>>>> I've
>>>>>>>>>> done this, and the test times out and the output log is pretty much
>>>>>>>>>> identical to the one in the bug report. (I recommend you try this
>>>>>>>>>> yourself.) So I think it's fairly safe to say that the problem in
>>>>>>>>>> JDK-8023541 has caused the failure listed in JDK-8032050.
>>>>>>>>>> 
>>>>>>>>>> I can see a couple ways to proceed here. One way is just to close 
>>>>>>>>>> this out
>>>>>>>>>> as a duplicate of JDK-8023541 since that bug caused this failure.
>>>>>>>>>> 
>>>>>>>>>> Another is that this test could use some cleaning up. This bug 
>>>>>>>>>> certainly
>>>>>>>>>> covers a failure, but the messages emitted are confusing and in some 
>>>>>>>>>> cases
>>>>>>>>>> completely wrong. For example, the "rmid has shutdown" message at 
>>>>>>>>>> line 180
>>>>>>>>>> is incorrect, because in this case rmid is still running and the 
>>>>>>>>>> wait()
>>>>>>>>>> call has timed out. Most of the code here can be replaced with calls 
>>>>>>>>>> to
>>>>>>>>>> various bits of the RMI test library. There are a bunch of other 
>>>>>>>>>> things in
>>>>>>>>>> this test that could be cleaned up as well.
>>>>>>>>>> 
>>>>>>>>>> It's up to you how you'd like to proceed.
>>>>>>>>>> 
>>>>>>>>>> s'marks
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>

Re: RFR: JDK-8032050: TEST_BUG: java/rmi/activation/Activatable/shutdownGracefully/ShutdownGracefully.java fails intermittently

Reply via email to