Re: RFR for JDK-6772009 Intermittent test failure: java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java test failed with 'Completed != 2'
Hi Martin,

Apologies for the delay; I was trying to get help hosting my webrev. Please see my replies inline.

On 11/19/13, 10:35 PM, Martin Buchholz wrote:
> Hi Kalyan,
>
> None of us can review your changes yet because you haven't given us a URL of your webrev.

It is located here: http://cr.openjdk.java.net/~cl/host_for_srikalyan_6772009_CancelledLockLoops/

> I've tried to make the jsr166 copy of CancelledLockLoops fail by adjusting ITERS and TIMEOUT radically up and down, but the test just keeps on passing for me. Hints appreciated.

Bump up the timeout to 500 ms and you will see a failure. I can reproduce it consistently on my machine (Linux 64-bit, 8 GB RAM, dual core) with the latest JDK 1.8 (any promoted build).

> On Tue, Nov 19, 2013 at 6:39 PM, srikalyan chandrashekar srikalyan.chandrashe...@oracle.com wrote:
>> Suggested Fix:
>> a) Decrease the timeout from 100 to 50ms which will ensure that the test will pass even on faster machines
>
> This doesn't look like a permanent fix - it just makes the failing case rarer.

That's true. The other way is to make the main thread wait on TIMEOUT after firing the interrupts instead of the other way round, but that would be over-optimization, which is probably not desirable either. The 50 ms value was arrived at empirically: several hundred runs on multiple configurations did not produce a failure.

--
Thanks
kalyan
Ph: (408)-585-8040
Re: RFR for JDK-6772009 Intermittent test failure: java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java test failed with 'Completed != 2'
Hi David,

The webrev is hosted here: http://cr.openjdk.java.net/~cl/host_for_srikalyan_6772009_CancelledLockLoops/ .

--
Thanks
kalyan
Ph: (408)-585-8040

On 11/19/13, 11:03 AM, David Holmes wrote:
> Hi,
>
> Attachments are stripped. Please post on cr.openjdk.java.net (get a colleague to host this if you don't have an account yet.)
>
> Thanks,
> David
>
> On 19/11/2013 4:12 PM, srikalyan chandrashekar wrote:
>> Hi all,
>>
>> I am working on bug JDK-6772009 https://bugs.openjdk.java.net/browse/JDK-6772009 .
>>
>> Root Cause:
>> The timeout value gives too much of a grace period on faster machines, where the threads that are TO BE INTERRUPTED complete before they can be interrupted.
>>
>> Suggested Fix:
>> a) Decrease the timeout from 100 to 50 ms, which ensures that the test passes even on faster machines and that the threads are canceled while still running. There is in any case a barrier to ensure the test threads all complete gracefully.
>>
>> Miscellaneous fixes:
>> b) Convert result from int to long (possible integer overflow otherwise).
>> c) Catch BrokenBarrierException in a more granular fashion in ReentrantLockLoop to update and print the results (which would be missed otherwise).
>>
>> Add more diagnostics:
>> d) Assign names to threads.
>> e) Print which threads update the 'completed' variable.
>>
>> I have attached the webrev for convenience, as the changes are interleaved and are best viewed as an sdiff. Please let me know if you have any comments or suggestions.
>>
>> Thank you
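The root cause described in this thread is a race between the worker loops and the main thread's interrupt: if a worker finishes its iterations before the timeout expires, the interrupt arrives too late to cancel it. The following is a minimal Java sketch of that race under stated assumptions, not the CancelledLockLoops code itself. The class name InterruptRaceSketch, the method interruptedWhileBusy and its parameters are hypothetical, and a single worker stands in for the test's group of threads.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the race; not taken from the real test.
final class InterruptRaceSketch {

    /**
     * Starts one worker that acquires and releases a lock 'iters' times,
     * interrupts it after 'timeoutMillis', and reports whether the worker
     * was still looping (and therefore saw the interrupt).
     */
    static boolean interruptedWhileBusy(int iters, long timeoutMillis) {
        ReentrantLock lock = new ReentrantLock();
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            try {
                for (int i = 0; i < iters; i++) {
                    lock.lockInterruptibly(); // throws if the thread is interrupted
                    lock.unlock();
                }
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
            }
        }, "worker");
        worker.start();
        try {
            Thread.sleep(timeoutMillis); // the grace period discussed in the thread
            worker.interrupt();          // has no effect if the worker already finished
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sawInterrupt.get();
    }

    public static void main(String[] args) {
        // A short loop is likely to finish before the interrupt arrives.
        System.out.println("short loop interrupted: " + interruptedWhileBusy(1_000, 100));
        // A very long loop is still running when the interrupt arrives.
        System.out.println("long loop interrupted:  " + interruptedWhileBusy(Integer.MAX_VALUE, 100));
    }
}
```

In this model, either shortening the timeout or raising the iteration count moves the worker back into the "still busy" case, which is the trade-off the reviewers discuss.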
Re: RFR for JDK-6772009 Intermittent test failure: java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java test failed with 'Completed != 2'
Hi David,

Thanks for the review. The new webrev is hosted at http://cr.openjdk.java.net/~cl/host_for_kal/6772009-CancelledLockLoop/ . Please see my replies inline.

On 11/20/13, 6:35 PM, David Holmes wrote:
> On 21/11/2013 10:28 AM, Martin Buchholz wrote:
>> I again tried and failed to reproduce a failure. Even if I go whole hog and multiply TIMEOUT by 100 and divide ITERS by 100, the test continues to PASS. Is it just me?!
>
> I think you are going the wrong way Martin - you want the timeout to be smaller than the time it takes to execute ITERS.
>
>> I don't think there's any reason to make result long. It's not even used except to inhibit hotspot optimizations.
>>
>> +private volatile long result = 17;//Can get int overflow,so using long
>
> Further the subsequent use of += is incorrect as it is not an atomic operation. Even if we don't care about the value, it looks bad.

Made the necessary changes for an atomic update.

>> I'm not sure result must be updated if we get a BrokenBarrierException either. Probably harmless, but necessary?

I retained it in the fix for completeness in updating the numbers; please let me know if you still think otherwise.

>> need to fix spelling and spacing below.
>>
>> +barrier.await();//If a BrokeBarrierException happens here(due to
>
> There are a number of style issues with spacing around comments.

Fixed the spelling error and the style issues.

> And I don't think this change is sufficient to claim co-author status with Doug either ;-)

Removed the claim :)

> The additional tracing may be useful but seems stylistically different from the rest of the code.

I retained the tracing so that, in the event of a failure, we can tell whether timing is again the cause. However, I can remove it if you think it is not necessary, or adopt an alternative if you want to suggest one.

> Overall I'm suspicious that the changed timeout actually fixes anything - normally we need to add longer timeouts not shorter ones. Does this fail on a range of machines or only specific ones? Have we verified that the clocks/timers are behaving properly on those systems?

Here the timeout is not about waiting for threads to complete something; the threads have to be interrupted before they are considered done, which is why we decreased the timeout. However, instead of decreasing the timeout as before, we have now chosen to increase the number of iterations from 100 to 500 (thanks to Tristan for the suggestion). More iterations keep the threads busy for longer, so the timeout no longer needs to be touched.

> Thanks,
> David

--
Thanks
kalyan
Ph: (408)-585-8040
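The review comment that += on a volatile field is not atomic can be illustrated with a small Java sketch. The webrev's actual change is not shown in this thread, so the use of AtomicInteger here is an assumption, and the class name AtomicResultSketch and the method sumWithAtomic are hypothetical. The starting value 17 mirrors the quoted declaration of result.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not the change made in the webrev.
final class AtomicResultSketch {

    /**
     * Lets 'threads' workers each add 1 to a shared counter 'addsPerThread'
     * times. A volatile int updated with += could lose updates here, because
     * += is a separate read, add and write. incrementAndGet() is one atomic step.
     */
    static int sumWithAtomic(int threads, int addsPerThread) {
        AtomicInteger result = new AtomicInteger(17);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < addsPerThread; i++) {
                    result.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return result.get();
    }

    public static void main(String[] args) {
        // 17 + 4 * 100000 updates, none lost
        System.out.println(sumWithAtomic(4, 100_000)); // prints 400017
    }
}
```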
Re: RFR for JDK-6772009 Intermittent test failure: java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java test failed with 'Completed != 2'
Hi David/Martin,

A gentle reminder for review.

--
Thanks
kalyan
Ph: (408)-585-8040

On 12/2/13, 11:21 AM, srikalyan wrote:
> Hi David, Thanks for the review, the new webrev is hosted at http://cr.openjdk.java.net/~cl/host_for_kal/6772009-CancelledLockLoop/ . Please see inline text.
> [...]
Re: RFR for JDK-6963118 Intermittent test failure: test/java/nio/channels/Selector/Wakeup.java fail intermittently (win)
Hi all,

A gentle reminder for review.

--
Thanks
kalyan
Ph: (408)-585-8040

On 12/2/13, 6:39 PM, srikalyan chandrashekar wrote:
> Hi all,
>
> I am working on bug JDK-6963118 https://bugs.openjdk.java.net/browse/JDK-6963118 .
>
> Root Cause:
> - A sensitive timing dependency between events in the Main and Sleeper threads causes the test failure.
>
> Suggested Fix:
> 1) The main thread should wait for more than 1 sec (made it 3 sec) and check more often than at 50 ms intervals (made it 1 ms). The sleeper thread may still be waiting for the interrupt/wakeup, so flagging a failure after the main thread has waited just 1 sec is premature. This matters especially on Windows, where high-priority processes such as virus scanners run and keep the system busy (we saw this when simulating failures).
> 2) The test is essentially a sequence of 2 events: a) firing wakeups/interrupts, followed by b) a check. Check the sleeper.entries value and yield the main thread as required so that these 2 events proceed in tandem.
>
> The webrev is hosted at http://cr.openjdk.java.net/~cl/host_for_kal/6963118-Wakeup/ .
>
> Please let me know if you have any comments or suggestions.
>
> --
> Thanks
> kalyan
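The first suggested fix (wait longer overall, but poll more often) can be sketched as a deadline-based polling loop. This is a minimal Java sketch under stated assumptions, not the code in the Wakeup webrev: the class name PollingWaitSketch, the helper awaitCondition and the counter named entries are hypothetical, with entries loosely modelled on the sleeper.entries value mentioned in the message.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper; not taken from the Wakeup test.
final class PollingWaitSketch {

    /**
     * Polls 'condition' every 'pollMillis' until it holds or until
     * 'timeoutMillis' has elapsed. Returns true if the condition was observed.
     */
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up only after the full grace period
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        AtomicInteger entries = new AtomicInteger();
        Thread sleeper = new Thread(entries::incrementAndGet, "sleeper");
        sleeper.start();
        // 3 s overall budget with 1 ms polling, the values named in the message
        boolean seen = awaitCondition(() -> entries.get() > 0, 3_000, 1);
        System.out.println("sleeper entry observed: " + seen);
    }
}
```

A long budget with a short poll interval lets the check pass quickly on an idle machine and still tolerate a busy one, because the failure is only flagged once the whole budget is spent.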
Re: RFR for JDK-6772009 Intermittent test failure: java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java test failed with 'Completed != 2'
Hi David,

I retained only the changes to ITERS and ProblemList.txt, plus the upstream changes by Doug Lea (as pointed out by Martin). Could you please review the new change, available here: http://cr.openjdk.java.net/~srikchan/Regression/6772009-CancelledLockLoop-webrev/ .

--
Thanks
kalyan
Ph: (408)-585-8040

On 12/19/13, 8:10 PM, David Holmes wrote:
> Sorry Kalyan but I don't see the need for all the incidental changes if the primary change is to just increase the iterations. I also don't see why you need to do anything for BrokenBarrierException as it is not expected to happen and the test should just fail if it does.
>
> David
>
> On 10/12/2013 6:15 AM, srikalyan wrote:
>> Hi David/Martin a gentle reminder for review.
>> [...]
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi Mandy,

Yes, I ran with jtreg to simulate the failure. I will try the UEH patch to see if it sheds some light and get back to you. Thanks for the direction :)

--
Thanks
kalyan
Ph: (408)-585-8040

On 12/19/13, 8:33 PM, Mandy Chung wrote:
> Hi Srikalyan,
>
> Maybe you can add an uncaught handler to see if you can get any information. I ran it 1000 times but was not able to duplicate the failure. Did you run it with jtreg (I didn't)?
>
> Below is the patch to install a thread's uncaught handler that you can take and try.
>
> diff --git a/test/java/lang/ref/OOMEInReferenceHandler.java b/test/java/lang/ref/OOMEInReferenceHandler.java
> --- a/test/java/lang/ref/OOMEInReferenceHandler.java
> +++ b/test/java/lang/ref/OOMEInReferenceHandler.java
> @@ -51,6 +51,14 @@
>          return first;
>      }
>
> +    static class UEH implements Thread.UncaughtExceptionHandler {
> +        public void uncaughtException(Thread t, Throwable e) {
> +            System.err.println("ERROR: " + t.getName() + " exception " +
> +                e.getMessage());
> +            e.printStackTrace();
> +        }
> +    }
> +
>      public static void main(String[] args) throws Exception {
>          // preinitialize the InterruptedException class so that the reference handler
>          // does not die due to OOME when loading the class if it is the first use
> @@ -77,6 +85,8 @@
>              throw new IllegalStateException("Couldn't find Reference Handler thread.");
>          }
>
> +        referenceHandlerThread.setUncaughtExceptionHandler(new UEH());
> +
>          ReferenceQueue<Object> refQueue = new ReferenceQueue<>();
>          Object referent = new Object();
>          WeakReference<Object> weakRef = new WeakReference<>(referent, refQueue);
>
> On 12/19/2013 6:57 PM, srikalyan chandrashekar wrote:
>> Hi David,
>>
>> Thanks for your comments. The unguarded part (clean and enqueue) in the Reference Handler thread does not seem to create any new objects, so it is the application (the test in this case) that is adding objects to the heap and causing the Reference Handler to die with an OOME. I am still unsure about the side effects of the code change and agree with your thoughts (on the reliability of memory exhaustion tests).
>>
>> PS: hotspot dev alias removed from CC.
>>
>> --
>> Thanks
>> kalyan
>>
>> On 12/19/13 5:08 PM, David Holmes wrote:
>>> Hi Kalyan,
>>>
>>> This is not a hotspot issue so I'm moving this to core-libs, please drop hotspot from any replies.
>>>
>>> On 20/12/2013 6:26 AM, srikalyan wrote:
>>>> Hi all,
>>>>
>>>> I have been working on the bug JDK-8022321 https://bugs.openjdk.java.net/browse/JDK-8022321 . This is a sporadic failure, and the webrev is available here: http://cr.openjdk.java.net/~srikchan/Regression/JDK-8022321_OOMEInReferenceHandler-webrev/
>>>
>>> I'm really not sure what to make of this. We have a test that triggers an out-of-memory condition but the OOME can actually turn up in the ReferenceHandler thread causing it to terminate and the test to fail. We previously accounted for the non-obvious occurrences of OOME due to the Object.wait and the possible need to load the InterruptedException class - but still the OOME can appear where we don't want it.
>>>
>>> So finally you have just placed the whole for(;;) loop in a try/catch(OOME) that ignores the OOME. I'm certain that makes the test happy, but I'm not sure it is really what we want for the ReferenceHandler thread. If the OOME occurs while cleaning, or enqueuing then we will fail to clean and/or enqueue but there would be no indication that has occurred and I think that is a bigger problem than this test failing.
>>>
>>> There may be no way to make this test 100% reliable. In fact I'd suggest that no memory exhaustion test can be 100% reliable.
>>>
>>> David
>>>
>>>> *Root Cause: Still not known*
>>>>
>>>> There are 2 places where an OOME is possible:
>>>> 1) Cleaner.clean()
>>>> 2) ReferenceQueue.enqueue()
>>>>
>>>> 1) The cleanup code in turn has 2 places with the potential to throw an OOME:
>>>> a) The thunk Runnable, which is run from the clean() method. This Runnable is passed to Cleaner and appears in the following classes:
>>>>    java/nio/DirectByteBuffer.java
>>>>    sun/misc/Perf.java
>>>>    sun/nio/fs/NativeBuffer.java
>>>>    sun/nio/ch/IOVecWrapper.java
>>>>    sun/misc/Cleaner/ExitOnThrow.java
>>>> However, none of the above overridden implementations ever creates an object in the clean() code.
>>>> b) The new PrivilegedAction created in the try/catch Exception block of the clean() method. For this object to be created and held responsible for the OOME, an Exception (other than OOME) has to be thrown first.
>>>>
>>>> 2) No new heap objects are created in the enqueue method nor anywhere in the deep call stack (VM.addFinalRefCount() etc.), so this cannot be a potential cause.
>>>>
>>>> *Experimental change to java.lang.Reference.java*:
>>>> - Put one more guard (try/catch with an OOME block) in the Reference Handler thread, which may give the Reference Handler a chance to clean up. This fixes the test failure (several 1000 runs with 0 failures).
>>>> - Without the above change the test fails at least 3-5 times
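The experimental guard and David's objection to it can both be seen in a small Java sketch. It is not the change to java.lang.Reference from the webrev: the class name GuardedLoopSketch, the method runGuarded and the Outcome holder are hypothetical, the guard is assumed to wrap each iteration, and the OutOfMemoryError is simulated by throwing it directly rather than by exhausting the heap.

```java
// Hypothetical model of a guarded handler loop; not the Reference.java change.
final class GuardedLoopSketch {

    /** Counters for the run; 'swallowed' is what a silent catch block hides. */
    static final class Outcome {
        final int completed;
        final int swallowed;

        Outcome(int completed, int swallowed) {
            this.completed = completed;
            this.swallowed = swallowed;
        }
    }

    /**
     * Runs every task in order. An OutOfMemoryError from one task is caught
     * so that the loop, like a guarded handler thread, survives to run the
     * remaining tasks. The failed task's work is simply lost.
     */
    static Outcome runGuarded(Runnable[] tasks) {
        int completed = 0;
        int swallowed = 0;
        for (Runnable task : tasks) {
            try {
                task.run();
                completed++;
            } catch (OutOfMemoryError oome) {
                swallowed++; // counted here only to make the silent loss visible
            }
        }
        return new Outcome(completed, swallowed);
    }

    public static void main(String[] args) {
        Runnable ok = () -> { };
        Runnable failing = () -> { throw new OutOfMemoryError("simulated"); };
        Outcome o = runGuarded(new Runnable[] { ok, failing, ok });
        System.out.println("completed=" + o.completed + " swallowed=" + o.swallowed);
        // prints "completed=2 swallowed=1"
    }
}
```

The loop keeps running, which is what makes the test pass, but the second task's work never happens. Without the extra counter nothing would report that, which is the concern raised about missed clean or enqueue operations.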
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
On 1/11/14, 6:15 AM, Peter Levart wrote:
> On 01/10/2014 10:51 PM, srikalyan chandrashekar wrote:
>> Hi Peter the version you provided ran indefinitely(i put a 10 minute timeout) and the program got interrupted(no error),
>
> Did you run it with or without fastdebug -XX:+TraceExceptions ? If with, it might be that fastdebug and/or -XX:+TraceExceptions changes the execution a bit so that we can no longer reproduce the wrong behaviour.

With fastdebug and -XX:+TraceExceptions. I will try the other combinations of options (i.e. without -XX:+TraceExceptions on a debug build, etc.) soon.

>> even if there were to be an error you cannot print the string of thread to console(these have been attempted earlier).
>
> ...it has been attempted to print toString in uncaught exception handler. At that time, the heap is still full. I'm printing it after the GC has cleared the heap. You can try that it works by commenting out the try { and corresponding } catch (OOME x) {} exception handler...

Since there is a GC call prior to printing the string, I will give that a shot with a non-debug build.

>> - The test's running on interpreter mode, what i am watching for is one error with trace. Without fastdebug build and -XX:+TraceExceptions i am able to reproduce failure atleast 5 failures out of 1000 runs but with fastdebug+Trace no luck yet(already past few 1000 runs).
>
> It might be interesting to try with fastdebug build but without the -XX:+TraceExceptions option to see what has an effect on it. It might also be interesting to try the modified ReferenceHandler (the one with private runImpl() method called from run()) and with normal non-fastdebug JDK. This info might be useful when one starts to inspect the exception handling code in interpreter...
>
> Regards, Peter

--
Thanks
kalyan
Ph: (408)-585-8040

>> ---
>> Thanks
>> kalyan
>>
>> On 01/10/2014 02:57 AM, Peter Levart wrote:
>>> On 01/10/2014 09:31 AM, Peter Levart wrote:
>>>> Since we suspect there's something wrong with exception handling in interpreter, I devised a hypothetical reproducer that tries to simulate ReferenceHandler in many aspects, but doesn't require to be a ReferenceHandler:
>>>>
>>>> http://cr.openjdk.java.net/~plevart/misc/OOME/OOMECatchingTest.java
>>>>
>>>> This is designed to run indefinitely and only terminate if/when thread dies. Could you run this program in the environment that causes the OOMEInReferenceHandler test to fail and see if it terminates?
>>>
>>> I forgot to mention that in order for this long-running program to exhibit interpreter behaviour, it should be run with -Xint option. So I suggest:
>>>
>>> -Xmx24M -XX:-UseTLAB -Xint
>>>
>>> Regards, Peter
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi David,

On 1/15/14, 9:04 PM, David Holmes wrote:
> On 16/01/2014 10:19 AM, srikalyan chandrashekar wrote:
>> Hi Peter/David, we could finally get a trace of exception with fastdebug build and ReferenceHandler modified (with runImpl() added and called from run()). The logs, disassembled code is available in JIRA https://bugs.openjdk.java.net/browse/JDK-8022321 as attachments.
>
> All I can see is the log for the OOMECatchingTest program not one for the actual ReferenceHandler ??

Please search for ReferenceHandler in the log.

>> Observations from the log:
>>
>> Root Cause:
>> 1) UncaughtException is being dispatched from Reference.java:143
>>
>> 141    Reference<Object> r;
>> 142    synchronized (lock) {
>> 143        if (pending != null) {
>> 144            r = pending;
>> 145            pending = r.discovered;
>> 146            r.discovered = null;
>>
>> pending field in Reference is touched and updated by the collector, so at line 143 when the execution context is in Reference handler there might have been an Exception pending due to allocation done by collector which causes ReferenceHandler thread to die.
>
> Sorry but the GC does not trigger asynchronous exceptions so this explanation does not make any sense to me. What part of the log led you to this conclusion?

-- Log Excerpt begins --
Exception a 'java/lang/OutOfMemoryError' (0xff7808e8) thrown [/home/srikalyc/work/ora2013/infracleanup/jdk8/hotspot/src/share/vm/gc_interface/collectedHeap.inline.hpp, line 168] for thread 0x7feed80cf800
Exception a 'java/lang/OutOfMemoryError' (0xff7808e8) thrown in interpreter method {method} {0x7feeddd3c600} 'runImpl' '()V' in 'java/lang/ref/Reference$ReferenceHandler' at bci 65 for thread 0x7feed80cf800
Exception a 'java/lang/OutOfMemoryError' (0xff7808e8) thrown in interpreter method {method} {0x7feeddd3c478} 'run' '()V' in 'java/lang/ref/Reference$ReferenceHandler' at bci 1 for thread 0x7feed80cf800
Exception a 'java/lang/OutOfMemoryError' (0xff780868) thrown [/home/srikalyc/work/ora2013/infracleanup/jdk8/hotspot/src/share/vm/gc_interface/collectedHeap.inline.hpp, line 157] for thread 0x7feed80cf800
Exception a 'java/lang/OutOfMemoryError' (0xff780868) thrown in interpreter method {method} {0x7feeddcaaf90} 'uncaughtException' '(Ljava/lang/Thread;Ljava/lang/Throwable;)V' in ' at bci 48 for thread 0x7feed80cf800
Exception a 'java/lang/OutOfMemoryError' (0xff780868) thrown in interpreter method {method} {0x7feeddca7298} 'dispatchUncaughtException' '(Ljava/lang/Throwable;)V' in 'java/lang/ at bci 6 for thread 0x7feed80cf800
-- Log Excerpt ends --

Sorry if my understanding is wrong.

>> Suggested fix:
>> - As proposed earlier putting an outer guard(try-catch on OOME) in the ReferenceHandler will fix the issue, if ReferenceHandler is considered as part of the GC sub system then it should be alive even in the midst of an OOME so i feel that the additional guard should be allowed, however i might still be ignorant of vital implications.
>> - Apart from the above changes, Peter's suggestion to create and call a private runImpl() from run() in ReferenceHandler makes sense to me.
>
> Why would we need this?
>
> David
>
>> ---
>> Thanks
>> kalyan
>>
>> On 01/13/2014 03:57 PM, srikalyan wrote:
>>> On 1/11/14, 6:15 AM, Peter Levart wrote:
>>>> On 01/10/2014 10:51 PM, srikalyan chandrashekar wrote:
>>>>> Hi Peter the version you provided ran indefinitely(i put a 10 minute timeout) and the program got interrupted(no error),
>>>>> [...]
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi Peter/David,

Catching up after the long weekend. Why would there be an OOME in the object heap due to class loading in perm gen space? Please correct me if I am missing something here. Meanwhile I will give the version of Reference Handler you both agreed on a try.

--
Thanks
kalyan
Ph: (408)-585-8040

On 1/21/14, 7:24 AM, Peter Levart wrote:
> On 01/21/2014 07:54 AM, Peter Levart wrote:
> *[Loaded sun.misc.Cleaner from /home/peter/Apps64/jdk1.8.0-ea-b121/jre/lib/rt.jar]*
> [Loaded java.io.ByteArrayInputStream from /home/peter/Apps64/jdk1.8.0-ea-b121/jre/lib/rt.jar]
> [Loaded sun.util.calendar.ZoneInfoFile$ZoneOffsetTransitionRule from /home/peter/Apps64/jdk1.8.0-ea-b121/jre/lib/rt.jar]
> ...
>
> I'm on linux, 64bit and using official EA build 121 of JDK 8...
>
> But if I try with JDK 7u45, I don't see it.
>
> So what changed between JDK 7 and JDK 8?
>
> I suspect the following:
>
> 8007572: Replace existing jdk timezone data at java.home/lib/zi with JSR310's tzdb
>
> Regards, Peter
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi Peter,

I have modified your code from

r = pending; if (r != null) { ..

to

if (pending != null) { r = pending;

This is because r is used later in the code and must not be assigned pending unless pending is non-null (this is how it was earlier). The new webrev is posted here: http://cr.openjdk.java.net/~srikchan/Regression/JDK-8022321_OOMEInReferenceHandler-webrev-V2/ . I ran 1000 runs with no failures so far; however, I would like to run a couple more batches of 1000 runs to confirm the fix.

PS: The description section of JEP-122 (http://openjdk.java.net/jeps/122) says meta-data would be in native memory (not heap).

--
Thanks
kalyan
Ph: (408)-585-8040

On 1/21/14, 2:31 PM, Peter Levart wrote:
> On 01/21/2014 07:17 PM, srikalyan wrote:
>> Hi Peter/David, catching up after long weekend. Why would there be an OOME in object heap due to class loading in perm gen space ?
>
> The perm gen is not a problem here (JDK 8 does not have it and we see OOME on JDK8 too). Each time a class is loaded, new java.lang.Class object is allocated on heap.
>
> Regards, Peter
>
>> Please correct if i am missing something here. Meanwhile i will give the version of Reference Handler you both agreed on a try.
>> [...]
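The two ways of reading pending that are compared in this message can be put side by side in a simplified Java sketch. It is not the java.lang.Reference code: the class name PendingListSketch, the Node type and both poll methods are hypothetical, and the only writers in the sketch are Java threads that hold the same lock, whereas the thread notes that the real pending field is also updated by the collector.

```java
// Hypothetical model of a lock-guarded pending list; not java.lang.Reference.
final class PendingListSketch {

    /** Stand-in for a reference object with its 'discovered' link. */
    static final class Node {
        final String name;
        Node discovered;

        Node(String name) {
            this.name = name;
        }
    }

    private final Object lock = new Object();
    private Node pending; // head of the list, guarded by 'lock' in this sketch

    void push(Node n) {
        synchronized (lock) {
            n.discovered = pending;
            pending = n;
        }
    }

    /** Form 1: copy the field into a local once, then test the local. */
    Node pollWithLocalRead() {
        synchronized (lock) {
            Node r = pending;
            if (r != null) {
                pending = r.discovered;
                r.discovered = null;
            }
            return r;
        }
    }

    /** Form 2: test the field, and assign the local only when it is non-null. */
    Node pollWithFieldCheck() {
        synchronized (lock) {
            Node r = null; // the sketch needs an initial value to return
            if (pending != null) {
                r = pending;
                pending = r.discovered;
                r.discovered = null;
            }
            return r;
        }
    }

    public static void main(String[] args) {
        PendingListSketch list = new PendingListSketch();
        list.push(new Node("a"));
        list.push(new Node("b"));
        System.out.println(list.pollWithLocalRead().name);  // most recently pushed node
        System.out.println(list.pollWithFieldCheck().name); // the node pushed before it
        System.out.println(list.pollWithLocalRead());       // list is now empty
    }
}
```

Within this sketch both forms return the same node for the same list state, since every writer holds the lock. Whether that also holds for the real field depends on how the collector updates it, which the sketch does not model.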
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi Peter/David, we have 2000 runs without a single failure. -- Thanks kalyan Ph: (408)-585-8040 On 1/23/14, 12:10 PM, srikalyan wrote: Hi Peter, i have modified your code from r = pending; if (r != null) { .. TO if (pending != null) { r = pending; This is because the r is used later in the code and must not be assigned pending unless it is not null(this was as is earlier). The new webrev is posted here http://cr.openjdk.java.net/~srikchan/Regression/JDK-8022321_OOMEInReferenceHandler-webrev-V2/ . I ran a 1000 run and no failures so far, however i would like to run a couple more 1000 runs to assert the fix. PS: The description section of JEP-122 (http://openjdk.java.net/jeps/122) says meta-data would be in native memory(not heap). -- Thanks kalyan Ph: (408)-585-8040 On 1/21/14, 2:31 PM, Peter Levart wrote: On 01/21/2014 07:17 PM, srikalyan wrote: Hi Peter/David, catching up after long weekend. Why would there be an OOME in object heap due to class loading in perm gen space ? The perm gen is not a problem her (JDK 8 does not have it and we see OOME on JDK8 too). Each time a class is loaded, new java.lang.Class object is allocated on heap. Regards, Peter Please correct if i am missing something here. Meanwhile i will give the version of Reference Handler you both agreed on a try. -- Thanks kalyan Ph: (408)-585-8040 On 1/21/14, 7:24 AM, Peter Levart wrote: On 01/21/2014 07:54 AM, Peter Levart wrote: *[Loaded sun.misc.Cleaner from /home/peter/Apps64/jdk1.8.0-ea-b121/jre/lib/rt.jar]* [Loaded java.io.ByteArrayInputStream from /home/peter/Apps64/jdk1.8.0-ea-b121/jre/lib/rt.jar] [Loaded sun.util.calendar.ZoneInfoFile$ZoneOffsetTransitionRule from /home/peter/Apps64/jdk1.8.0-ea-b121/jre/lib/rt.jar] ... I'm on linux, 64bit and using official EA build 121 of JDK 8... But if I try with JDK 7u45, I don't see it. So what changed between JDK 7 and JDK 8? I suspect the following: 8007572: Replace existing jdk timezone data at java.home/lib/zi with JSR310's tzdb Regards, Peter
RFR for JDK-6772009 Intermittent test failure: java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java test failed with 'Completed != 2'
Hi all,

I am working on bug JDK-6772009 https://bugs.openjdk.java.net/browse/JDK-6772009 .

Root Cause: The timeout gives too long a grace period on faster machines, where the threads that are meant to be interrupted finish before the interrupt arrives.

Suggested Fix:
a) Decrease the timeout from 100 to 50ms, which will ensure that the test passes even on faster machines and that the threads are still running when they are cancelled. In any case there is a Barrier to ensure that all the test threads complete gracefully.

Miscellaneous fixes:
b) Convert result from int to long (possible integer overflow otherwise).
c) Catch BrokenBarrierException in a more granular fashion in ReentrantLockLoop, to update and print the results (which would be missed otherwise).

Additional diagnostics:
d) Assign names to threads.
e) Print which threads update the 'completed' variable.

I have attached the webrev for convenience, as the changes are interleaved and are best viewed as an sdiff. Please let me know if you have any comments or suggestions.

Thank you

-- Thanks kalyan
Re: RFR for JDK-6772009 Intermittent test failure: java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java test failed with 'Completed != 2'
Hi Martin, i incorporated the recent changes from the pointer as well. I have reproduced the failure, the logs of which are attached to the bug JDK-6772009 https://bugs.openjdk.java.net/browse/JDK-6772009 . The failed log is especially interesting . -- Thanks kalyan On 11/18/13 10:15 PM, Martin Buchholz wrote: Thanks for working on this. There have been some recent upstream changes to this test as well. Please incorporate them. http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/test/jtreg/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java?view=co The jsr166 maintainers haven't been able to reproduce any failures in this test. Do you have any hints that might help us? On Mon, Nov 18, 2013 at 10:12 PM, srikalyan chandrashekar srikalyan.chandrashe...@oracle.com mailto:srikalyan.chandrashe...@oracle.com wrote: Hi all, I am working on bug JDK-6772009 https://bugs.openjdk.java.net/browse/JDK-6772009 . Root Cause: The timeout value gives too much grace period on faster machines on which the TO BE INTERRUPTED threads complete faster before being interrupted at right time. Suggested Fix: a) Decrease the timeout from 100 to 50ms which will ensure that the test will pass even on faster machines , and ensures the threads will be canceled when running and anyways there is a Barrier to ensure the test threads all complete gracefully. Miscellaneous fixes b) Convert result from int to long(possible integer overflow otherwise) c) Catch BrokenBarrierException in more granular fashion in ReentrantLockLoop to update and print the results (which will be missed otherwise) Add more diagnostics d) Assign names to threads e) Print which threads update the 'completed' variable. I have attached the webrev for convenience as the changes are interleaved and is best viewed as sdiff. Please let me know if you have any comments or suggestions. Thank you -- -- Thanks kalyan
RFR for JDK-6963118 Intermittent test failure: test/java/nio/channels/Selector/Wakeup.java fail intermittently (win)
Hi all,

I am working on bug JDK-6963118 https://bugs.openjdk.java.net/browse/JDK-6963118 .

Root Cause:
- A sensitive timing dependency between events in the Main and Sleeper threads causes the test failure.

Suggested Fix:
1) The main thread should wait for more than 1 sec (made it 3 sec) and check more often than at 50ms intervals (made it 1ms). The sleeper thread may still be waiting for the interrupt/wakeup, so flagging a failure after the main thread has waited just 1 sec is premature. This matters especially on Windows, where high-priority virus scanners etc. run and keep the system busy (we faced this when simulating failures).
2) The test is essentially a sequence of 2 events: a) firing wakeups/interrupts, followed by b) a check. Check the sleeper.entries value and yield the main thread as required so that the above 2 events proceed in tandem.

The webrev is hosted at http://cr.openjdk.java.net/~cl/host_for_kal/6963118-Wakeup/ . Please let me know if you have any comments or suggestions.

-- Thanks kalyan
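Point 1 of the suggested fix (a longer overall deadline, polled at short intervals) can be sketched roughly as below. This is only an illustration and not the code in the webrev: the class PollUntil, the method awaitCondition and the counter standing in for sleeper.entries are hypothetical names; the 3 sec and 1ms values are the ones mentioned in the message.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Wakeup.java.
class PollUntil {
    /**
     * Polls the condition every pollMillis until it holds or timeoutMillis
     * has elapsed. Returns true if the condition held before the deadline,
     * false on timeout or if the polling thread was interrupted.
     */
    static boolean awaitCondition(BooleanSupplier condition,
                                  long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // deadline passed without the condition holding
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for sleeper.entries: bumped by another thread after a delay.
        AtomicInteger entries = new AtomicInteger();
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            entries.incrementAndGet();
        });
        sleeper.start();

        // Wait up to 3 sec, checking every 1ms, as proposed in the message.
        boolean seen = awaitCondition(() -> entries.get() > 0, 3000, 1);
        System.out.println("entries observed in time: " + seen);
    }
}
```

A short poll interval keeps the passing case fast, while the longer deadline only costs time when the test is about to fail anyway.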
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi David Thanks for your comments, the unguarded part(clean and enqueue) in the Reference Handler thread does not seem to create any new objects, so it is the application(the test in this case) which is adding objects to heap and causing the Reference Handler to die with OOME. I am still unsure about the side effects of the code change and agree with your thoughts(on memory exhaustion test's reliability). PS: hotspot dev alias removed from CC. -- Thanks kalyan On 12/19/13 5:08 PM, David Holmes wrote: Hi Kalyan, This is not a hotspot issue so I'm moving this to core-libs, please drop hotspot from any replies. On 20/12/2013 6:26 AM, srikalyan wrote: Hi all, I have been working on the bug JDK-8022321 https://bugs.openjdk.java.net/browse/JDK-8022321 , this is a sporadic failure and the webrev is available here http://cr.openjdk.java.net/~srikchan/Regression/JDK-8022321_OOMEInReferenceHandler-webrev/ I'm really not sure what to make of this. We have a test that triggers an out-of-memory condition but the OOME can actually turn up in the ReferenceHandler thread causing it to terminate and the test to fail. We previously accounted for the non-obvious occurrences of OOME due to the Object.wait and the possible need to load the InterruptedException class - but still the OOME can appear where we don't want it. So finally you have just placed the whole for(;;) loop in a try/catch(OOME) that ignores the OOME. I'm certain that makes the test happy, but I'm not sure it is really what we want for the ReferenceHandler thread. If the OOME occurs while cleaning, or enqueuing then we will fail to clean and/or enqueue but there would be no indication that has occurred and I think that is a bigger problem than this test failing. There may be no way to make this test 100% reliable. In fact I'd suggest that no memory exhaustion test can be 100% reliable. 
David

*Root Cause: Still not known*

There are 2 places where there is a possibility for OOME:
1) Cleaner.clean()
2) ReferenceQueue.enqueue()

1) The cleanup code in turn has 2 places where there is potential for throwing OOME:
a) The thunk, which is run from the clean() method. This Runnable is passed to Cleaner and appears in the following classes:
   java/nio/DirectByteBuffer.java
   sun/misc/Perf.java
   sun/nio/fs/NativeBuffer.java
   sun/nio/ch/IOVecWrapper.java
   sun/misc/Cleaner/ExitOnThrow.java
   However none of the above implementations ever create an object in the clean() code.
b) A new PrivilegedAction is created in the try/catch block of the clean() method, but for this object to be created and to be held responsible for the OOME, an exception other than OOME has to be thrown first.

2) No new heap objects are created in the enqueue method nor anywhere in the deep call stack (VM.addFinalRefCount() etc), so this cannot be a potential cause.

*Experimental change to java.lang.Reference.java*:
- Put one more guard (try/catch block for OOME) in the Reference Handler thread, which may give the Reference Handler a chance to clean up. This fixes the test failure (several thousand runs with 0 failures).
- Without the above change the test fails at least 3-5 times in every 1000 runs.

*PS*: The code change is to a very critical part of the JDK and I am not fully aware of the consequences of the change, hence seeking expert help here. Appreciate your time and inputs towards this.
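The experimental guard described in this message might look roughly like the sketch below. It is not the change in the webrev and not the real java.lang.Reference code: GuardedLoopSketch, runAll and the simulated OutOfMemoryError are hypothetical stand-ins. The sketch counts skipped work, because the concern raised in the reply is that swallowing the OOME silently hides a clean/enqueue that never happened.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a per-iteration OOME guard in a handler loop.
class GuardedLoopSketch {
    /**
     * Runs each task in turn. An OutOfMemoryError thrown by one task does not
     * end the loop; the task is counted as skipped instead.
     * Returns {completed, skipped}.
     */
    static int[] runAll(List<Runnable> tasks) {
        int completed = 0;
        int skipped = 0;
        for (Runnable task : tasks) {
            try {
                task.run();
                completed++;
            } catch (OutOfMemoryError oome) {
                // The loop survives, but this task's work was NOT done.
                // Counting it at least makes the loss observable.
                skipped++;
            }
        }
        return new int[] { completed, skipped };
    }

    /** Three tasks, the middle one simulating an OOME. */
    static int[] demo() {
        List<Runnable> tasks = Arrays.<Runnable>asList(
            () -> { },
            () -> { throw new OutOfMemoryError("simulated"); },
            () -> { });
        return runAll(tasks);
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("completed=" + result[0] + " skipped=" + result[1]);
    }
}
```

The guard keeps the thread alive, which is what makes the test pass, but it does nothing to retry the skipped work.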
Re: RFR for JDK-6772009 Intermittent test failure: java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java test failed with 'Completed != 2'
Hi David/Martin, could any one of you sponsor this change for me? --- Thanks kalyan On 12/20/2013 10:28 PM, David Holmes wrote: On 21/12/2013 4:19 AM, srikalyan wrote: Hi David, i retained only the changes to ITERS, ProbleMList.txt and upstream changes by Doug Lea(as pointed by Martin), could you please review the new change available here http://cr.openjdk.java.net/~srikchan/Regression/6772009-CancelledLockLoop-webrev/ . Ok. Thanks, David -- Thanks kalyan Ph: (408)-585-8040 On 12/19/13, 8:10 PM, David Holmes wrote: Sorry Kalyan but I don't see the need for all the incidental changes if the primary change is to just increase the iterations. I also don't see why you need to do anything for BrokenBarrierException as it is not expected to happen and the test should just fail if it does. David On 10/12/2013 6:15 AM, srikalyan wrote: Hi David/Martin a gentle reminder for review. -- Thanks kalyan Ph: (408)-585-8040 On 12/2/13, 11:21 AM, srikalyan wrote: Hi David, Thanks for the review, the new webrev is hosted at http://cr.openjdk.java.net/~cl/host_for_kal/6772009-CancelledLockLoop/ . Please see inline text. On 11/20/13, 6:35 PM, David Holmes wrote: On 21/11/2013 10:28 AM, Martin Buchholz wrote: I again tried and failed to reproduce a failure. Even if I go whole hog and multiply TIMEOUT by 100 and divide ITERS by 100, the test continues to PASS. Is it just me?! I think you are going the wrong way Martin - you want the timeout to be smaller than the time it takes to execute ITERS. I don't think there's any reason to make result long. It's not even used except to inhibit hotspot optimizations. +private volatile long result = 17;//Can get int overflow,so using long Further the subsequent use of += is incorrect as it is not an atomic operation. Even if we don't care about the value, it looks bad. Made the necessary changes for atomic update. I'm not sure result must be updated if we get a BrokenBarrierException either. Probably harmless, but necessary? 
I retained it in the fix for completeness in updating the numbers, please let me know if you still think otherwise. need to fix spelling and spacing below. +barrier.await();//If a BrokeBarrierException happens here(due to There are a number of style issues with spacing around comments. Fixed the spelling error and styling issues. And I don't think this change is sufficient to claim co-author status with Doug either ;-) Removed the claim :) The additional tracing may be useful but seems stylistically different from the rest of the code. Retained the tracking to understand if it is again the timing issue which is the cause in an event of a failure, however i can remove it if you think it is not necessary (OR) include an alternate solution as you may want to suggest. Overall I'm suspicious that the changed timeout actually fixes anything - normally we need to add longer timeouts not shorter ones. Does this fail on a range of machines or only specific ones? Have we verified that the clocks/timers are behaving properly on those systems? Here the time out is not about waiting for threads to complete something but to be interrupted before being considered done, so we decreased the timeout. However we now chose to increase the number of iterations to 500 from 100(thanks to tristan for the suggestion) instead of decreasing the timeout as done earlier because the increasing iterations ensures the threads are busy for long time curtailing the need to touch the timeout. Thanks, David -- Thanks kalyan Ph: (408)-585-8040 On Wed, Nov 20, 2013 at 11:52 AM, srikalyan srikalyan.chandrashe...@oracle.com wrote: Hi Martin , apologies for the delay , was trying to get help for hosting my webrev. . Please see inline text. On 11/19/13, 10:35 PM, Martin Buchholz wrote: Hi Kalyan, None of us can review your changes yet because you haven't given us a URL of your webrev. 
It is located here http://cr.openjdk.java.net/~cl/host_for_srikalyan_6772009_CancelledLockLoops/ I've tried to make the jsr166 copy of CancelledLockLoops fail by adjusting ITERS and TIMEOUT radically up and down, but the test just keeps on passing for me. Hints appreciated. Bump up the timeout to 500ms and you will see a failure (i can see it consistently on my machine Linux 64bit,8GBRAM,dual cores, with JDK 1.8 latest any promoted build). On Tue, Nov 19, 2013 at 6:39 PM, srikalyan chandrashekar srikalyan.chandrashe...@oracle.com wrote: Suggested Fix: a) Decrease the timeout from 100 to 50ms which will ensure that the test will pass even on faster machines This doesn't look like a permanent fix - it just makes the failing case rarer. Thats true , the other way is to make the main thread wait on TIMEOUT after firing the interrupts instead of other way round, but that would be over-optimization which probably is not desirable as well. The 50 ms was arrived at empirically
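On the review comment earlier in this message that `+=` on a volatile field is not an atomic operation: one common way to get an atomic update is java.util.concurrent.atomic.AtomicLong. The sketch below is only an illustration under that assumption; the thread does not show how the webrev made the update atomic, and AtomicResultSketch, accumulate and demo are hypothetical names. The initial value 17 is taken from the quoted line.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not the code from the webrev.
class AtomicResultSketch {
    // A volatile long with "result += x" is a separate read and write,
    // so concurrent updates can be lost. AtomicLong makes it one step.
    private final AtomicLong result = new AtomicLong(17);

    void accumulate(long delta) {
        result.addAndGet(delta);
    }

    long get() {
        return result.get();
    }

    /** Starts the given number of threads, each adding 1 addsPerThread times. */
    static long demo(int threads, int addsPerThread) {
        AtomicResultSketch sketch = new AtomicResultSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < addsPerThread; j++) {
                    sketch.accumulate(1);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return sketch.get();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000));
    }
}
```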
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi Mandy, after some trials i could simulate the failure again (now with UEH in place), however the UEH now cannot print enough details as it also tries to allocate memory, when it does Thread.getName()(it internally creates a String object), printStackTrace() also creates new WrappedPrintStream object. See the following trace Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread Reference Handler ERROR: java.lang.Exception: Reference Handler thread died. at OOMEInReferenceHandler.main(OOMEInReferenceHandler.java:105) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at com.sun.javatest.regtest.MainWrapper$MainThread.run(MainWrapper.java:94) at java.lang.Thread.run(Thread.java:744) Meanwhile i am trying looking around to actually print something useful without allocating any new memory. --- Thanks kalyan On 12/20/2013 01:00 PM, srikalyan wrote: Hi Mandy, yes I ran with JTreg to simulate the failure, i will try the UEH patch to see if it sheds some light and get back to you. Thanks for the direction :) -- Thanks kalyan Ph: (408)-585-8040 On 12/19/13, 8:33 PM, Mandy Chung wrote: Hi Srikalyan, Maybe you can get add an uncaught handler to see if you can get any information. I ran it for 1000 times but not able to duplicate the failure. Did you run it with jtreg (I didn't)? Below is the patch to install a thread's uncaught handler that you can take and try. 
diff --git a/test/java/lang/ref/OOMEInReferenceHandler.java b/test/java/lang/ref/OOMEInReferenceHandler.java
--- a/test/java/lang/ref/OOMEInReferenceHandler.java
+++ b/test/java/lang/ref/OOMEInReferenceHandler.java
@@ -51,6 +51,14 @@
         return first;
     }
 
+    static class UEH implements Thread.UncaughtExceptionHandler {
+        public void uncaughtException(Thread t, Throwable e) {
+            System.err.println("ERROR: " + t.getName() + " exception " +
+                e.getMessage());
+            e.printStackTrace();
+        }
+    }
+
     public static void main(String[] args) throws Exception {
         // preinitialize the InterruptedException class so that the reference handler
         // does not die due to OOME when loading the class if it is the first use
@@ -77,6 +85,8 @@
             throw new IllegalStateException("Couldn't find Reference Handler thread.");
         }
 
+        referenceHandlerThread.setUncaughtExceptionHandler(new UEH());
+
         ReferenceQueue<Object> refQueue = new ReferenceQueue<>();
         Object referent = new Object();
         WeakReference<Object> weakRef = new WeakReference<>(referent, refQueue);

On 12/19/2013 6:57 PM, srikalyan chandrashekar wrote:

Hi David Thanks for your comments, the unguarded part(clean and enqueue) in the Reference Handler thread does not seem to create any new objects, so it is the application(the test in this case) which is adding objects to heap and causing the Reference Handler to die with OOME. I am still unsure about the side effects of the code change and agree with your thoughts(on memory exhaustion test's reliability). PS: hotspot dev alias removed from CC.

-- Thanks kalyan

On 12/19/13 5:08 PM, David Holmes wrote:

Hi Kalyan, This is not a hotspot issue so I'm moving this to core-libs, please drop hotspot from any replies.
On 20/12/2013 6:26 AM, srikalyan wrote: Hi all, I have been working on the bug JDK-8022321 https://bugs.openjdk.java.net/browse/JDK-8022321 , this is a sporadic failure and the webrev is available here http://cr.openjdk.java.net/~srikchan/Regression/JDK-8022321_OOMEInReferenceHandler-webrev/ I'm really not sure what to make of this. We have a test that triggers an out-of-memory condition but the OOME can actually turn up in the ReferenceHandler thread causing it to terminate and the test to fail. We previously accounted for the non-obvious occurrences of OOME due to the Object.wait and the possible need to load the InterruptedException class - but still the OOME can appear where we don't want it. So finally you have just placed the whole for(;;) loop in a try/catch(OOME) that ignores the OOME. I'm certain that makes the test happy, but I'm not sure it is really what we want for the ReferenceHandler thread. If the OOME occurs while cleaning, or enqueuing then we will fail to clean and/or enqueue but there would be no indication that has occurred and I think that is a bigger problem than this test failing. There may be no way to make this test 100% reliable. In fact I'd suggest that no memory exhaustion test can be 100% reliable. David * **Root Cause:Still not known* 2 places where there is a possibility for OOME 1) Cleaner.clean() 2
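On the problem at the top of this message (the UEH itself running out of memory while printing): one possible approach is to have the handler only store references into fields that already exist, and to let another thread do the printing later, once memory is available again. The following is a minimal sketch under that assumption, with hypothetical names (RecordingHandler, demo) and a simulated failure rather than a real OOME. Whether this is sufficient under genuine heap exhaustion has not been verified here.

```java
// Hypothetical illustration; not part of OOMEInReferenceHandler.java.
class RecordingHandler implements Thread.UncaughtExceptionHandler {
    // Fields exist before the failure; the handler only overwrites them.
    volatile Thread failedThread;
    volatile Throwable failure;

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // Reference stores only: no String building, no stream wrappers.
        failedThread = t;
        failure = e;
    }

    /** Runs a thread that dies with a simulated error and returns what was recorded. */
    static Throwable demo() {
        RecordingHandler handler = new RecordingHandler();
        RuntimeException simulated = new RuntimeException("simulated");
        Thread worker = new Thread(() -> { throw simulated; }, "worker");
        worker.setUncaughtExceptionHandler(handler);
        worker.start();
        try {
            worker.join(); // the handler runs on the dying thread, before it terminates
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handler.failure;
    }

    public static void main(String[] args) {
        Throwable recorded = demo();
        // Printing happens here, on a different thread, after the fact.
        if (recorded != null) {
            recorded.printStackTrace();
        }
    }
}
```

In the test, the main thread already checks whether the Reference Handler thread died, so it would be the natural place to inspect the recorded fields.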
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Sure David will give that a try, we have so far attempted to 1. Print state data(as per the test creator peter.levart's inputs), 2. Use UEH(uncaught exception handler per Mandy's inputs) -- Thanks kalyan On 1/6/14 4:40 PM, David Holmes wrote: Back from vacation ... On 20/12/2013 4:49 PM, David Holmes wrote: On 20/12/2013 12:57 PM, srikalyan chandrashekar wrote: Hi David Thanks for your comments, the unguarded part(clean and enqueue) in the Reference Handler thread does not seem to create any new objects, so it is the application(the test in this case) which is adding objects to heap and causing the Reference Handler to die with OOME. The ReferenceHandler thread can only get OOME if it allocates (directly or indirectly) - so there has to be something in the unguarded part that causes this. Again it may be an implicit action in the VM - similar to the class load issue for InterruptedException. Run a debug VM with -XX:+TraceExceptions to see where the OOME is triggered. David - David I am still unsure about the side effects of the code change and agree with your thoughts(on memory exhaustion test's reliability). PS: hotspot dev alias removed from CC. -- Thanks kalyan On 12/19/13 5:08 PM, David Holmes wrote: Hi Kalyan, This is not a hotspot issue so I'm moving this to core-libs, please drop hotspot from any replies. On 20/12/2013 6:26 AM, srikalyan wrote: Hi all, I have been working on the bug JDK-8022321 https://bugs.openjdk.java.net/browse/JDK-8022321 , this is a sporadic failure and the webrev is available here http://cr.openjdk.java.net/~srikchan/Regression/JDK-8022321_OOMEInReferenceHandler-webrev/ I'm really not sure what to make of this. We have a test that triggers an out-of-memory condition but the OOME can actually turn up in the ReferenceHandler thread causing it to terminate and the test to fail. 
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Peter, getting state info out(to console or otherwise) from within Reference Handler's exceptions handlers have been unsuccessful. However David's suggestion produced some useful trace with fast debug build and could get some information , see the log here http://cr.openjdk.java.net/%7Esrikchan/OOME_exception_trace.log . --- Thanks kalyan On 01/07/2014 12:42 AM, Peter Levart wrote: On 01/07/2014 03:15 AM, srikalyan chandrashekar wrote: Sure David will give that a try, we have so far attempted to 1. Print state data(as per the test creator peter.levart's inputs), Hi Kalyan, Have you been able to reproduce the OOME in that set-up? What was the result? Regards, Peter 2. Use UEH(uncaught exception handler per Mandy's inputs) -- Thanks kalyan On 1/6/14 4:40 PM, David Holmes wrote: Back from vacation ... On 20/12/2013 4:49 PM, David Holmes wrote: On 20/12/2013 12:57 PM, srikalyan chandrashekar wrote: Hi David Thanks for your comments, the unguarded part(clean and enqueue) in the Reference Handler thread does not seem to create any new objects, so it is the application(the test in this case) which is adding objects to heap and causing the Reference Handler to die with OOME. The ReferenceHandler thread can only get OOME if it allocates (directly or indirectly) - so there has to be something in the unguarded part that causes this. Again it may be an implicit action in the VM - similar to the class load issue for InterruptedException. Run a debug VM with -XX:+TraceExceptions to see where the OOME is triggered. David - David I am still unsure about the side effects of the code change and agree with your thoughts(on memory exhaustion test's reliability). PS: hotspot dev alias removed from CC. -- Thanks kalyan On 12/19/13 5:08 PM, David Holmes wrote: Hi Kalyan, This is not a hotspot issue so I'm moving this to core-libs, please drop hotspot from any replies. 
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi David,

TraceExceptions with fastdebug build produced some nice trace http://cr.openjdk.java.net/%7Esrikchan/OOME_exception_trace.log . The native method wait(long) is where the OOME is being thrown; the deepest call is in src/share/vm/gc_interface/collectedHeap.inline.hpp, line 157.

--- Excerpt Begins ---
147   if (!gc_overhead_limit_was_exceeded) {
148     // -XX:+HeapDumpOnOutOfMemoryError and -XX:OnOutOfMemoryError support
149     report_java_out_of_memory("Java heap space");
150
151     if (JvmtiExport::should_post_resource_exhausted()) {
152       JvmtiExport::post_resource_exhausted(
153         JVMTI_RESOURCE_EXHAUSTED_OOM_ERROR | JVMTI_RESOURCE_EXHAUSTED_JAVA_HEAP,
154         "Java heap space");
155     }
156
157     THROW_OOP_0(Universe::out_of_memory_error_java_heap());
158   } else {
--- Excerpt Ends ---

It would be helpful if David or someone else in the team could explain the latent aspects/probable cause.

--- Thanks kalyan

On 01/06/2014 04:40 PM, David Holmes wrote:

Back from vacation ...

On 20/12/2013 4:49 PM, David Holmes wrote:

On 20/12/2013 12:57 PM, srikalyan chandrashekar wrote:

Hi David Thanks for your comments, the unguarded part(clean and enqueue) in the Reference Handler thread does not seem to create any new objects, so it is the application(the test in this case) which is adding objects to heap and causing the Reference Handler to die with OOME.

The ReferenceHandler thread can only get OOME if it allocates (directly or indirectly) - so there has to be something in the unguarded part that causes this. Again it may be an implicit action in the VM - similar to the class load issue for InterruptedException.

Run a debug VM with -XX:+TraceExceptions to see where the OOME is triggered.

David

David

I am still unsure about the side effects of the code change and agree with your thoughts(on memory exhaustion test's reliability). PS: hotspot dev alias removed from CC.
-- Thanks kalyan

On 12/19/13 5:08 PM, David Holmes wrote:

Hi Kalyan, This is not a hotspot issue so I'm moving this to core-libs, please drop hotspot from any replies.

On 20/12/2013 6:26 AM, srikalyan wrote:

Hi all, I have been working on the bug JDK-8022321 https://bugs.openjdk.java.net/browse/JDK-8022321 . This is a sporadic failure, and the webrev is available here: http://cr.openjdk.java.net/~srikchan/Regression/JDK-8022321_OOMEInReferenceHandler-webrev/

I'm really not sure what to make of this. We have a test that triggers an out-of-memory condition but the OOME can actually turn up in the ReferenceHandler thread causing it to terminate and the test to fail. We previously accounted for the non-obvious occurrences of OOME due to the Object.wait and the possible need to load the InterruptedException class - but still the OOME can appear where we don't want it. So finally you have just placed the whole for(;;) loop in a try/catch(OOME) that ignores the OOME. I'm certain that makes the test happy, but I'm not sure it is really what we want for the ReferenceHandler thread. If the OOME occurs while cleaning, or enqueuing then we will fail to clean and/or enqueue but there would be no indication that has occurred and I think that is a bigger problem than this test failing. There may be no way to make this test 100% reliable. In fact I'd suggest that no memory exhaustion test can be 100% reliable. David

Root Cause: Still not known.

There are 2 places where an OOME is possible:
1) Cleaner.clean()
2) ReferenceQueue.enqueue()

1) The cleanup code in turn has 2 places with the potential to throw OOME:
a) The thunk Runnable which is run from the clean() method. This Runnable is passed to Cleaner and appears in the following classes:
   java/nio/DirectByteBuffer.java
   sun/misc/Perf.java
   sun/nio/fs/NativeBuffer.java
   sun/nio/ch/IOVecWrapper.java
   sun/misc/Cleaner/ExitOnThrow.java
However, none of the above implementations ever creates an object in the clean() code.
b) The new PrivilegedAction created in the try/catch Exception block of the clean() method. For this object to be created, and so be held responsible for the OOME, an Exception (other than OOME) has to be thrown first.

2) No new heap objects are created in the enqueue method nor anywhere in the deep call stack (VM.addFinalRefCount() etc.), so this cannot be a potential cause.

Experimental change to java.lang.Reference.java:
- Put one more guard (try/catch on OOME) in the Reference Handler thread, which may give the Reference Handler a chance to clean up. This fixes the test failure (several thousand runs with 0 failures).
- Without the above change the test fails at least 3-5 times in every 1000 runs.

PS: The code change is to a very critical part of the JDK and I am not fully aware of the consequences of the change, hence I am seeking expert help here. I appreciate your time and input on this.
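To make the experimental change above concrete, here is a minimal sketch of an outer guard in a handler-style loop. It is not the actual java.lang.ref.Reference code: GuardedHandlerSketch, drain, STOP and demo are hypothetical names, the OOME is simulated by throwing it directly, and placing the try/catch inside the loop body (so the loop keeps going) is an assumption about the intent of the change.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for a handler loop; NOT the real Reference.java code.
final class GuardedHandlerSketch {
    // Sentinel task that tells the loop to stop (illustration only).
    static final Runnable STOP = new Runnable() {
        @Override public void run() { }
    };

    // Runs tasks until STOP is taken; returns {completed, swallowedOomes}.
    static int[] drain(BlockingQueue<Runnable> queue) {
        int completed = 0;
        int swallowed = 0;
        for (;;) {
            Runnable task;
            try {
                task = queue.take();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
            if (task == STOP) {
                return new int[] { completed, swallowed };
            }
            try {
                task.run();
                completed++;
            } catch (OutOfMemoryError oome) {
                // Outer guard: the loop survives, but this task's work is lost.
                swallowed++;
            }
        }
    }

    // Two normal tasks around one task that throws a simulated OOME.
    static int[] demo() {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        queue.add(() -> { });
        queue.add(() -> { throw new OutOfMemoryError("simulated"); });
        queue.add(() -> { });
        queue.add(STOP);
        return drain(queue);
    }
}
```

The sketch counts the swallowed errors only to make David's concern visible: with a bare guard, the skipped clean or enqueue leaves no trace.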
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
David/Peter, you are right: the trace came from a passing run. I am trying to reproduce the failure and get the logs for failed runs (2000+ runs done and still no failure), and will get back to you once I have the data from a failed run. Sorry for the confusion.

--- Thanks kalyan

On 01/08/2014 11:22 PM, David Holmes wrote:

Thanks Peter. Kalyan: Can you confirm, as Peter asked, that the TraceExceptions output came from a failed run? AFAICS the Trace info is printed after each bytecode where there is a pending exception - though I'm not 100% sure on the printing within the VM runtime. Based on that I think we see the Trace output in run() at the point where wait() returns, so it may well be caught after that - in which case this was not a failing run. I also can't reproduce the problem :( David

On 8/01/2014 10:34 PM, Peter Levart wrote: On 01/08/2014 07:30 AM, David Holmes wrote: On 8/01/2014 4:19 PM, David Holmes wrote: On 8/01/2014 7:33 AM, srikalyan chandrashekar wrote:

Hi David, TraceExceptions with fastdebug build produced some nice trace http://cr.openjdk.java.net/%7Esrikchan/OOME_exception_trace.log . The native method wait(long) is where the OOME is being thrown, the deepest call is in src/share/vm/gc_interface/collectedHeap.inline.hpp, line 157

Yes but it is the caller that is of interest:

Exception a 'java/lang/OutOfMemoryError' (0xd6a01840) thrown [/HUDSON/workspace/8-2-build-linux-amd64/jdk8/1317/hotspot/src/share/vm/runtime/objectMonitor.cpp, line 1649] for thread 0x7f78c40d2800

Exception a 'java/lang/OutOfMemoryError' (0xd6a01840) thrown in interpreter method {method} {0x7f78b4800ae0} 'wait' '(J)V' in 'java/lang/Object' at bci 0 for thread 0x7f78c40d2800

The ReferenceHandler thread gets the OOME trying to allocate the InterruptedException. However we already have a catch block around the wait() so how is this OOME getting through? A bug in exception handling in the interpreter ??

Might be.
And it may have something to do with the fact that the Thread.run() method is the 1st call frame on the thread's stack (seems like corner case). The last few meaningful TraceExceptions records are:

Exception a 'java/lang/OutOfMemoryError' (0xd6a01840) thrown [/HUDSON/workspace/8-2-build-linux-amd64/jdk8/1317/hotspot/src/share/vm/gc_interface/collectedHeap.inline.hpp, line 157] for thread 0x7f78c40d2800

Exception a 'java/lang/OutOfMemoryError' (0xd6a01840) thrown [/HUDSON/workspace/8-2-build-linux-amd64/jdk8/1317/hotspot/src/share/vm/runtime/objectMonitor.cpp, line 1649] for thread 0x7f78c40d2800

Exception a 'java/lang/OutOfMemoryError' (0xd6a01840) thrown in interpreter method {method} {0x7f78b4800ae0} 'wait' '(J)V' in 'java/lang/Object' at bci 0 for thread 0x7f78c40d2800

Exception a 'java/lang/OutOfMemoryError' (0xd6a01840) thrown in interpreter method {method} {0x7f78b4800ca8} 'wait' '()V' in 'java/lang/Object' at *bci 2* for thread 0x7f78c40d2800

Exception a 'java/lang/OutOfMemoryError' (0xd6a01840) thrown in interpreter method {method} {0x7f78b48d2250} 'run' '()V' in 'java/lang/ref/Reference$ReferenceHandler' at *bci 36* for thread 0x7f78c40d2800

Here's the relevant bytecodes:

public class java.lang.Object
  public final void wait() throws java.lang.InterruptedException;
    descriptor: ()V
    flags: ACC_PUBLIC, ACC_FINAL
    Code:
      stack=3, locals=1, args_size=1
         0: aload_0
         1: lconst_0
        *2: invokevirtual #73 // Method wait:(J)V*
         5: return
      LineNumberTable:
        line 502: 0
        line 503: 5
    Exceptions:
      throws java.lang.InterruptedException

class java.lang.ref.Reference$ReferenceHandler extends java.lang.Thread
  public void run();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=5, args_size=1
         0: invokestatic #62 // Method java/lang/ref/Reference.access$100:()Ljava/lang/ref/Reference$Lock;
         3: dup
         4: astore_2
         5: monitorenter
         6: invokestatic #61 // Method java/lang/ref/Reference.access$200:()Ljava/lang/ref/Reference;
         9: ifnull 33
        12: invokestatic #61 // Method java/lang/ref/Reference.access$200:()Ljava/lang/ref/Reference;
        15: astore_1
        16: aload_1
        17: invokestatic #64 // Method java/lang/ref/Reference.access$300:(Ljava/lang/ref/Reference;)Ljava/lang/ref/Reference;
        20: invokestatic #63 // Method java/lang/ref/Reference.access$202:(Ljava/lang/ref/Reference;)Ljava/lang/ref/Reference;
        23: pop
        24: aload_1
        25: aconst_null
        26: invokestatic #65 // Method java/lang/ref/Reference.access$302:(Ljava/lang/ref/Reference;Ljava/lang/ref/Reference;)Ljava/lang/ref/Reference;
        29: pop
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi Peter,

The version you provided ran indefinitely (I put a 10 minute timeout) and the program got interrupted (no error). Even if there were an error, you cannot print the thread's string to the console (this has been attempted earlier).

- The test is running in interpreter mode; what I am watching for is one error with a trace. Without the fastdebug build and -XX:+TraceExceptions I am able to reproduce at least 5 failures out of 1000 runs, but with fastdebug+Trace no luck yet (already past a few thousand runs).

--- Thanks kalyan

On 01/10/2014 02:57 AM, Peter Levart wrote: On 01/10/2014 09:31 AM, Peter Levart wrote:

Since we suspect there's something wrong with exception handling in interpreter, I devised a hypothetical reproducer that tries to simulate ReferenceHandler in many aspects, but doesn't require to be a ReferenceHandler: http://cr.openjdk.java.net/~plevart/misc/OOME/OOMECatchingTest.java This is designed to run indefinitely and only terminate if/when thread dies. Could you run this program in the environment that causes the OOMEInReferenceHandler test to fail and see if it terminates?

I forgot to mention that in order for this long-running program to exhibit interpreter behaviour, it should be run with -Xint option. So I suggest: -Xmx24M -XX:-UseTLAB -Xint

Regards, Peter
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi Peter/David,

We could finally get a trace of the exception with a fastdebug build and a modified ReferenceHandler (with runImpl() added and called from run()). The logs and disassembled code are available in JIRA https://bugs.openjdk.java.net/browse/JDK-8022321 as attachments.

Observations from the log:

Root Cause:
1) UncaughtException is being dispatched from Reference.java:143

141 Reference<Object> r;
142 synchronized (lock) {
143     if (pending != null) {
144         r = pending;
145         pending = r.discovered;
146         r.discovered = null;

The pending field in Reference is touched and updated by the collector, so at line 143, when the execution context is in the Reference Handler, there might have been an Exception pending due to an allocation done by the collector, which causes the ReferenceHandler thread to die.

Suggested fix:
- As proposed earlier, putting an outer guard (try/catch on OOME) in the ReferenceHandler will fix the issue. If the ReferenceHandler is considered part of the GC subsystem then it should stay alive even in the midst of an OOME, so I feel that the additional guard should be allowed; however I might still be ignorant of vital implications.
- Apart from the above changes, Peter's suggestion to create and call a private runImpl() from run() in ReferenceHandler makes sense to me.

--- Thanks kalyan

On 01/13/2014 03:57 PM, srikalyan wrote: On 1/11/14, 6:15 AM, Peter Levart wrote: On 01/10/2014 10:51 PM, srikalyan chandrashekar wrote:

Hi Peter the version you provided ran indefinitely (I put a 10 minute timeout) and the program got interrupted (no error),

Did you run it with or without fastdebug -XX:+TraceExceptions ? If with, it might be that fastdebug and/or -XX:+TraceExceptions changes the execution a bit so that we can no longer reproduce the wrong behaviour.

With fastdebug and -XX:+TraceExceptions. I will try combinations of the possible options (i.e. without -XX:+TraceExceptions on a debug build etc.) soon.

even if there were to be an error you cannot print the string of thread to console (these have been attempted earlier).

...it has been attempted to print toString in uncaught exception handler. At that time, the heap is still full. I'm printing it after the GC has cleared the heap. You can try that it works by commenting out the try { and corresponding } catch (OOME x) {} exception handler...

Since there is a GC call prior to printing the string I will give that a shot with a non-debug build.

- The test's running on interpreter mode, what I am watching for is one error with trace. Without fastdebug build and -XX:+TraceExceptions I am able to reproduce at least 5 failures out of 1000 runs but with fastdebug+Trace no luck yet (already past a few thousand runs).

It might be interesting to try with fastdebug build but without the -XX:+TraceExceptions option to see what has an effect on it. It might also be interesting to try the modified ReferenceHandler (the one with private runImpl() method called from run()) and with normal non-fastdebug JDK. This info might be useful when one starts to inspect the exception handling code in interpreter...

Regards, Peter

-- Thanks kalyan Ph: (408)-585-8040
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi David,

The disassembled code is also attached to the bug. Per my analysis the exception was thrown when the Reference Handler was on line 143, as described in the earlier email.

-- Thanks kalyan

On 1/16/14 6:16 PM, David Holmes wrote: On 17/01/2014 4:48 AM, srikalyan wrote:

Hi David

On 1/15/14, 9:04 PM, David Holmes wrote: On 16/01/2014 10:19 AM, srikalyan chandrashekar wrote:

Hi Peter/David, we could finally get a trace of exception with fastdebug build and ReferenceHandler modified (with runImpl() added and called from run()). The logs, disassembled code is available in JIRA https://bugs.openjdk.java.net/browse/JDK-8022321 as attachments.

All I can see is the log for the OOMECatchingTest program not one for the actual ReferenceHandler ??

Please search for ReferenceHandler in the log.

Observations from the log: Root Cause: 1) UncaughtException is being dispatched from Reference.java:143

141 Reference<Object> r;
142 synchronized (lock) {
143     if (pending != null) {
144         r = pending;
145         pending = r.discovered;
146         r.discovered = null;

pending field in Reference is touched and updated by the collector, so at line 143 when the execution context is in Reference handler there might have been an Exception pending due to allocation done by collector which causes ReferenceHandler thread to die.

Sorry but the GC does not trigger asynchronous exceptions so this explanation does not make any sense to me. What part of the log led you to this conclusion?

-- Log Excerpt begins --

Exception a 'java/lang/OutOfMemoryError' (0xff7808e8) thrown [/home/srikalyc/work/ora2013/infracleanup/jdk8/hotspot/src/share/vm/gc_interface/collectedHeap.inline.hpp, line 168] for thread 0x7feed80cf800

Exception a 'java/lang/OutOfMemoryError' (0xff7808e8) thrown in interpreter method {method} {0x7feeddd3c600} 'runImpl' '()V' in 'java/lang/ref/Reference$ReferenceHandler' at bci 65 for thread 0x7feed80cf800

Exception a 'java/lang/OutOfMemoryError' (0xff7808e8) thrown in interpreter method {method} {0x7feeddd3c478} 'run' '()V' in 'java/lang/ref/Reference$ReferenceHandler' at bci 1 for thread 0x7feed80cf800

Exception a 'java/lang/OutOfMemoryError' (0xff780868) thrown [/home/srikalyc/work/ora2013/infracleanup/jdk8/hotspot/src/share/vm/gc_interface/collectedHeap.inline.hpp, line 157] for thread 0x7feed80cf800

Exception a 'java/lang/OutOfMemoryError' (0xff780868) thrown in interpreter method {method} {0x7feeddcaaf90} 'uncaughtException' '(Ljava/lang/Thread;Ljava/lang/Throwable;)V' in ' at bci 48 for thread 0x7feed80cf800

Exception a 'java/lang/OutOfMemoryError' (0xff780868) thrown in interpreter method {method} {0x7feeddca7298} 'dispatchUncaughtException' '(Ljava/lang/Throwable;)V' in 'java/lang/ at bci 6 for thread 0x7feed80cf800

-- Log Excerpt ends --

Sorry if it is a wrong understanding.

What you are seeing there is an OOME escaping the run() method which will cause the uncaughtExceptionHandler to be run which then triggers a second OOME (likely as it tries to report information about the first OOME). The first exception occurred in runImpl at BCI 65. Can you disassemble (javap -c) the class you used so we can see what is at BCI 65. Thanks, David

Suggested fix: - As proposed earlier putting an outer guard (try-catch on OOME) in the ReferenceHandler will fix the issue, if ReferenceHandler is considered as part of the GC sub system then it should be alive even in the midst of an OOME so I feel that the additional guard should be allowed, however I might still be ignorant of vital implications. - Apart from the above changes, Peter's suggestion to create and call a private runImpl() from run() in ReferenceHandler makes sense to me.

Why would we need this? David
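The escape path David describes (an error leaving run() and reaching the thread's uncaught exception handler) can be sketched as follows. This is an illustration only: EscapingErrorSketch and runAndCapture are hypothetical names, and the OOME is simulated by throwing it directly rather than by exhausting the heap.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; the OOME here is simulated, not a real exhaustion.
final class EscapingErrorSketch {
    // Starts a thread whose run() lets an error escape and returns what
    // the uncaught exception handler observed.
    static Throwable runAndCapture() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread t = new Thread(() -> {
            throw new OutOfMemoryError("simulated"); // escapes run()
        }, "handler-sketch");
        // The handler runs on the dying thread. This one only stores a
        // reference and so does not allocate anything itself.
        t.setUncaughtExceptionHandler((thread, error) -> seen.set(error));
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen.get();
    }
}
```

A handler that builds a message or prints a stack trace allocates while the heap is still full, which would be consistent with the second OOME at 'uncaughtException' in the log excerpt.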
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
On 1/16/14 8:38 PM, David Holmes wrote: On 17/01/2014 1:31 PM, srikalyan chandrashekar wrote:

Hi David, the disassembled code is also attached to the bug.

Sorry missed that.

Per my analysis the exception was thrown when Reference Handler was on line 143 as put in the earlier email.

But if the numbers in the disassembly match the BCI then 65 shows:

    65: instanceof #11 // class sun/misc/Cleaner

which makes more sense, the runtime instanceof check might encounter an OOME condition. I wish there was some easy way to trace into the full call chain as TraceExceptions doesn't show you any runtime frames :( Still, it is easy enough to check:

    // Fast path for cleaners
    boolean isCleaner = false;
    try {
        isCleaner = r instanceof Cleaner;
    } catch (OutOfMemoryError oome) {
        continue;
    }
    if (isCleaner) {
        ((Cleaner)r).clean();
        continue;
    }

Thanks, David

I will get this into a build and give it a shot soon. In the log, bci 6 and bci 48 are where the dispatch and uncaught exceptions are raised (please correct me if I am wrong); I assumed they come from the ReferenceHandler thread since the log shows the same thread id 0x7feed80cf800.

-- Thanks kalyan
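For readers who want to see David's guarded check in a complete, compilable form, here is a rough sketch. It does not use the real sun.misc.Cleaner or Reference internals: Cleanable, process and demo are hypothetical names, and a plain Deque stands in for the pending list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; Cleanable stands in for sun.misc.Cleaner.
final class CleanerFastPathSketch {
    interface Cleanable { void clean(); }

    // Drains pendingRefs; cleaners are run, everything else is "enqueued".
    static List<Object> process(Deque<Object> pendingRefs) {
        List<Object> enqueued = new ArrayList<>();
        while (!pendingRefs.isEmpty()) {
            Object r = pendingRefs.poll();
            boolean isCleaner;
            try {
                // The check may need to load the class, which can allocate.
                isCleaner = r instanceof Cleanable;
            } catch (OutOfMemoryError oome) {
                continue; // in this sketch the item is simply skipped
            }
            if (isCleaner) {
                ((Cleanable) r).clean(); // fast path for cleaners
                continue;
            }
            enqueued.add(r);
        }
        return enqueued;
    }

    // Returns {number of cleaners run, number of items enqueued}.
    static int[] demo() {
        int[] cleaned = { 0 };
        Deque<Object> pendingRefs = new ArrayDeque<>();
        pendingRefs.add((Cleanable) () -> cleaned[0]++);
        pendingRefs.add("plain reference");
        List<Object> enqueued = process(pendingRefs);
        return new int[] { cleaned[0], enqueued.size() };
    }
}
```

Under normal conditions the guard never fires, so the sketch only shows where the try/catch sits relative to the fast path, not the failure itself.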
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi David,

Yes, that's right; the only benefit I see is that we can avoid the assignment to 'r' if pending is null.

-- Thanks kalyan

On 1/23/14 4:33 PM, David Holmes wrote: On 24/01/2014 6:10 AM, srikalyan wrote:

Hi Peter, I have modified your code from

    r = pending; if (r != null) { ..

TO

    if (pending != null) { r = pending;

This is because r is used later in the code and must not be assigned pending unless it is not null (this was as is earlier).

If r is null, because pending is null then you perform the wait() and then continue - back to the top of the loop. There is no bug in Peter's code.

The new webrev is posted here http://cr.openjdk.java.net/~srikchan/Regression/JDK-8022321_OOMEInReferenceHandler-webrev-V2/ . I ran 1000 runs with no failures so far; however I would like to run a couple thousand more runs to confirm the fix. PS: The description section of JEP-122 (http://openjdk.java.net/jeps/122) says meta-data would be in native memory (not heap).

The class_mirror is a Java object not meta-data. David

-- Thanks kalyan Ph: (408)-585-8040

On 1/21/14, 2:31 PM, Peter Levart wrote: On 01/21/2014 07:17 PM, srikalyan wrote:

Hi Peter/David, catching up after long weekend. Why would there be an OOME in object heap due to class loading in perm gen space ?

The perm gen is not a problem here (JDK 8 does not have it and we see OOME on JDK8 too). Each time a class is loaded, new java.lang.Class object is allocated on heap. Regards, Peter

Please correct if I am missing something here. Meanwhile I will give the version of Reference Handler you both agreed on a try.

-- Thanks kalyan Ph: (408)-585-8040

On 1/21/14, 7:24 AM, Peter Levart wrote: On 01/21/2014 07:54 AM, Peter Levart wrote:

*[Loaded sun.misc.Cleaner from /home/peter/Apps64/jdk1.8.0-ea-b121/jre/lib/rt.jar]*
[Loaded java.io.ByteArrayInputStream from /home/peter/Apps64/jdk1.8.0-ea-b121/jre/lib/rt.jar]
[Loaded sun.util.calendar.ZoneInfoFile$ZoneOffsetTransitionRule from /home/peter/Apps64/jdk1.8.0-ea-b121/jre/lib/rt.jar]
...

I'm on linux, 64bit and using official EA build 121 of JDK 8... But if I try with JDK 7u45, I don't see it. So what changed between JDK 7 and JDK 8? I suspect the following:

8007572: Replace existing jdk timezone data at java.home/lib/zi with JSR310's tzdb

Regards, Peter
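The two shapes of the 'pending' check discussed in this message can be put side by side in a small sketch. This is not the JDK source: PendingReadSketch, Node, push and the two poll methods are hypothetical names, and a simple linked list guarded by a lock stands in for the real pending list.

```java
// Hypothetical sketch of the two ways to read 'pending' under the lock.
final class PendingReadSketch {
    static final class Node {
        final String name;
        Node discovered; // next element in the pending list
        Node(String name) { this.name = name; }
    }

    private static final Object lock = new Object();
    private static Node pending;

    static void push(Node n) {
        synchronized (lock) {
            n.discovered = pending;
            pending = n;
        }
    }

    // Variant 1: read the field once into a local, then test the local.
    static Node pollReadOnce() {
        Node r;
        synchronized (lock) {
            r = pending;
            if (r != null) {
                pending = r.discovered;
                r.discovered = null;
            }
        }
        return r; // may be null; the caller must check
    }

    // Variant 2: test the field, then assign the local only when non-null.
    static Node pollTestThenAssign() {
        synchronized (lock) {
            if (pending != null) {
                Node r = pending;
                pending = r.discovered;
                r.discovered = null;
                return r;
            }
            return null;
        }
    }
}
```

Because both variants hold the lock for the whole read-and-unlink step, they behave the same here; the difference is only whether 'r' can ever hold null.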
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
Hi Peter,

If you are a committer, would you like to take this further? Otherwise, perhaps David could sponsor this change.

-- Thanks kalyan

On 1/24/14 4:05 PM, Peter Levart wrote: On 01/24/2014 02:53 AM, srikalyan chandrashekar wrote:

Hi David, yes thats right, only benefit i see is we can avoid assignment to 'r' if pending is null.

Hi Kalyan, Good to hear that test runs without failures so far. Regarding assignment of 'r'. What I tried to accomplish with the change was eliminate double reading of 'pending' field. I have a mental model of local variable being a register and field being a memory location. This may be important if the field is volatile, but for normal fields, I guess the optimizer knows how to compile such code most optimally in either case. The old (your) version is better from logical perspective, since it guarantees that dereferencing the 'r', wherever it is possible, will never throw NPE (dereferencing where 'r' is not assigned is not possible because of definitive assignment rules). So I support going back to your version...

Regards, Peter
Re: Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently
On 1/26/14 11:07 AM, Peter Levart wrote: On 01/25/2014 05:35 AM, srikalyan chandrashekar wrote:

Hi Peter, if you are a committer would you like to take this further (OR) perhaps david could sponsor this change.

Hi, Here's new webrev that takes into account Kalyan's and David's review comments: cr.openjdk.java.net/~plevart/jdk9-dev/OOMEInReferenceHandler/webrev.02/

I changed into using Class.forName() instead of Unsafe for class preloading and initialization just to be on the safe side regarding unwanted premature initialization of Unsafe class. I also took the liberty of removing an unneeded semicolon (line 114) and fixing a JDK 8 compile time error in generics (line 189):

    incompatible types: java.lang.ref.ReferenceQueue<capture#1 of ? super java.lang.Object> cannot be converted to java.lang.ref.ReferenceQueue<java.lang.Object>

I re-ran the java/lang/ref tests and they pass. Can I count you as a reviewer, Kalyan? If I get a go also from David, I'll commit this to jdk9/dev...

Hi Peter, I do not have review rights, so it has to be someone else from core-libs-dev.

Regards, Peter

-- Thanks kalyan
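The preloading approach Peter mentions (Class.forName() rather than Unsafe) can be illustrated with a small sketch. It is not the code from the webrev: PreloadSketch, Helper, ensureInitialized and demo are hypothetical names, and the counter exists only to make the initialization observable.

```java
// Hypothetical sketch of eager class initialization via Class.forName().
final class PreloadSketch {
    static int initCount = 0;

    // Helper's static initializer runs only when the class is initialized.
    static final class Helper {
        static { initCount++; }
    }

    // Forces loading and initialization of c up front, so that no class
    // initialization (and its allocation) is needed later on a hot path.
    static void ensureInitialized(Class<?> c) {
        try {
            Class.forName(c.getName(), true, c.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw (Error) new NoClassDefFoundError(e.getMessage()).initCause(e);
        }
    }

    // Returns {count before, count after first call, count after second call}.
    static int[] demo() {
        int before = initCount;           // a class literal alone does not initialize
        ensureInitialized(Helper.class);
        int after = initCount;
        ensureInitialized(Helper.class);  // already initialized: no effect
        int again = initCount;
        return new int[] { before, after, again };
    }
}
```

Doing this when the handler thread is set up moves the Class object allocation to a point where the heap is not yet exhausted, which is the motivation given earlier in the thread for preloading.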