Hi everybody, I've analyzed this issue on AIX (and HPUX) and came to the conclusion that it is not a problem on these platforms.
Just to make sure I got everything right, I'll summarize my understanding of the problem here: On MacOS, setjmp() saves the signal mask of the current thread BUT restores the signal mask of the whole process after the corresponding longjmp(). This obviously changes the signal mask of all threads which had a different signal mask than the initial thread who called setjmp(). I could easily reproduce this by loading a class which provokes a verification exception in the un-fixed libverify.so. Afterwards, the VM-thread won't listen to SIGQUIT anymore because it now masks this signal in the same way like the Java thread which had the verification exception. On AIX/HPUX setjmp()/longjmp() is defined in the same way like on MacOS in that it saves and restores the signal mask (and both platforms provide _setjmp()/_longjmp() versions as well which do not manipulate the signal mask). But while it is not clear from the documentation, all my tests showed that longjmp() only restores the signal mask of the current thread, which should be OK (i.e. I couldn't see a blocked VM thread which doesn't react on SIQUIT anymore). So to cut a long story short - no action seems to be required to fix this issue on AIX: As a side note I wonder why we didn't use sigsetjmp()/siglongjmp() with a zero 'savemask' argument to fix this problem? It is standardized by POSIX and would avoid the usage of "#ifdef __APPLE". Thank you and best regards, Volker On Fri, May 16, 2014 at 6:04 PM, David DeHaven <david.deha...@oracle.com> wrote: > > I'd thought about that, since Apple borrowed most of it's underpinnings from > BSD, but had no evidence to suggest it was necessary. > > Anyways, at least you've identified and can rectify the issue :) > > -DrD- > >> Wow, sometimes it really makes sense to read apparently unrelated >> email-threads on Friday afternoon:) >> >> It seems that AIX/HPUX have exactly the same problem like MacOS X. >> From the AIX setjmp man-page: >> >> "The setjmp and longjmp subroutines save and restore the signal mask >> sigmask (2), while _setjmp and _longjmp manipulate only the stack >> context." >> >> From the HPUXM setjmp man-page: >> >> "setjmp(), longjmp() These always save and restore the signal mask. >> _setjmp(), _longjmp() These never manipulate the signal mask." >> >> I think it doesn't make sense for you to wait pushing this until I >> provide the corresponding AIX patches because anyway I'll have to >> provide a fix not only for this issue but also for the other bugs you >> mentioned (i.e. 8023786 and 8023720). So I'll better create a new bug >> for AIX. >> >> Thank you and best regards, >> Volker >> >> On Fri, May 16, 2014 at 5:38 PM, David DeHaven <david.deha...@oracle.com> >> wrote: >>> >>> Thanks! >>> >>> -DrD- >>> >>>> The splashscreen changes look fine to me. Approved. >>>> >>>> -- >>>> best regards, >>>> Anthony >>>> >>>> On 5/16/2014 7:18 PM, David DeHaven wrote: >>>>> >>>>> Could someone on AWT team approve the splashscreen changes? >>>>> >>>>> -DrD- >>>>> >>>>>> Approved. >>>>>> >>>>>> -phil. >>>>>> >>>>>> On 5/15/2014 9:31 AM, David DeHaven wrote: >>>>>>> Ping! >>>>>>> >>>>>>> Does this look OK? >>>>>>> >>>>>>> I've also filed an issue against JavaFX: >>>>>>> https://javafx-jira.kenai.com/browse/RT-37125 >>>>>>> >>>>>>> -DrD- >>>>>>> >>>>>>>>>>> I tried not modifying libpng but still ended up with lingering >>>>>>>>>>> references to longjmp in pngread.o, despite libpng having >>>>>>>>>>> png_ptr->longjmp_fn (bug in libpng?). pngread.c calls setjmp to set >>>>>>>>>>> a default location to jump to in case the caller doesn't call >>>>>>>>>>> setjmp, so if we continue down this path something in libpng must >>>>>>>>>>> be modified. The only other option is to create our own setjmp.h >>>>>>>>>>> and order it before /usr/include/setjmp.h, which seems dubious at >>>>>>>>>>> best. >>>>>>>>>>> >>>>>>>>>>> I'm curious if the libpng changes are even needed since it's only >>>>>>>>>>> used for splashscreen, which happens very early in the launch >>>>>>>>>>> process. Also note that we didn't originally even call >>>>>>>>>>> png_set_longjmp_fn, so any error should have resulted in an abort() >>>>>>>>>>> instead of a call to longjmp... it appears we could retain the >>>>>>>>>>> functionality we have today and #undef PNG_SETJMP_SUPPORTED >>>>>>>>>>> (pngconf.h?). That would put the onus on developers to make sure >>>>>>>>>>> their pngs don't have errors in them, or libsplashscreen will >>>>>>>>>>> abort()... >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> That's an interesting question and the answer might extend to the >>>>>>>>>> splashscreen changes too. >>>>>>>>>> Its platform specific code and on MAC, the thread is created using >>>>>>>>>> pthreads directly and that >>>>>>>>>> thread goes away once splashscreen is done. But its running at the >>>>>>>>>> same time as the VM >>>>>>>>>> is booting up and creating threads and setting their signal masks. >>>>>>>>>> So I don't think you can >>>>>>>>>> guarantee that it won't mess up the masks on the JRE threads if the >>>>>>>>>> PNG is bad. And I'm >>>>>>>>>> also not sure you want to remove error handling from the library >>>>>>>>>> either. >>>>>>>>>> So a HIGHLY VISIBLE DO NOT REMOVE comment might be the best you can >>>>>>>>>> do here. >>>>>>>>> I have a better idea: >>>>>>>>> >>>>>>>>> png_default_error is the only place where png_longjmp is called. We >>>>>>>>> could call png_set_error_fn to set up our own error handler (for Mac >>>>>>>>> only), compile with PNG_SETJMP_SUPPORTED unset so it doesn't pull in >>>>>>>>> setjmp/longjmp and our own implementation of the error handler would >>>>>>>>> call _longjmp, which would jump back to where we call setjmp >>>>>>>>> currently. >>>>>>>> Ok, I figured out what's going on. It's not quite intuitive... >>>>>>>> >>>>>>>> png_jmpbuf is a macro defined in png.h, this calls png_set_longjmp_fn >>>>>>>> with longjmp, which is why I was seeing references to longjmp in the >>>>>>>> object file. That's what was throwing me off as it seems like it >>>>>>>> should only be getting the jmp_buf ptr stored in the png_ptr. I guess >>>>>>>> the intention was that setjmp/longjmp was optional, if you don't call >>>>>>>> setjmp then it just abort()s. >>>>>>>> >>>>>>>> >>>>>>>> I changed splashscreen_png.c to: >>>>>>>> #ifdef __APPLE__ >>>>>>>> if (_setjmp(png_set_longjmp_fn(png_ptr, _longjmp, sizeof(jmp_buf)))) >>>>>>>> { >>>>>>>> #else >>>>>>>> if (setjmp(png_jmpbuf(png_ptr))) { >>>>>>>> #endif >>>>>>>> >>>>>>>> and it calls _longjmp instead. I verified this works by changing the >>>>>>>> macro to set png_longjmp to exit() and without the above change it >>>>>>>> does indeed exit prematurely with a bad png, with the change it >>>>>>>> reports the error but continues to load the application as would be >>>>>>>> expected. >>>>>>>> >>>>>>>> pngread.o still has a symbol table entry for _longjmp instead of >>>>>>>> __longjmp, but it's benign since we're ultimately forcing it to use >>>>>>>> the correct function. So I've left libpng completely unchanged. >>>>>>>> >>>>>>>> >>>>>>>> With the change and using a bad png for splashscreen, I was able to >>>>>>>> get a stack trace once the application was running. Without the change >>>>>>>> to splashscreen_png.c, jstack was unable to connect to the process. So >>>>>>>> splashscreen absolutely can interfere with the signal handling. >>>>>>>> >>>>>>>> >>>>>>>> Updated webrev: >>>>>>>> http://cr.openjdk.java.net/~ddehaven/8026385/jdk.1/ >>>>>>>> >>>>>>>> I can look into writing a regression test for this. It might not be >>>>>>>> trivial though since we're dealing with signal handlers, and if timing >>>>>>>> is a factor the test may not be reliable. >>>>>>>> >>>>>>>> -DrD- >>>>>>>> >>>>>> >>>>> >>> >