Hi Thomas,

Some initial comments as this is quite big - but thanks for all the detailed explanations.

On 4/11/2020 1:57 am, Thomas Stuefe wrote:
Hi all,

may I please have opinions and reviews on this cleanup-fix-patch for the 
hotspot signal handlers.

Its main intent is to simplify coding and to commonize some of it across all 
Posix platforms where possible. Also to fix a number of smaller issues.

This will have a number of benefits, mainly easing maintenance pain for porters 
and reducing bitrot for platform dependent code.

This all builds upon the work @gerard-ziemski did with 
https://bugs.openjdk.java.net/browse/JDK-8252324.

----

This cleanup was made more complicated by the fact that there exists a 
non-obvious and undocumented way for a third party app to chain signal handlers 
(beside the documented one of using libjsig). It seems that the 
JVM_handle_xxx_functions() are in fact interfaces for third party coding to 
invoke hotspot signal handling. This only makes sense in combination with 
`-XX:+AllowUserSignalHandlers`. A cursory github search revealed that this flag 
is used quite a bit.

Thanks for doing that search.

See a more in-depth discussion here: [4]. Thanks to @dholmes-ora for untangling 
this bit of history.

Unfortunately there is no official documentation from Sun or Oracle, and zero 
regression tests. So I try to preserve this interface as best as I can. I plan 
to add a proper regression test with a later change, but for now I don't have 
the time for that.

On the documentation front we could at least explain what the flag really does. Rather than saying:

product(bool, AllowUserSignalHandlers, false, \ "Do not complain if the application installs signal handlers "

we could say something like:

product(bool, AllowUserSignalHandlers, false, \ "Allow the application to install the primary signal handlers instead of the JVM." \

and we (I?) could update the java manpage.


---

The fixed issues:

1) `PosixSignals::_are_signal_handlers_installed` is used inside the platform 
handlers to guard a part of the platform handlers against execution in case the 
signal handlers are not yet installed.

Initially this confused me since when this handler is called it would of course 
be installed. So that boolean would always be true. The only explanation I 
found was that since these handlers can be invoked directly from outside, this 
is some (ineffective) form of guard against calling this handler too early.
But that guard can be left out and that boolean removed. Our signal handlers 
are safe to call before VM initialization is completed.

The code that is guarded by this check is implicitly safe as it is trivial and early in VM init the current thread will be null anyway and so we won't execute the guarded code.

2) The return code of JVM_handle_xxx_signal() was inconsistently set (some as 
bool, some as int) as well as unused in normal code paths (excluding outside 
calls).

Ok.

3) JVM_handle_xxx_signal are supposed to be exported, but on AIX there is a 
day-zero bug which caused it to not be exported.

4) Every platform handler has this section:

   JavaThread* thread = NULL;
   VMThread* vmthread = NULL;
   if (PosixSignals::are_signal_handlers_installed()) {
     if (t != NULL ){
       if(t->is_Java_thread()) {
         thread = t->as_Java_thread();
       }
       else if(t->is_VM_thread()){
         vmthread = (VMThread *)t;
       }
     }
   }

`vmthread` is unused on all platforms and can be removed.

Ok. Once upon a time we probably did something different if in the vmThread versus some other non-JavaThread.

5) Every platform handler has some variant of this section, to ignore SIGPIPE, 
SIGXFSZ (whose default actions are to terminate the VM):

   if (sig == SIGPIPE || sig == SIGXFSZ) {
     // allow chained handler to go first
     if (PosixSignals::chained_handler(sig, info, ucVoid)) {
       return true;
     } else {
       // Ignoring SIGPIPE/SIGXFSZ - see bugs 4229104 or 6499219
       return true;
     }
   }

- On s390 and ppc, we miss SIGXFSZ handling
    _Update: Fixed separately for easier backport, see 
[https://bugs.openjdk.java.net/browse/JDK-8255734](JDK-8255734)_.
- both paths return true - section can be shortened

Side note: having handlers for those signals may be unnecessary. We could just 
set the signal handler to `SIG_IGN`. We would have to tiptoe around any third 
party handlers for those signals, but it still may be simpler.

6) At the end of every platform header, before calling into fatal error 
handling, we unblock the signal:

  // unmask current signal
  sigset_t newset;
  sigemptyset(&newset);
  sigaddset(&newset, sig);
  sigprocmask(SIG_UNBLOCK, &newset, NULL);


- Use of `sigprocmask()` is UB in a multithreaded program.
- but then, this section is unnecessary anyway, since 
[https://bugs.openjdk.java.net/browse/JDK-8252533](JDK-8252533) we unmask error 
signals at the start of the signal handler.

But is this guaranteed to be one of the error signals? What if the application calls our handler for some other signal? I guess that is their problem.

7) the JFR crash protection is not consistently checked in all platform 
handlers.

8) On Zero, when entering fatal error handling, we do so via fatal() instead of 
VMError::report_and_die(), thereby discarding the real crash context and 
obfuscating the register content in the hs-err file (we still see registers, 
but those stem from the assertion-poison-page mechanism).

9) on Linux ppc64 and AIX, we have this section:

      if (sig == SIGILL && (pc < (address) 0x200)) {
        goto report_and_die;
      }

which is related to the fact that the zero page on AIX is readable, filled with 
0, and reading instructions from it will yield us a SIGILL, not a SIGSEGV (0 is 
not a noop on PPC, so we don't nop-slide).

This coding is irrelevant on Linux. On AIX, it can also be removed, since this 
SIGILL would be unrecognized by the hotspot and later count as fatal error 
anyway.

10) When invoking the fatal error handler, we extract the pc from the context and hand it 
over as "faulting pc". For SIGILL and SIGFPE, this is not totally correct. 
According to POSIX [3], for those signals the address of the faulting instruction is 
handed over in `si_info.si_addr`.

On most platforms this does not matter, they are the same. But on some 
architectures the pc in the signal context actually points somewhere else, e.g. 
beyond the faulting instruction. Therefore `si_info.si_addr` is the better 
choice.

The Posix spec also states "For some implementations, the value of si_addr may be inaccurate." - so I'm not at all sure which "pc" we should be trusting here? I thought the ucontext was the detailed platform specific "context" object that we should extract information from. Which architectures give different values in the two and is there some documentation stating what happens for any given os/cpu?

----

The changes in this patch:

a) hotspot signal handling is now done by the following functions:

<OS>                 <foreign code, probably signal handler too>
   |                           |
   v                           v
  javaSignalHandler       JVM_handle_linux_signal()
        |                   /
        v                  v
      javaSignalHandler_inner

Not clear why we need the _inner version. Why can't we just have javaSignalHandler which is installed as the handler and which is called by JVM_handle_XXX_signal?

                |
                v
      PosixSignals::pd_hotspot_signal_handler()


The right branch only exists to support the `AllowUserSignalHandlers` case, see 
[4].

`javaSignalHandler` is registered as handler, as it was before; 
JVM_handle_linux_signal() is exported as before.
`javaSignalHandler_inner` contains the shared portion of the signal handler; 
`PosixSignals::pd_hotspot_signal_handler()` contains the remaining platform 
dependent portions.


b) I commonized prologue- and epilogue coding.
   - I simplified (4) to a single line in the shared handler
   - I moved the JFR thread crash protection (7) up to the shared handler
   - I moved the complete epilogue up to the shared handler. That includes 
calling the chained handlers, should they exist, as well as invoking the fatal 
error handler. That fixes (8) and (6)
   - Zero has this tradition of showing a robot telling a cat about the error 
signal, which I like, and kept.
   - I simplified (5) and commonized it, and removed (9) completely
   - In PosixSignals::install_signal_handlers(), I removed the 
`signal_handlers_are_installed` guard and replaced it with an assert. 
Unfortunately this causes lots of indentation changes. @gerard-ziemski: if this 
clashes too much with your patch for JDK-8253742, I'll leave that part out.

Thanks for reviewing.

Testing: this patch ran through our nightlies, in an earlier form. They will be 
re-ran some more times.

I'd be happy if aarch64 porters could take a look at the aarch64 portion of 
this change.

Please note that I had to draw a line somewhere - this is an open ended issue 
and a lot more could be cleaned. See also Gerard's work on [5], which is under 
review too.

:) I've actually gone through this in far more detail as I've been composing this email and overall it is looking very good. I've made a couple of comments directly in the PR, in addition to the above.

Thanks,
David
-----


----

[1] 
https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2020-October/043145.html
[2] 
https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2020-October/043191.html
[3] https://pubs.opengroup.org/onlinepubs/009695399/basedefs/signal.h.html
[4] https://mail.openjdk.java.net/pipermail/jdk-dev/2020-November/004887.html
[5] https://bugs.openjdk.java.net/browse/JDK-8253742

-------------

Commit messages:
  - Initial patch

Changes: https://git.openjdk.java.net/jdk/pull/1034/files
  Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=1034&range=00
   Issue: https://bugs.openjdk.java.net/browse/JDK-8255711
   Stats: 916 lines in 13 files changed: 116 ins; 686 del; 114 mod
   Patch: https://git.openjdk.java.net/jdk/pull/1034.diff
   Fetch: git fetch https://git.openjdk.java.net/jdk pull/1034/head:pull/1034

PR: https://git.openjdk.java.net/jdk/pull/1034

Reply via email to