Note: still a draft, please ignore for now; the patch is not done yet.

This patch enables hs-err file generation for native out-of-stack cases. It is an optional analysis feature one can use when JVMs mysteriously vanish - typically, vanishing JVMs are either native stack overflows or OOM kills.
This was motivated by the analysis difficulties of bugs like https://bugs.openjdk.org/browse/JDK-8371630. There are many more examples.

### Motivation

Today, when the native stack overflows, the JVM dies immediately without an hs-err file. This is because C++-compiled code does not bang the stack - if the stack is too small, we walk right into whatever caps the stack: our own yellow/red guard pages, native guard pages placed by libc or the kernel, or possibly the unmapped area after the end of the stack. Since we have no stack left to run the signal handler on, we cannot produce the hs-err file.

If one is very lucky, the libc writes a short "Stack overflow" to stderr. But usually not: if it is a JavaThread and we run into our own yellow/red pages, it counts as a simple segmentation fault from the OS's point of view, since the fault address lies inside what the OS considers a valid pthread stack. So, typically, you just see "Segmentation fault" on stderr.

***Why do we need this patch? Don't we bang enough space for native code we call?***

We bang when entering a native function from Java. The maximum stack size we assume at that time might not be enough; moreover, the native code may be buggy or just too deeply or infinitely recursive.

***We could just increase `ShadowPages`, right?***

Sure, but the point is that we have no hs-err file, so we do not even know it was a stack overflow. One would have to start debugging, which is work-intensive and may not even be possible in a customer scenario. And for buggy recursive code, any `ShadowPages` value might be too small; the code would need to be fixed.

### Implementation

The patch uses alternative signal stacks. That is a simple, robust solution with few moving parts. It works out of the box for all cases:

- Stack overflows inside native JNI code called from Java
- Stack overflows inside Hotspot-internal JavaThread children (e.g. CompilerThread, AttachListenerThread etc.)
- Stack overflows in non-Java threads (e.g. VMThread, ConcurrentGCThread)
- Stack overflows in outside threads that are attached to the JVM, e.g. third-party JVMTI threads

The drawback of this simplicity is that it is not suitable for always-on production use, due to the added footprint cost of the alternative stacks: every Java thread is almost guaranteed to hit the signal handler, and thus touch its signal stack, during normal operation (e.g. polling page accesses). So, some pages of each signal stack will always be paged in. That increases the cost per thread by a few pages.

#### Signal processing flow with this patch:

1) JavaThread:
   - Runs into the yellow page (this is where we die today)
   - Switches to the alternative stack and enters the signal handler
   - Disables the yellow page
   - Returns from the signal handler, continues the native function on the main stack
   - Runs into the red page, enters signal handling again on the alternative stack
   - Prints "An irrecoverable stack overflow has occurred."
   - Generates the hs-err file

2) NonJavaThread:
   - Runs into the native guard page or the unmapped area after the stack (this is where we die today)
   - Enters signal handling on the alternative stack
   - Generates the hs-err file

***Will running the error handler on a separate call stack not mess up the printed call stack in hs-err files?***

No. In `VMError::report()`, we use the `ucontext_t` handed to us by the kernel for the *initial crash* that happened on the *original stack*. Any follow-up (secondary) errors during signal handling continue to use that context.

***What happens when the signal stack itself is too small?***

Then we die :-). Hopefully, by that time, we have produced enough hs-err content to analyze the problem. Note that the signal stack size can be increased via `-XX:AltSigStackSize`.

***But we use signal handling for non-error conditions, e.g. implicit null checks. Does that still work?***

Yes. No code that runs during non-error signal handling assumes it runs on the same stack as the faulting code.
***And during error handling?***

I found only one place that assumes we run the signal handler on the main stack: the constructor of `methodHandle`. That affects error reporting only very slightly, when attempting to print code blob attributes. It could be fixed, but I left that as a potential follow-up to keep this patch small.

### Testing

- New regression tests cover handling of native stack overflows for:
  - Java threads invoking JNI code (see NativeStackoverflowTest.java)
  - hotspot-internal `JavaThread` and `NonJavaThread` threads (see gtest)
- I enabled the feature by default and ran hotspot tier1 and tier2; SAP also ran this through their CI/CD. No problems surfaced.
- GHAs ran with the feature both enabled by default and disabled.
- I ran a massive multithreaded test for hours, with a lot of thread-creation churn, to check that we do not accidentally leak stacks.

-------------

Commit messages:
 - fix gtests
 - default-off
 - reduce diff
 - copyrights
 - wip
 - wip
 - fixes
 - fix linux build
 - macos fixes
 - fixes
 - ... and 9 more: https://git.openjdk.org/jdk/compare/d62b9f78...fb11777f

Changes: https://git.openjdk.org/jdk/pull/29559/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29559&range=00
 Issue: https://bugs.openjdk.org/browse/JDK-8373128
 Stats: 553 lines in 31 files changed: 521 ins; 6 del; 26 mod
 Patch: https://git.openjdk.org/jdk/pull/29559.diff
 Fetch: git fetch https://git.openjdk.org/jdk.git pull/29559/head:pull/29559

PR: https://git.openjdk.org/jdk/pull/29559
