Still a draft, please ignore for now. The patch is not done yet.

This patch enables hs-err file generation for native out-of-stack cases. It is 
an optional analysis feature one can use when JVMs mysteriously vanish - 
typically, vanishing JVMs are caused by either native stack overflows or OOM 
kills.

This was motivated by the analysis difficulties of bugs like 
https://bugs.openjdk.org/browse/JDK-8371630. There are many more examples.

### Motivation

Today, when a native stack overflows, the JVM dies immediately without an 
hs-err file. This is because C++-compiled code does not bang the stack (touch 
pages ahead of the stack pointer) - if the stack is too small, we walk right 
into whatever caps the stack. That might be our own yellow/red guard pages, 
native guard pages placed by libc or the kernel, or possibly an unmapped area 
after the end of the stack. 

Since we don't have a stack left to run the signal handler on, we cannot 
produce the hs-err file. If one is very lucky, the libc writes a short "Stack 
overflow" to stderr. But usually not: if it is a JavaThread and we run into our 
own yellow/red pages, it counts as a simple segmentation fault from the OS's 
point of view, since the fault address is inside of what it thinks is a valid 
pthread stack. So, typically, you just see "Segmentation fault" on stderr.

***Why do we need this patch? Don't we bang enough space for native code we 
call?***

We bang when entering a native function from Java. The maximum stack size we 
assume at that time might not be enough; moreover, the native code may be buggy 
or just too deeply or infinitely recursive. 

***We could just increase `ShadowPages`, right?***

Sure, but the point is we have no hs-err file, so we don't even know it was a 
stack overflow. One would have to start debugging, which is work-intensive and 
may not even be possible in a customer scenario. And for buggy recursive code, 
any `ShadowPages` value might be too small. The code would need to be fixed.

### Implementation

The patch uses alternative signal stacks. That is a simple, robust solution 
with few moving parts. It works out of the box for all cases: 
- Stack overflows inside native JNI code called from Java 
- Stack overflows inside HotSpot-internal JavaThread subclasses (e.g. 
CompilerThread, AttachListenerThread etc.)
- Stack overflows in non-Java threads (e.g. VMThread, ConcurrentGCThread)
- Stack overflows in outside threads that are attached to the JVM, e.g. 
third-party JVMTI threads

The drawback of this simplicity is that it is not suitable for always-on 
production use. That is due to the added footprint costs of alternative stacks: 
every Java thread is almost guaranteed to hit the signal handler and thus use 
its signal stack during normal operation (e.g. polling page accesses). So, some 
pages of their signal stacks will always be paged in. That increases the cost 
per thread by a few pages.

#### Signal processing flow with this patch:

1) JavaThread:
  - Runs into the yellow page (this is where we die today)
  - Switches to the alternative stack and enters the signal handler
  - Disables the yellow page
  - Returns from the signal handler and continues the native function on the
    main stack
  - Runs into the red page and enters signal handling again on the alternative
    stack
  - Prints "An irrecoverable stack overflow has occurred."
  - Generates the hs-err file

2) NonJavaThread:
  - Runs into a native guard page or the unmapped area after the stack (this is
    where we die today)
  - Enters signal handling on the alternative stack
  - Generates the hs-err file

***Will running the error handler on a separate stack not mess up the 
printed call stack in hs-err files?***

No. In `VMError::report()`, we use the `ucontext_t` handed to us by the kernel 
for the *initial crash* that happened on the *original stack*. Any 
follow-up (secondary) errors during signal handling also continue to use that 
context. 

***What happens when the signal stack itself is too small?***

Then we die :-). Hopefully, by that time, we have produced enough hs-err 
content to analyze the problem. Note that the signal stack size can be 
increased via `-XX:AltSigStackSize`.

***But we use signal handling for non-error conditions, e.g. implicit null 
checks? Does that still work?***

Yes. No code that runs during non-error signal handling assumes the same stack.

***And during error handling?***

I found only one place that assumes we run the signal handler on the main 
stack, which is the constructor of `methodHandle`. That affects error reporting 
only very slightly when attempting to print code blob attributes. 

It could be fixed, but I left this as a potential fix for later to keep this 
patch small.

### Testing

- New regression tests cover the handling of native stack overflows for: 
   - Java threads invoking JNI code (see NativeStackoverflowTest.java)
   - HotSpot-internal `JavaThread` and `NonJavaThread` threads (see gtest)
- I enabled the feature by default and ran `hotspot` `tier1` and `tier2`, and 
SAP ran that through their CI/CD. No problems surfaced.
- GHAs ran with the feature enabled by default and disabled.
- I ran a massive multithreaded test for hours, with a lot of thread creation 
churn, to observe whether we accidentally leak stacks.

-------------

Commit messages:
 - fix gtests
 - default-off
 - reduce diff
 - copyrights
 - wip
 - wip
 - fixes
 - fix linux build
 - macos fixes
 - fixes
 - ... and 9 more: https://git.openjdk.org/jdk/compare/d62b9f78...fb11777f

Changes: https://git.openjdk.org/jdk/pull/29559/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29559&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8373128
  Stats: 553 lines in 31 files changed: 521 ins; 6 del; 26 mod
  Patch: https://git.openjdk.org/jdk/pull/29559.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/29559/head:pull/29559

PR: https://git.openjdk.org/jdk/pull/29559
