While working on a patch for https://bugs.openjdk.org/browse/JDK-8377907, we 
saw that error analysis in this area can take an unreasonable amount of time, 
especially for cases that are not directly reproducible on a developer machine.

This is because if an error occurs in the child process between the fork and 
exec stages (FORK mode) or between the first and second exec stages 
(POSIX_SPAWN mode), the ways in which we can communicate the errors are quite 
limited: standard IO may not yet work, and there is no logging. All we have is 
a single 32-bit value that we send back to the parent.

Today, this 32-bit value contains errno or a handful of our own error codes 
(which even overlap with errno values).

Following an idea by @RogerRiggs, this patch enhances the returned code by 
encoding more information:

1) An 8-bit step number: this shows us the exact step at which the child 
program encountered an error.
2) An optional 8-bit errno: like today, but only set for errors that are 
directly associated with an API call
3) An optional 16-bit "hint": Free-form information whose meaning depends on 
the step.

The system is clean and easily expandable with more detailed error information, 
should we need it. We can also bump the error code to 64-bit to get more 
encoding space, but I left it at 32-bit for now.

------

Error handling:

When an error occurs, we now attempt to send the error code back to the parent 
as before, but if that fails (e.g., because the fail pipe was not established), 
we also print the error code and exit the child process with an exit code 
corresponding to the step number.

Where we print the error code, we use the form `(<step>-<hint>-<errno>)`. For 
example, "(2-0-0)" is a jspawnhelper version mismatch. "(3-17-9)" means that in 
step 3 (early jspawnhelper setup), we found that file descriptor 17 was invalid 
(9=EBADF). We only do this for internal errors that are meant to be read by us, 
not by the end user.

As a by-product of this patch, we now get higher error granularity for some 
user-facing errors. E.g., if the caller handed in an invalid working directory, 
the IOE exception message now reads "Invalid working directory" instead of the 
generic "Exec failed".

------

Implementation notes:

- The "CHILD_ALIVE" ping was changed to be a special errcode with a special 
step, `ESTEP_CHILD_ALIVE`; but it works the same.
- Where functions were changed to return error codes in case of an error, I 
always followed the same scheme. See comment in childproc.c.
- I started to adopt `bool` in functions I changed, since I think it reads 
cleaner. Apart from that, I kept changes as small as possible.

-----

Testing:
- I manually ran tests on Linux x64 (glibc and muslc), MacOS arm64 and AIX.
- GHAs ran successfully


Note: I intend to push this fix first, and then rebase 
https://github.com/openjdk/jdk/pull/29939 to use it. That will make it easier 
for us to hunt down the remaining issues in 
https://github.com/openjdk/jdk/pull/29939.

-------------

Commit messages:
 - comment changes
 - small cosmetics
 - fix release build
 - start

Changes: https://git.openjdk.org/jdk/pull/30232/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=30232&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8379967
  Stats: 467 lines in 9 files changed: 368 ins; 21 del; 78 mod
  Patch: https://git.openjdk.org/jdk/pull/30232.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/30232/head:pull/30232

PR: https://git.openjdk.org/jdk/pull/30232

Reply via email to