On Mon, 15 May 2023 18:45:24 GMT, Roger Riggs <rri...@openjdk.org> wrote:

>> Since JDK13, executing commands in a sub-process defaults to the so called 
>> `POSIX_SPAWN` launching mechanism (i.e. 
>> `-Djdk.lang.Process.launchMechanism=POSIX_SPAWN`) on Linux. This works by 
>> using `posix_spawn(3)` to firstly start a tiny helper program called 
>> `jspawnhelper` in a subprocess. In a second step, `jspawnhelper` reads the 
>> command data from the parent Java process over a Unix pipe and finally 
>> executes (i.e. `execvp(3)`) the requested command.
>> 
>> In cases where the parent process terminates abnormally before 
>> `jspawnhelper` has read all the expected data from the pipe, `jspawnhelper` 
>> will block indefinitely on the reading end of the pipe. This is especially 
>> harmful if the parent process had open sockets, because in that case, the 
>> forked `jspawnhelper` process will inherit them and keep all the 
>> corresponding ports open effectively preventing other processes to bind to 
>> them. Notice that this is not an abstract scenario. We've observed this 
>> regularly in production with services which couldn't be restarted after a 
>> crash after migrating to JDK 17.
>> 
>> The fix of the issue is rather trivial. `jspawnhelper` has to close its 
>> writing end of the pipe which connects it with the parent Java process 
>> *before* starting to read from that pipe such that reads from the pipe can 
>> immediately return with EOF if the parent process terminates abnormally.
>> 
>> Also did some cleanup:
>>  - in `jspawnhelper.c` correctly chek the result of `readFully()`
>>  - in `ProcessImpl_md.c` use `restartableWrite()` instead of `write()` to 
>> write the command data from the parent process to `jspawnhelper`
>>  - in `childproc.{c,h}` remove remaining traces of `MODE_CLONE` because 
>> that's long gone.
>
> I think writing a test is worth a little bit of time but getting a clean test 
> run using the KILLTEST might be a bit tricky. The exit(-1) of the parent 
> would be detected by jtreg as a failure. And verifying that the spawned child 
> runs to completion might be difficult without writing a driver app to monitor 
> the child; though it may be sufficient to know just that it does not hang and 
> timeout. When the parent dies, the child will be re-parented.
> And it would only be run in debug builds.

> @RogerRiggs @simonis
> 
> Okay, I wrote a jtreg test case for the issue, just to see if I could. 
> Rebased atop of Volkers change, you can just take the commit if you want:
> 
> https://github.com/tstuefe/jdk/tree/Added-TestCase-For-Volker 
> [9119b44](https://github.com/openjdk/jdk/commit/9119b442dac8dd95228d47d70871ee59862649ce)
> 
> Test case fails immediately (so, no relying on timeouts) if Volkers patch is 
> missing. Succeeds immediately with Volkers patch. I also tried commenting out 
> the fixing close() call, test fails immediately.
> 
> Patch relies on jspawnhelper to write a small marker file on communication 
> error with parent if KILLTEST is set. Only in debug code. I think this is 
> very reliable. No second-guessing aliveness of the child and worrying about 
> PID reuse. Re-parenting is no issue either. And I think such a file make be a 
> good idea anyway to analyze handshake errors with jspawnhelper.
> 
> Tested on Linux x64.
> 
> @simonis, feel free to add me as contributor if you take this test. I spent 
> way to much time on this :-)

Thank you @tstuefe . Unfortunately I saw this too late yesterday, when I 
already almost finished my version :)

I've now also added some "simulted crashes" to `jspawnhelper` at different 
points of the handshake. This didn't revealed any new issues but might be 
helpful for the future.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13956#issuecomment-1551326215

Reply via email to