Geoffrey Keating wrote:
The intermittent failures on Darwin are due to a kernel bug tripped by
java.lang.Process.waitFor().
The bug appears to be that if:
- the program is multithreaded
- it is blocking SIGCHLD
- it receives a SIGCHLD due to a process terminating
- later it calls sigsuspend (but not sigwait)
then the SIGCHLD may never be delivered, and so the process will wait
for it forever.
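It's intermittent because it works fine if the sigsuspend starts before
the SIGCHLD is sent. This also explains why it happens more often with
gij: presumably the interpreter takes longer to reach the sigsuspend, so
the child has more often exited by then.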
I've filed this as <rdar://problem/4736203>. We could work around it
by using a timeout of some kind; for example, creating a new thread
which sends a SIGCHLD manually after some period of time. (Obviously,
only on Darwin, and maybe only on versions with the bug.) Do we think
this is a good idea?
The obvious solution would be to use a kernel without the bug. But
since you are proposing a workaround, I assume you also want a solution
for kernels that have the bug.
Your suggestion to have a thread that periodically sends SIGCHLD seems
like it should work (but would be ugly). You could have a configure
test to detect configurations where the workaround was needed. Then in
natPosixProcess.cc you would add code to create said thread based on the
configure check.
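
To illustrate the gating, here is a minimal C sketch of how the
configure result might be consumed. The macro name
SIGSUSPEND_LOSES_SIGCHLD and both function names are hypothetical, and
the macro is defined by hand here only so the sketch compiles on its
own; in a real build it would come from the configure check. The
run-time uname() test is an extra assumption of mine, for the case where
the configure result alone is not trusted.

```c
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical macro: a configure check would define it in the
   generated config header when the workaround is needed. */
#ifndef SIGSUSPEND_LOSES_SIGCHLD
#define SIGSUSPEND_LOSES_SIGCHLD 1
#endif

/* Returns 1 if the workaround thread should be created on a system
   whose uname sysname is `sysname`, 0 otherwise. */
int workaround_applies_to(const char *sysname)
{
#if SIGSUSPEND_LOSES_SIGCHLD
    return strcmp(sysname, "Darwin") == 0;
#else
    (void)sysname;
    return 0;
#endif
}

/* Same question for the machine we are running on. */
int workaround_applies_here(void)
{
    struct utsname u;

    if (uname(&u) != 0)
        return 0;               /* cannot tell: leave the workaround off */
    return workaround_applies_to(u.sysname);
}
```

The process-waiting code would then create the extra thread only when
workaround_applies_here() returns 1. Narrowing this to specific kernel
releases would need the list of affected versions, which is not known
here.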
David Daney