Re: wait -n misses signaled subprocess

Steven Pelley Wed, 24 Jan 2024 10:09:57 -0800

Apologies for the quick double post.  strace turned out to be fairly
straightforward, and it confirms that bash is properly reaping the
killed processes.  This isn't a matter of the wait syscall failing to
return the signaled child process.


Running the test from my original post and producing:
TEST: KILL PRIOR TO wait -n @0
kill -TERM 6941 @0
./test.sh: line 13: wait: 6941: no such job
wait -n 6941 return code 127 @2 (BUG)
wait 6941 return code 143 @2
TEST: KILL DURING wait -n @2
kill -TERM 6970 @3
wait -n 6970 return code 143 @3
wait 6970 return code 143 @3

strace shows:
kill(6941, SIGTERM)                     = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=6941,
si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 6941
wait4(-1, 0xffffc62b6d50, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})

and

wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 6970
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0},
{sa_handler=0xaaaad98a21a4, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=6970,
si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 6972
wait4(-1, 0xffffc62b6860, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})

The process signaled prior to wait -n (pid 6941) is reaped by the
nonblocking wait4 call inside the SIGCHLD signal handler, which reports
that it was terminated by SIGTERM.
The process signaled during wait -n (pid 6970) is reaped by a blocking
call to wait4 before the SIGCHLD signal reporting its death is
delivered; that call also reports that it was terminated by SIGTERM.
The only difference I see here is whether the subprocess is reaped by
the blocking call or by the nonblocking call inside the SIGCHLD
handler.  For what it's worth, I see subprocesses that terminate
without a signal also showing up in wait4 calls outside the SIGCHLD
handler, but this could easily be a matter of chance timing and a red
herring.
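
For reproduction, the two cases can be reduced to something like the
sketch below.  This is not the test.sh from the original post: the
variable names and timings are made up for illustration, and in the
second case the child signals itself (an assumption made here so that
no second background job is involved).  Going by the output above, an
affected bash would likely report 127 for the first case, where 143 is
the expected result; the second case should report 143.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the original test.sh.

# Case 1: the child is signaled and reaped BEFORE wait -n is called.
sleep 30 &
pid_prior=$!
kill -TERM "$pid_prior"
sleep 1                               # give bash time to reap the child
wait -n "$pid_prior" 2>/dev/null      # stderr hidden: may print "no such job"
rc_prior_n=$?

# Case 2: the child is signaled WHILE wait -n is blocked.
# Assumption: the child sends SIGTERM to itself after one second.
bash -c 'sleep 1; kill -TERM $$' &
pid_during=$!
wait -n "$pid_during"
rc_during=$?

echo "signaled prior to wait -n:  rc=$rc_prior_n"
echo "signaled during wait -n:    rc=$rc_during"
```

Running the script under strace with -f and a filter on wait4 should
show which wait4 call reaps each child, in the same way as the traces
above.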

Steve

On Wed, Jan 24, 2024 at 12:40 PM Steven Pelley <stevenpel...@gmail.com> wrote:
>
> > In the first case, if the subprocess N has terminated, its report is
> > still queued and "wait" retrieves it.  In the second case, if the
> > subprocess N has terminated, it doesn't exist and as the manual page
> > says "If id specifies a non-existent process or job, the return status
> > is 127."
> >
> > What you're pointing out is that that creates a race condition when the
> > subprocess ends before the "wait".  And it seems that the kernel has
> > enough information to tell "wait -n N", "process N doesn't exist, but
> > you do have a queued termination report for it".  But it's not clear
> > that there's a way to ask the kernel for that information without
> > reading all the queued termination reports (and losing the ability to
> > return them for other "wait" calls).
>
> Thanks for the response, but I don't believe this is correct.
>
> Your understanding of the wait syscall is correct except that the exit
> code and process information always remains available until the
> process is awaited by its parent -- it is the wait syscall that itself
> reaps the process and makes it unavailable to later searches by pid.
> There is a possibility that the parent (bash in this case) might reap
> the process in multiple ways (i.e., from different threads, setting
> the SIGCHLD disposition to SIG_IGN, setting flag SA_NOCLDWAIT for the
> SIGCHLD handler -- the last 2 from NOTES of man waitpid on linux) that
> race with each other, but the parent is always given an opportunity to
> read the exit code and reap the process if not disabled with SIGCHLD
> handler configuration.
>
> My understanding of bash is that it internally maintains a queue/list
> of finished child jobs to return such that wait -n mimics aspects of
> the wait syscall.  The discussion at
> https://lists.gnu.org/archive/html/bug-bash/2023-05/msg00063.html
> supports that bash "silently" reaps child processes and decouples the
> wait syscall from the wait command.
>
> I assume it's possible to confirm that bash is awaiting the process
> and retrieving the exit code via ptrace/strace but I'm unfamiliar with
> these tools or bash logs.
>
> The test below allows the subprocess to complete normally, without
> being signaled, and then successfully retrieves its exit code via wait
> -n.  This subprocess terminates before the call to wait -n.  I see no
> documented reason that a process terminating without signal prior to
> wait -n should be returned while a process terminating with signal
> prior to wait -n should not.
>
> echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
> { sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
> pid=$!
> echo "child proc $pid @${SECONDS}"
>
> sleep 2
> wait -n $pid
> echo "wait -n $pid return code $? @${SECONDS}"
>
>
> For which I get output:
> TEST: EXIT 0 PRIOR TO wait -n @0
> child proc 2270 @0
> child finishing @1
> wait -n 2270 return code 1 @2
>
>
> Steve
