Apologies for the quick double post. strace is fairly straightforward to use, and it confirms that bash is properly reaping the killed processes. This isn't a matter of the wait syscall failing to return the signaled child process.
Running the test from my original post and producing:

TEST: KILL PRIOR TO wait -n @0
kill -TERM 6941 @0
./test.sh: line 13: wait: 6941: no such job
wait -n 6941 return code 127 @2 (BUG)
wait 6941 return code 143 @2
TEST: KILL DURING wait -n @2
kill -TERM 6970 @3
wait -n 6970 return code 143 @3
wait 6970 return code 143 @3

shows:

kill(6941, SIGTERM) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=6941, si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 6941
wait4(-1, 0xffffc62b6d50, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})

and

wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 6970
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, {sa_handler=0xaaaad98a21a4, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=6970, si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 6972
wait4(-1, 0xffffc62b6860, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})

The process signaled prior to wait -n (pid 6941) is reaped by the
nonblocking wait4 call (WNOHANG) inside the SIGCHLD signal handler,
which reports that it was signaled and terminated due to SIGTERM.

The process signaled during wait -n (pid 6970) is reaped by a blocking
call to wait4, before the SIGCHLD signal for it is delivered. That call
also reports that it was signaled and terminated due to SIGTERM.

The only difference I see here is whether the subprocess is reaped by
the blocking call or by the nonblocking call inside the SIGCHLD
handler. For what it's worth, I also see subprocesses that terminate
without a signal showing up in wait4 calls outside the SIGCHLD handler,
but this could easily be a matter of chance timing and a red herring.

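A minimal sketch of the plain-wait half of this comparison follows. It is not the test.sh from the original post; the 30-second sleep and the one-second pause are arbitrary, illustrative choices. It kills a background child, gives the shell time to reap it, and only then calls wait with the pid. The status should still come back as 143 (128 + 15 for SIGTERM), in line with the "wait 6941 return code 143" output above.

```shell
# Hypothetical reproduction sketch; not the original test.sh.
sleep 30 &                 # long-running background child
pid=$!
kill -TERM "$pid"          # signal the child before any wait is issued
sleep 1                    # give the shell time to reap the child
rc=0
wait "$pid" || rc=$?       # plain wait with an explicit pid; keep its status
echo "wait $pid return code $rc"
```

Replacing the plain wait with wait -n "$pid" gives the case reported above as returning 127 on the affected bash.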
Steve

On Wed, Jan 24, 2024 at 12:40 PM Steven Pelley <stevenpel...@gmail.com> wrote:
>
> > In the first case, if the subprocess N has terminated, its report is
> > still queued and "wait" retrieves it. In the second case, if the
> > subprocess N has terminated, it doesn't exist and as the manual page
> > says "If id specifies a non-existent process or job, the return status
> > is 127."
> >
> > What you're pointing out is that that creates a race condition when the
> > subprocess ends before the "wait". And it seems that the kernel has
> > enough information to tell "wait -n N", "process N doesn't exist, but
> > you do have a queued termination report for it". But it's not clear
> > that there's a way to ask the kernel for that information without
> > reading all the queued termination reports (and losing the ability to
> > return them for other "wait" calls).
>
> Thanks for the response, but I don't believe this is correct.
>
> Your understanding of the wait syscall is correct except that the exit
> code and process information always remains available until the
> process is awaited by its parent -- it is the wait syscall that itself
> reaps the process and makes it unavailable to later searches by pid.
> There is a possibility that the parent (bash in this case) might reap
> the process in multiple ways (i.e., from different threads, setting
> the SIGCHLD disposition to SIG_IGN, setting flag SA_NOCLDWAIT for the
> SIGCHLD handler -- the last 2 from NOTES of man waitpid on linux) that
> race with each other, but the parent is always given an opportunity to
> read the exit code and reap the process if not disabled with SIGCHLD
> handler configuration.
>
> My understanding of bash is that it internally maintains a queue/list
> of finished child jobs to return such that wait -n mimics aspects of
> the wait syscall.
> The discussion at
> https://lists.gnu.org/archive/html/bug-bash/2023-05/msg00063.html
> supports that bash "silently" reaps child processes and decouples the
> wait syscall from the wait command.
>
> I assume it's possible to confirm that bash is awaiting the process
> and retrieving the exit code via ptrace/strace but I'm unfamiliar with
> these tools or bash logs.
>
> The test below allows the subprocess to complete normally, without
> being signaled, and then successfully retrieves its exit code via wait
> -n. This subprocess terminates before the call to wait -n. I see no
> documented reason that a process terminating without signal prior to
> wait -n should be returned while a process terminating with signal
> prior to wait -n should not.
>
> echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
> { sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
> pid=$!
> echo "child proc $pid @${SECONDS}"
>
> sleep 2
> wait -n $pid
> echo "wait -n $pid return code $? @${SECONDS}"
>
>
> For which I get output:
> TEST: EXIT 0 PRIOR TO wait -n @0
> child proc 2270 @0
> child finishing @1
> wait -n 2270 return code 1 @2
>
>
> Steve