Re: `wait -n` returns 127 when it shouldn't

Robert Elz Thu, 18 May 2023 04:52:44 -0700

    Date:        Thu, 18 May 2023 14:16:17 +1000
    From:        Martin D Kealey <[email protected]>
    Message-ID:  <CAN_U6MWctyFHU0CsKBgGbYgcqXEtOD=4vpiygjx2hvky6bs...@mail.gmail.com>


  | I know that some platforms (used to?) lack all of the 'waitpid()',

This is irrelevant to the issue at hand (and in general, for shells, is
irrelevant anyway, as shells usually clean up the process table as soon as
possible, always waiting for anything).   Lacking anything more than a
simple wait() can be problematic, as that hangs, which isn't always desired,
but in combination with SIGCHLD (as abominable as that signal is defined to
work on some systems) it can be made to function.

But that is not relevant here: the script is just doing wait -n (no
specific pid requested) and hence there's no need for anything fancy in
terms of the wait sys call used.

  | If there is silent reaping going on (other than 'wait -n' or 'trap ...
  | SIGCHLD')

In practice, there always is, in all shells.

  | shouldn't the exit status and pid of each silently reaped process
  | be retained in a queue that 'wait -n' can extract from,

Yes, that is what is supposed to happen.   And does.   The question is
when jobs are removed from that queue.

  | Would you care to speculate more precisely on where such silent reaping may
  | occur, given the code as shown?

Apparently, in bash, if the code is running in a (shell) loop (a while
loop, or similar) then on each iteration of the loop, any jobs that have
exited, but not yet been cleaned up, are removed from the queue (the
jobs table in practice, though bash may also have something else).

That's really broken, and should be fixed (but has apparently been that
way for decades, and no-one noticed).

The intent is to avoid the queue growing infinitely big in the case of
loops like

        while :; do process& maybe other code but not doing wait; done

Note this does not need to be a very speedy loop, just one that runs
forever, and never cleans anything up.   That's broken, but in old shell
scripts, hard to avoid, as the only cleanup method was a simple "wait" which
would wait until all background processes completed, defeating the purpose.

In the script in question, the offending loop isn't the one in the main
program (there, the background processes are started, and waited for,
within each iteration) but the one in the waitjobs function, which
appears (at first glance, which is all the analysis shells ever do)
to be an infinite loop, so each time around, if there are any completed
jobs in the table, they're removed.   Then, if nothing is still running,
wait -n returns 127, and we exit.   If we're lucky, we get to the wait -n
before the false job finishes, and wait -n collects that one (what happens
to the background true is completely irrelevant to this script), and
everything iterates.   If we're unlucky, false has already completed, and
its status is lost, before we get a chance to wait for it.
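
For contrast, here is a minimal sketch of collecting the same two statuses
by PID rather than with wait -n.  This is not the script from the thread
(which is not reproduced in this message); the variable names are made up
for illustration.  It assumes, as POSIX asks of shells, that the status of a
background job whose PID was obtained via $! is remembered until it is
waited for, so the result should not depend on when the children exit.

```shell
#!/bin/sh
# Hypothetical sketch: wait for each child by PID instead of using wait -n.
# pid_true, pid_false, status_true and status_false are illustrative names.

true &
pid_true=$!
false &
pid_false=$!

# Deliberately give both children time to exit before we wait for them,
# i.e. the "unlucky" timing described above.
sleep 1

# Waiting on an explicit PID asks the shell for that child's saved status.
wait "$pid_true"
status_true=$?
wait "$pid_false"
status_false=$?

echo "true exited with $status_true, false exited with $status_false"
```

With the statuses requested by PID, the sketch should report 0 for true and
1 for false even though both have long since finished.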

Simply broken.

What bash should be doing is limiting the number of jobs that can be in
the jobs table (to perhaps a few hundred) - deleting the oldest completed
ones if more jobs need to be added.   That's allowed, solves the infinite
new job problem, and allows sane programs that do wait for their children
to avoid this kind of issue.
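
The bounded-table idea can be sketched in a few lines of shell.  This is a
toy model only, not bash's implementation: it tracks just completed jobs
(a real jobs table also holds running ones), the limit of 3 stands in for
"a few hundred", and the names MAX_DONE, done_queue, remember and
lookup_status are invented for the illustration.

```shell
#!/bin/sh
# Toy model of a bounded table of completed jobs (all names hypothetical).

MAX_DONE=3        # stand-in for "perhaps a few hundred"
done_queue=""     # space-separated "pid:status" entries, oldest first

# remember PID STATUS: append an entry, evicting the oldest if over the limit.
remember() {
    done_queue="$done_queue $1:$2"
    set -- $done_queue            # split the entries into positional params
    while [ "$#" -gt "$MAX_DONE" ]; do
        shift                     # drop the oldest completed entry
    done
    done_queue="$*"
}

# lookup_status PID: print the saved status, or return 127 if it is unknown.
lookup_status() {
    for entry in $done_queue; do
        case $entry in
            "$1":*) echo "${entry#*:}"; return 0 ;;
        esac
    done
    return 127
}

# Four jobs complete; only the three most recent statuses are retained.
remember 101 0
remember 102 1
remember 103 0
remember 104 2
echo "retained: $done_queue"
```

In this model nothing is discarded merely because a loop went around again;
an entry is lost only when the table is full and a newer one needs the slot,
so a script that waits for its children promptly never loses a status.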

  | PS: I'm not convinced that 'trap ... SIGCHLD' needs to be in that list;

No, shell level SIGCHLD traps are irrelevant.    The semantics of SIGCHLD
mean that those traps can't rationally be mapped directly from SIGCHLD
signals; the signals are hopeless and need to be handled specially by the
shell (or always kept at SIG_DFL so they never occur) or things fail badly.

kre
