I came across an interesting bug caused by the way ctwm sets up signals,
the POSIX description of SIGCHLD and wait(2), and a common way of
starting a window manager. It's at
and any followups can be read from there, but I'll quote the whole
message here.

Can we avoid doing `signal(SIGCHLD, SIG_IGN)`?
It is done in signals.c:setup_signal_handlers().

>From: Taylor R Campbell <campbell+netbsd-tech-userle...@mumble.net>
>To: tech-userle...@netbsd.org
>Subject: system(3) semantics when SIGCHLD is SIG_IGN'd
>Date: Sat, 12 Aug 2023 11:58:36 +0000

What should system(3) do when the signal action for SIGCHLD is
set to SIG_IGN?

Setting SIGCHLD to SIG_IGN has the effect of reaping zombie children
automatically, so that calling wait(2) is unnecessary to reap them --
and, further, wait(2) doesn't return _at all_ until the last child has
exited.

This semantics -- same as setting SA_NOCLDWAIT -- is enshrined in POSIX:

    If the calling process has SA_NOCLDWAIT set or has SIGCHLD set to
    SIG_IGN, and the process has no unwaited for children that were
    transformed into zombie processes, the calling thread will block
    until all of the children of the process containing the calling
    thread terminate, and wait() and waitpid() will fail and set errno
    to [ECHILD].


So if a process already has a child, and calls system(3) as it is
currently implemented in libc in ~all versions of NetBSD, system(3)
will hang indefinitely until the existing child exits.

This manifests in newer versions of ctwm which set SIGCHLD to SIG_IGN
if you have a .xsession file that does something like:

        xterm &
        xclock &
        exec ctwm

This causes ctwm to start with two children already, which in turn
causes system(3) to hang when you try to start an application from the
ctwm menu.

The ctwm hang led to PR kern/57527 (https://gnats.netbsd.org/57527,
`kern' because at first it looked like a missing wakeup in the kernel
before we realized this is exactly how POSIX expects SIG_IGN and
SA_NOCLDWAIT to behave), which has some litmus tests for the semantics
and draft code to mitigate the situation in system(3).

So, should we do anything about this in system(3)?

Pro:
- Makes existing code like ctwm work.

Con:
- POSIX doesn't ask system(3) to work when SIGCHLD is set to SIG_IGN
  or when it has SA_NOCLDWAIT set, so this code is nonportable anyway;
  might break on other systems too, so breakage on NetBSD leading to
  an upstream bug report is helpful.
- Changing signal actions has the side effect of clearing the signal
  queue, and I don't see a way around that.

Alternative would be to say: don't do that; fix the buggy code that
calls system(3) with SIGCHLD ignored, either by having it set a signal
handler that calls waitpid(-1, NULL, WNOHANG) until success, or by
having it use something other than system(3).

