Raphael Quinet wrote:
> 
> On Thu, 11 May 2000, Michael Natterer <[EMAIL PROTECTED]> wrote:
> >
> > We don't use SA_NODEFER any more. And AFAIK the delivery of SIGCHLD has
> > nothing to do with cleaning up zombies. This is why we loop around
> > waitpid() because POSIX explicitly says that signals arriving close
> > together may be merged by the kernel.
> 
> The delivery of SIGCHLD does not clean up the zombies: this is done by
> calling waitpid() or any wait*() function.  But the call to waitpid()
> will be triggered by a SIGCHLD, and there is a potential race
> condition.  I will try to explain this in a slightly different way:
> regardless of the setting of SA_NODEFER, you cannot ensure that you
> will get one SIGCHLD for every dead process.  Some kernels merge the
> signals that arrive close together, and some of them mask the signal
> while the corresponding handler is being called (some systems will
> re-deliver the masked signal immediately after the handler returns,
> but this is not always the case).

Yeah, my confusing sentence above :) was meant to say: the number of
SIGCHLDs arriving doesn't have to correspond to the number of zombies
we have to clean up. Of course just catching SIGCHLD doesn't touch the
zombies at all.

> Now, consider the following scenario for a race condition:
>   * The Gimp starts two plug-ins
>   * plug-in 1 dies
>   * SIGCHLD delivered, handler called
>   * inside the signal handler:
>   * - waitpid() returns the status of the first process
>   * - waitpid() returns 0 and the while() loop ends
>   * plug-in 2 dies (before the signal handler returns)
>   * SIGCHLD cannot be delivered
>   * - the signal handler returns
>   * The Gimp continues and the status of second plug-in is never
>     collected, so this creates a zombie.
> Although it is rather unlikely that a context switch happens in the
> signal handler just after the while() loop and before returning, it
> can and will happen.
> 
> If you are sure that all kernels of the systems that Gimp supports
> will re-deliver the second SIGCHLD signal immediately after the signal
> handler returns, then there is no problem in the scenario described
> above: the handler will be called a second time and will collect the
> status of the second plug-in.  But I do not think so; that's why I
> suggested to call waitpid() outside the signal handler, because then
> you avoid the race condition.

Hm. I don't think that this is really a problem. The scenario above is,
as you say, unlikely but will definitely happen sooner or later. But _if_
it happens we just have a stale zombie hanging around which should be
cleaned up the next time the handler is called.
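
For what it's worth, the flag-plus-main-loop variant you suggest would
look roughly like this (a sketch with made-up names, not actual GIMP
code):

#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>

static volatile sig_atomic_t child_died = 0;

/* async-signal-safe: only record that at least one child exited */
static void
sigchld_handler (int sig_num)
{
  child_died = 1;
}

/* called from the main loop, not from the signal handler, so a
 * SIGCHLD that gets lost while the handler is running only delays
 * the cleanup until the next pass through here */
static void
reap_children (void)
{
  pid_t pid;
  int   status;

  if (! child_died)
    return;

  child_died = 0;

  while ((pid = waitpid (-1, &status, WNOHANG)) > 0)
    {
      /* update the status of the plug-in with this pid */
    }
}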

> > > Yes...  I wrote a few months ago that I would change that and implement
> > > some kind of --enable-stack-trace option, but I never took the time to
> > > do it.
> >
> > Now it's there :) We just have to convert the remaining g_print() to write()
> > and the handler will be totally safe if enable_stack_trace == FALSE.
> 
> Hmmm...  I don't know how you have implemented it (I cannot look at
> the CVS code), but I was thinking of having more options for
> --enable-stack-trace, both for the configure option and for the
> command-line option:
> - "never": disable the stack trace and use the default signal handlers
> - "query": ask the user, as we are doing now.
> - "always": always call the debugger without asking any question (useful
>           if stdin does not work, but stdout is still OK)
> - "wait": always wait 30 seconds, without asking any question (this does
>           not use stdin or stdout and gives some time to the user if she
>           wants to attach a debugger to the process)
> If the code is compiled or run with the "never" option, then the
> signal handlers would never be installed.  The configure option would
> set the default value for this flag, but it would always be possible
> to override it with the command-line option without recompiling.  The
> default for the code in CVS and for the unstable tarballs should be
> "query", and the default for the stable releases could be "never" or
> "wait".

Wow, this is a bit more than what I was thinking about. I think this
would be really useful. Still willing to hack it if someone kicks you ?-)
Ok, kick, kick, kick...
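
Just to make the idea concrete, I imagine something along these lines
(enum and variable names made up, and the actual query/debugger code is
left out):

#include <unistd.h>

typedef enum
{
  STACK_TRACE_NEVER,
  STACK_TRACE_QUERY,
  STACK_TRACE_ALWAYS,
  STACK_TRACE_WAIT
} StackTraceMode;

/* configure sets the compiled-in default, the command-line
 * option overrides it at run time */
static StackTraceMode stack_trace_mode = STACK_TRACE_QUERY;

static void
fatal_error_handler (int sig_num)
{
  switch (stack_trace_mode)
    {
    case STACK_TRACE_QUERY:
      /* ask on stdin/stdout whether to run the debugger */
      break;

    case STACK_TRACE_ALWAYS:
      /* fork and exec the debugger without asking */
      break;

    case STACK_TRACE_WAIT:
      /* no stdin/stdout needed: just give the user time
       * to attach a debugger to this process */
      sleep (30);
      break;

    case STACK_TRACE_NEVER:
      /* with "never" the handlers are not installed at all,
       * so we should not get here */
      break;
    }
}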

> > We should probably re-add the reentrancy guards in the fatal handlers and
> > just do a brute force exit() if it's called recursively (which can only
> > happen during the stack trace because that's the only case where the signals
> > are unblocked in the handler).
> 
> Another option would be to set the signals to SIG_DFL as soon as the
> fatal handler is called.  Since the default behavior for these signals
> is to kill the program, this would prevent the endless loop without
> requiring a special flag.

Yep, this sounds more useful than the current state (which just unblocks
the signals in the handler when the stack trace is run, and causes the
endless loop if another signal arrives during, and probably because of,
the trace).
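
So at the top of the fatal handler we would do something like this
(a sketch; the exact signal list depends on what we install handlers
for):

#include <signal.h>

static void
fatal_error_handler (int sig_num)
{
  /* Reset to the default action first: if we crash again while
   * printing the message or producing the stack trace, the second
   * signal simply kills the process instead of re-entering this
   * handler forever.
   */
  signal (SIGSEGV, SIG_DFL);
  signal (SIGBUS,  SIG_DFL);
  signal (SIGFPE,  SIG_DFL);

  /* ... print the message and run the stack trace ... */
}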

> On Fri, 12 May 2000, Michael Natterer <[EMAIL PROTECTED]> wrote:
> [...]
> > So ignoring SIGPIPE in the app causes it to be ignored in children, too.
> > To avoid the need of resetting the signal we could just connect it
> > to a dummy handler and let exec() do its job of resetting it.
> 
> Well, as I wrote above and in a previous mail, there is a convenient
> SIG_DFL that exits for that purpose, so you do not have to install a
> dummy handler.

The problem is that SIGPIPE was never delivered (at least on many machines).
Installing a dummy handler would ensure that it is _really_ never delivered
because the current code relies on it. And it would ensure that the signal's
disposition is reset to SIG_DFL on exec().

The whole point of installing dummy handlers is to avoid the need to reset
signals to SIG_DFL before an exec().
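
Concretely I mean something like this (a sketch, function names made up):

#include <signal.h>

/* Does nothing: the parent effectively ignores SIGPIPE, but unlike
 * SIG_IGN a real handler is reset to SIG_DFL across exec(), so the
 * plug-ins start out with the default disposition again.
 */
static void
sigpipe_handler (int sig_num)
{
}

static void
install_sigpipe_handler (void)
{
  signal (SIGPIPE, sigpipe_handler);
}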

bye,
--Mitch
