This is work in progress.

This fixes an issue that, very rarely yet often enough to be a real
nuisance, causes a deadlock in the child in fork: when the child tries to
unlock ss' critical section lock at the end of fork, it finds that the
global sigstate is already locked and never will be unlocked.  This will
typically (always?) be observed in /bin/sh, which is not surprising as
that is the foremost caller of fork.

To reproduce an intermediate state, add an endless loop if
_hurd_global_sigstate is locked after __proc_dostop (cast through
volatile); that is, while still being in the fork's parent process.
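A minimal sketch of such a debugging trap, to illustrate what "cast
through volatile" is meant to achieve.  The struct and function names
here are hypothetical stand-ins (the real lock lives in
_hurd_global_sigstate and has its own type), so this only shows the
shape of the check, not the code that was actually inserted into fork.c:

```c
/* Hypothetical stand-in for a sigstate; in this toy model the lock is a
   plain int: 0 means unlocked, nonzero means locked.  */
struct dbg_sigstate
{
  int lock;
};

/* Read the lock word through a volatile-qualified pointer so that the
   compiler has to reload it from memory on every call instead of caching
   the value in a register.  */
static int
dbg_lock_is_held (const struct dbg_sigstate *ss)
{
  return *(const volatile int *) &ss->lock != 0;
}

/* Park the calling thread for as long as the lock is observed held, so
   that a debugger can be attached and the other threads inspected.
   Returns immediately if the lock is free.  */
static void
dbg_trap_if_locked (const struct dbg_sigstate *ss)
{
  while (dbg_lock_is_held (ss))
    ;
}
```

The trap is only useful for inspection: it keeps the parent parked in
the intermediate state instead of letting fork continue.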

When that triggers (use the libtool testsuite), the signal thread has
already locked ss (which is _hurd_global_sigstate), and is stuck at
hurdsig.c:685 in post_signal, trying to lock _hurd_siglock (which the
main thread already has locked and keeps locked until after
__task_create).  This is the case where ss->thread == MACH_PORT_NULL,
that is, a global signal.  In the main thread, between __proc_dostop and
__task_create, is the __thread_abort call on the signal thread, which
would abort any current kernel operation (but leave ss locked).  Later
in fork, in the parent, when _hurd_siglock is unlocked, the parent's
signal thread can proceed and will eventually unlock the global
sigstate.  In the child, _hurd_siglock will likewise be unlocked, but
the global sigstate never will be, as the child's signal thread has been
configured to restart execution from _hurd_msgport_receive.  Thus, when
the child tries to unlock ss' critical section lock at the end of fork,
it first has to lock the global sigstate and will spin trying to do so,
which can never succeed, and we get our deadlock.
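To make the asymmetry between parent and child explicit, here is a toy
model of the lock state as described above.  It is not glibc code: the
struct, its fields, and the function names are all made up for
illustration, and the model assumes nothing beyond what the paragraph
above says (both locks held at __task_create time, and only the
parent's signal thread resuming inside post_signal).

```c
#include <stdbool.h>

/* Hypothetical snapshot of the relevant state in one task.  */
struct toy_task
{
  bool siglock_held;            /* models _hurd_siglock */
  bool global_ss_held;          /* models the global sigstate's lock */
  bool sigthread_resumes;       /* will the signal thread continue in
                                   post_signal and release its lock?  */
};

/* Parent at __task_create time: the main thread holds _hurd_siglock, the
   signal thread holds the global sigstate and waits for _hurd_siglock.  */
static struct toy_task
toy_parent_at_task_create (void)
{
  struct toy_task t = { true, true, true };
  return t;
}

/* The child gets a copy of the parent's memory, including both lock
   words, but its signal thread is restarted from the receive loop.  */
static struct toy_task
toy_child_of (struct toy_task parent)
{
  struct toy_task child = parent;
  child.sigthread_resumes = false;
  return child;
}

/* The tail of fork: _hurd_siglock is unlocked in both tasks.  The global
   sigstate is released only by a signal thread that resumes.  */
static struct toy_task
toy_finish_fork (struct toy_task t)
{
  t.siglock_held = false;
  if (t.sigthread_resumes)
    t.global_ss_held = false;
  return t;
}

/* Leaving the critical section needs the global sigstate's lock.  */
static bool
toy_would_deadlock (struct toy_task t)
{
  return t.global_ss_held;
}
```

In this model only the child ends up with a lock that has no owner left
to release it, which matches the observation that the parent is fine.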

(Incomplete) options seem to be:

  * Move the locking of _hurd_siglock earlier in post_signal -- but that
    may generally impact performance, if this locking isn't generally
    needed anyway?

    On the other hand, would it actually make sense to wait here until we
    are no longer in a critical section (which is meant to disable
    signal delivery anyway (but not for preempted signals?))?

  * Clear the global sigstate in the fork's child with the rationale that
    we're anyway restarting the signal thread from a clean state.  This
    has now been implemented.
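The effect of the second option can be sketched with a stand-alone spin
lock built on C11 atomic_flag.  This is only an assumption-laden
illustration: toy_sigstate, toy_try_lock and toy_lock_init are
hypothetical names, and the real code uses glibc's own spin lock type
and __spin_lock_init rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the global sigstate; only the lock matters
   for this sketch.  */
struct toy_sigstate
{
  atomic_flag lock;
};

/* Try to take the lock once.  Returns true if we got it, false if it
   was already held (where a real spin lock would keep spinning).  */
static bool
toy_try_lock (struct toy_sigstate *ss)
{
  return !atomic_flag_test_and_set_explicit (&ss->lock,
                                             memory_order_acquire);
}

/* Put the lock back into the unlocked state no matter who held it.
   This is only safe when no other thread can still care about the
   lock, which is the rationale given for the fork child.  */
static void
toy_lock_init (struct toy_sigstate *ss)
{
  atomic_flag_clear_explicit (&ss->lock, memory_order_release);
}
```

A lock that was taken by a thread which no longer exists can never be
acquired again with toy_try_lock; after toy_lock_init it can.  That is
all the workaround relies on, and it is also why it discards whatever
state the old holder was in the middle of updating.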

Why has this problem not been observed before Jérémie's patches?  (Or has
it?  Perhaps even more rarely?)  In _S_msg_sig_post, the signal is now
posted to a *global receiver thread*, whereas previously it was posted to
the *designated signal-receiving thread*.  The latter one was in a
critical section in fork, so didn't try to handle the signal until after
leaving the critical section?  (Not completely analyzed and verified.)

Another question is what the signal is that is being received
during/around the time __proc_dostop executes.

For now, I have committed the following patch as commit
fd0bd821d522b006de9c10cb444ba878508c47e7 on top of Jérémie's patches in
Savannah glibc's t/hurdsig-global-dispositions branch.  Samuel, you may
want to propagate that into the Debian patchset.

I intend to continue working on this issue to fully understand what is
going on there -- this patch, while it seems to work fine, doesn't
exactly look like the proper fix yet.  I already have (locally) added
annotations and questions to the Hurd's signal code in glibc to be
clarified and answered.  In that process of learning that code, I also
plan to FINALLY review Jérémie's patch series touching/enhancing/fixing
the signal code in glibc, and work on getting that integrated in glibc
upstream.  Jérémie, I guess you don't have time at the moment for
collaborating there?

        * sysdeps/mach/hurd/fork.c (__fork): In the child, reinitialize
        the global sigstate's lock.

diff --git sysdeps/mach/hurd/fork.c sysdeps/mach/hurd/fork.c
index 9f11130..b89860f 100644
--- sysdeps/mach/hurd/fork.c
+++ sysdeps/mach/hurd/fork.c
@@ -635,6 +635,21 @@ __fork (void)
       ss->next = NULL;
       _hurd_sigstates = ss;
       __mutex_unlock (&_hurd_siglock);
+      /* Earlier on, the global sigstate may have been tainted and now needs to
+         be reinitialized.  Nobody is interested in its present state anymore:
+         we're not, the signal thread will be restarted, and there are no other
+         threads.
+         We can't simply allocate a fresh global sigstate here, as
+         _hurd_thread_sigstate will call malloc and that will deadlock trying
+         to determine the current thread's sigstate.  */
+#if 0
+      _hurd_thread_sigstate_init (_hurd_global_sigstate, MACH_PORT_NULL);
+#else
+      /* Only reinitialize the lock -- otherwise we might have to do additional
+         setup as done in hurdsig.c:_hurdsig_init.  */
+      __spin_lock_init (&_hurd_global_sigstate->lock);
+#endif
       /* We are one of the (exactly) two threads in this new task, we
         will take the task-global signals.  */

I have not yet checked whether posix_spawn has a similar issue.

