On 4/7/07, Brian Candler <[EMAIL PROTECTED]> wrote:
...
Let's say there's a separate thread which blocks on a wait() call, and once
it has gotten the pid it updates this data structure to remove the entry for
<pid>.

Now, it seems to me there is a race condition here: between wait() returning
and the <pid> entry being removed from the data structure, the main program
may have forked off another child with the same <pid>.

Protecting the 'wait' and 'fork' threads with a mutex doesn't help. If I
lock the mutex before calling wait() then I prevent all forks for an
indefinite period of time; if I lock the mutex after calling wait() then the
race still exists, as the forking thread may already have the mutex and be
in the process of forking another child with the same pid.

So, what's the best way to handle this? Options I can think of are:

(1) Polling.
...
(2) Modify the data structure <...>
...
(3) If there were an option to waitpid() which could tell you the pid of a
terminated process *without* reaping it, then it becomes easy:
...
Any other suggestions as to the best way to avoid this problem? I'm sure
this must be old ground :-)

Instead of separating the obtaining of the pid from the actual
reaping, you can separate the blocking from the pid return and
reaping.  That lets you lock the data structure only when you know
wait() won't block.  To block until a child is ready to be reaped, wait
for SIGCHLD, keeping it blocked whenever you aren't ready for it, a la:

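#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <sys/wait.h>
#include <syslog.h>

volatile sig_atomic_t saw_sigchld;
sigset_t orig_sigset;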

void handle_sigchld(int sig)
{
   saw_sigchld = 1;
}

/* do this once before creating any threads, so that they're all
blocking SIGCHLD */
int init(void)
{
   struct sigaction sa;
   sigset_t set;

   sigemptyset(&sa.sa_mask);
   sa.sa_flags = 0;
   sa.sa_handler = &handle_sigchld;
   if (sigaction(SIGCHLD, &sa, NULL))
       return errno;

   sigemptyset(&set);
   sigaddset(&set, SIGCHLD);
   return pthread_sigmask(SIG_BLOCK, &set, &orig_sigset);
}


void my_wait_loop(void)
{
   pid_t pid;
   int cstat, err;

   for (;;)
   {
       while (!saw_sigchld)
       {
           sigsuspend(&orig_sigset);
       }

       saw_sigchld = 0;

       lock_the_shared_datastructure();
       do
       {
           pid = waitpid(-1, &cstat, WNOHANG);
       } while (pid < 0 && (err = errno) == EINTR);
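       if (pid > 0)
       {
           handle_exited_child(pid);
           /* SIGCHLD isn't queued, so one signal may cover several
              exits: force another pass before sleeping again */
           saw_sigchld = 1;
       }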
       else if (pid == 0 || err == ECHILD)
       {
           /* bogus SIGCHLD, just ignore it */
       }
       else
       {
           /* should not occur (EFAULT?  EINVAL?) */
           syslog(LOG_ERR, "unexpected waitpid() error: %s", strerror(err));
       }
       unlock_the_shared_datastructure();
   }
}
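
The loop above leaves lock_the_shared_datastructure(),
unlock_the_shared_datastructure() and handle_exited_child() undefined.
As a rough sketch only, assuming the shared structure is nothing more
than a small fixed-size table of pids guarded by one mutex, they might
look like the following.  MAX_CHILDREN, table_lock, children,
record_child() and is_tracked() are names made up for illustration; the
real structure is whatever your program already uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_CHILDREN 64                 /* hypothetical fixed capacity */

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static pid_t children[MAX_CHILDREN];    /* 0 marks a free slot */

void lock_the_shared_datastructure(void)
{
    pthread_mutex_lock(&table_lock);
}

void unlock_the_shared_datastructure(void)
{
    pthread_mutex_unlock(&table_lock);
}

/* Caller must hold table_lock.  Returns 0 on success, -1 if full. */
int record_child(pid_t pid)
{
    size_t i;

    assert(pid > 0);
    for (i = 0; i < MAX_CHILDREN; i++) {
        if (children[i] == 0) {
            children[i] = pid;
            return 0;
        }
    }
    return -1;
}

/* Caller must hold table_lock.  Returns 1 if pid is in the table. */
int is_tracked(pid_t pid)
{
    size_t i;

    for (i = 0; i < MAX_CHILDREN; i++) {
        if (children[i] == pid)
            return 1;
    }
    return 0;
}

/* Called from the wait loop with table_lock held: forget a reaped pid. */
void handle_exited_child(pid_t pid)
{
    size_t i;

    for (i = 0; i < MAX_CHILDREN; i++) {
        if (children[i] == pid) {
            children[i] = 0;
            return;
        }
    }
}
```

The point is only that the reaping (waitpid) and the removal of the
entry happen under the same lock that the forking side takes.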


Your fork() code should call "sigprocmask(SIG_SETMASK, &orig_sigset,
NULL);" in the _child_, since the child inherits the mask with SIGCHLD
blocked.  If the child isn't calling an exec-family function, it should
also reset SIGCHLD to SIG_DFL to avoid possible conflicts with library
calls.

Make sense?


Philip Guenther
