On Sat, Apr 07, 2007 at 12:15:07PM +0100, Brian Candler wrote:
> I have a question about the semantics of wait()/waitpid().
> 
> My understanding is, as soon as wait() returns, the process is gone from the
> process table, and therefore another fork() on the system could immediately
> re-use the same PID. Is that correct?
> 
> Now let's suppose I have a program which forks children when it needs them.
> It maintains a data structure which is a hash of { pid => info }
> 
> Let's say there's a separate thread which blocks on a wait() call, and once
> it has gotten the pid it updates this data structure to remove the entry for
> <pid>
> 
> Now, it seems to me there is a race condition here: between wait() returning
> and the <pid> entry being removed from the data structure, the main program
> may have forked off another child with the same <pid>

If you've got that problem, you already have other problems. Forking and
reaping children from separate threads that share data structures is
asking for trouble unless you synchronize. If you arrange things so that
at any moment you're either spawning a child OR reaping a child (never
BOTH), then the pid reuse isn't a problem anyway. And you do need to
solve that, or you're going to end up with mangled data.
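
A minimal sketch of that "spawn OR reap, never both" rule, assuming
POSIX threads and a non-blocking waitpid(). The names spawn_child(),
reap_one(), tracked_children() and example_work(), and the fixed-size
table standing in for the { pid => info } hash, are hypothetical and
only for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 64

/* Hypothetical stand-in for the { pid => info } hash. */
struct child_info { pid_t pid; int in_use; };

static struct child_info children[MAX_CHILDREN];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Placeholder for whatever the child is supposed to do. */
void example_work(void) { }

/* fork() and the table insert happen under one lock, so a reaper can
 * never see the table between the two steps. Returns the child's pid,
 * or -1 if the table is full or fork() failed. */
pid_t spawn_child(void (*work)(void))
{
    pid_t pid = -1;

    pthread_mutex_lock(&table_lock);
    for (int i = 0; i < MAX_CHILDREN; i++) {
        if (children[i].in_use)
            continue;
        pid = fork();
        if (pid == 0) {          /* child: never touches table_lock */
            work();
            _exit(0);
        }
        if (pid > 0) {           /* parent: record the new child */
            children[i].pid = pid;
            children[i].in_use = 1;
        }
        break;
    }
    pthread_mutex_unlock(&table_lock);
    return pid;
}

/* Reap at most one exited child without blocking. waitpid() and the
 * table removal happen under the same lock as spawn_child(), so a
 * recycled pid cannot be inserted in between. Returns the reaped pid,
 * 0 if no child has exited yet, or -1 on error (e.g. no children). */
pid_t reap_one(void)
{
    int status;

    pthread_mutex_lock(&table_lock);
    pid_t pid = waitpid(-1, &status, WNOHANG);
    if (pid > 0) {
        for (int i = 0; i < MAX_CHILDREN; i++) {
            if (children[i].in_use && children[i].pid == pid) {
                children[i].in_use = 0;
                break;
            }
        }
    }
    pthread_mutex_unlock(&table_lock);
    return pid;
}

/* Number of children currently recorded in the table. */
int tracked_children(void)
{
    int n = 0;

    pthread_mutex_lock(&table_lock);
    for (int i = 0; i < MAX_CHILDREN; i++)
        n += children[i].in_use;
    pthread_mutex_unlock(&table_lock);
    return n;
}
```

Because reap_one() never blocks while holding the lock, the reaping
thread has to call it in a loop with a short sleep, or in response to
SIGCHLD. That makes it a variant of the polling approach, trading some
latency for simple locking.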

> Protecting the 'wait' and 'fork' threads with a mutex doesn't help. If I
> lock the mutex before calling wait() then I prevent all forks for an
> indefinite period of time; if I lock the mutex after calling wait() then the
> race still exists, as the forking thread may already have the mutex and be
> in the process of forking another child with the same pid.
> 
> So, what's the best way to handle this? Options I can think of are:
> 
> (1) Polling.
> 
> - lock mutex
> - call waitpid(-1, 0, WNOHANG)
> - update the data structure
> - unlock mutex
> - sleep 100ms
> - go back to start
> 
> This seems rather icky.
> 
> (2) Modify the data structure to allow for the unlikely, but possible,
> situation of having two processes with the same PID: one which has just been
> reaped, and one which has just been forked. The reap process then removes
> the first entry for the PID returned from wait().
> 
> This gives a messy data structure just for handling this edge case.
> 
> (3) If there were an option to waitpid() which could tell you the pid of a
> terminated process *without* reaping it, then it becomes easy:
> 
> - waitpid(-1, 0, WNOWAIT)
> - update the data structure to remove the entry for this pid
> - waitpid(pid, 0, 0) to remove it from the process table
> 
> It looks like Linux has a waitid() call with a WNOWAIT option, but I can't
> see anything in the wait manpage for OpenBSD (4.0) which works this way.
> 
> Any other suggestions as to the best way to avoid this problem? I'm sure
> this must be old ground :-)
> 
> Thanks,
> 
> Brian.

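For reference, option (3) in the quoted message could look roughly like
the sketch below. It assumes a platform that provides the POSIX waitid()
call with WNOWAIT (the quoted message mentions Linux; it also notes that
the OpenBSD 4.0 manpage shows nothing equivalent). The names
peek_exited_child(), reap_after_bookkeeping(), forget_child() and
last_forgotten are hypothetical:

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical bookkeeping hook: a real program would remove the
 * { pid => info } entry here. This one only remembers the pid. */
static pid_t last_forgotten = -1;

void forget_child(pid_t pid)
{
    last_forgotten = pid;
}

/* Block until some child has exited and return its pid, but leave it
 * as a zombie (WNOWAIT) so the kernel cannot hand the pid out again.
 * Returns -1 on error, e.g. when there are no children. */
pid_t peek_exited_child(void)
{
    siginfo_t info;

    info.si_pid = 0;
    if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT) == -1)
        return -1;
    return info.si_pid;
}

/* Peek, update the bookkeeping, then reap for real. The pid stays
 * reserved by the zombie until the final waitpid() returns. */
pid_t reap_after_bookkeeping(void (*forget)(pid_t))
{
    pid_t pid = peek_exited_child();

    if (pid <= 0)
        return pid;
    forget(pid);
    if (waitpid(pid, NULL, 0) == -1)
        return -1;
    return pid;
}
```

The table that forget() updates still needs its own locking against the
forking thread; what this ordering removes is the window in which a
stale entry and a recycled pid can coexist.
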
-- 
Darrin Chandler            |  Phoenix BSD User Group  |  MetaBUG
[EMAIL PROTECTED]   |  http://phxbug.org/      |  http://metabug.org/
http://www.stilyagin.com/  |  Daemons in the Desert   |  Global BUG Federation
