OK, after wading through the code for a while, I have a working theory:

1) Parent creates a child
2) Parent gets graceful-restart signal
3) Parent returns from ap_run_mpm, pconf is cleared, cross-process lock file
   is closed and removed.
4) Child finally gets scheduled to run the apr_proc_mutex_child_init for
   fcntl(). Oops, apr_file_open fails since step #3 above removed the file.
   Child errors out (ENOENT is returned from apr_file_open()) and dies.
5) Parent notices that child has died, errors out and dies completely.
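
To make the sequence above concrete, here is a minimal standalone sketch
of steps 1, 3 and 4 using plain POSIX calls. It is not httpd or APR code:
the helper name child_open_errno_after_unlink() and the pipe used to force
the unlucky scheduling order are assumptions made purely for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of httpd/APR). Returns the errno the child
 * sees when it opens `path` after the parent has removed it, 0 if the open
 * unexpectedly succeeds, or -1 if the demonstration could not be set up. */
int child_open_errno_after_unlink(const char *path)
{
    int go[2];
    int fd, status;
    pid_t pid;

    /* Parent creates the lock file, as it would before forking children. */
    fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    if (pipe(go) < 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    pid = fork();                          /* step 1: parent creates a child */
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        close(fd);
        unlink(path);
        return -1;
    }

    if (pid == 0) {
        char c;
        int cfd;

        close(go[1]);
        /* Stand-in for "child has not been scheduled yet": block until the
         * parent has finished removing the file. */
        if (read(go[0], &c, 1) != 1)
            _exit(255);
        cfd = open(path, O_WRONLY);        /* step 4: reopen by name */
        _exit(cfd < 0 ? errno : 0);        /* ENOENT fits in an exit status */
    }

    close(go[0]);
    close(fd);
    unlink(path);                          /* step 3: lock file closed, removed */
    if (write(go[1], "x", 1) != 1) {
        /* Closing the pipe below still unblocks the child via EOF. */
    }
    close(go[1]);

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling child_open_errno_after_unlink("/tmp/accept.lock.demo") (any
writable scratch path will do) should hand back ENOENT, which matches the
"(2)No such file or directory" in the log below.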

Note that the 2.0 branch likely has this same problem (which will only
show up very rarely under graceful restarts while using fcntl() for the
accept mutex). Both the www.apache.org build (2.0) and the cvs.apache.org
build (2.1-dev) are seeing accept.lock.<pid> turds being left behind,
and I think it is likely that one is left behind each time we hit this
bug.

One way to recreate this might be to pummel the server with graceful
restarts while also pummeling it with requests (enough requests to get
the parent to need to create new children).

In any case, can anyone else confirm that this race condition exists, and
maybe suggest a way to synchronize a parent's shutdown with the startup
of an old-generation child? (E.g., the parent shouldn't remove the
lockfile until all children have successfully started.)
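
As a rough illustration of that kind of synchronization (again plain
POSIX rather than httpd/APR code, with a single child and a hypothetical
helper name, child_open_errno_with_handshake()), the parent could wait on
a pipe until the child reports that it has opened the lock file, and only
then remove it. This is a sketch of the idea under those assumptions, not
a proposed patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of httpd/APR). The parent delays unlink()
 * until the child has reported in. Returns 0 if the child opened the lock
 * file, the child's errno if it did not, or -1 on setup failure. */
int child_open_errno_with_handshake(const char *path)
{
    int ready[2];
    int fd, status;
    pid_t pid;
    char c;
    ssize_t n;

    fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    if (pipe(ready) < 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        close(fd);
        unlink(path);
        return -1;
    }

    if (pid == 0) {
        int cfd, err;

        close(ready[0]);
        cfd = open(path, O_WRONLY);        /* child's own open of the lock */
        err = (cfd < 0) ? errno : 0;
        /* Tell the parent we are done with the file name, pass or fail. */
        if (write(ready[1], "r", 1) != 1)
            err = 255;
        _exit(err);
    }

    close(ready[1]);
    /* Block until the child reports, or until EOF if it died early. */
    n = read(ready[0], &c, 1);
    (void)n;
    close(ready[0]);

    close(fd);
    unlink(path);                          /* now safe to remove the name */

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With the handshake in place the same call sequence should return 0
instead of ENOENT. A real fix would have to cover every child that is
still starting up, not just one.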

-aaron


On Sun, Mar 14, 2004 at 10:15:43AM -0800, Justin Erenkrantz wrote:
> This morning, when we did a graceful to the httpd serving cvs.apache.org 
> (which runs HEAD not APACHE_2_0_BRANCH), it failed and gave us:
> 
> [Sun Mar 14 00:00:00 2004] [emerg] (2)No such file or directory: Couldn't 
> initialize cross-process lock in child
> [Sun Mar 14 00:00:00 2004] [emerg] (2)No such file or directory: Couldn't 
> initialize cross-process lock in child
> [Sun Mar 14 00:00:00 2004] [alert] Child 10485 returned a Fatal error... 
> server is exiting!
> 
> It subsequently brought down the entire server with it.  (That's sort of 
> bad, too.)
> 
> This error lines up with prefork.c around line 485:
> 
> status = apr_proc_mutex_child_init(&accept_mutex, ap_lock_fname, pchild);
> if (status != APR_SUCCESS) {
>  ap_log_error(APLOG_MARK, APLOG_EMERG, status, ap_server_conf,
>               "Couldn't initialize cross-process lock in child");
>  clean_child_exit(APEXIT_CHILDFATAL);
> }
> 
> We don't have a LockFile or an AcceptMutex directive, so it should be using 
> the default, which is flock() on FreeBSD.
> 
> Anyone else seen this?  Should we switch the AcceptMutex directive to 
> fcntl()?
> (If this does fail with flock(), should we just remove support for flock()?)
> 
> Thanks!  -- justin
> 
