Ok, after wading through the code for a while I have a working theory:
1) Parent creates a child.
2) Parent gets graceful-restart signal.
3) Parent returns from ap_run_mpm, pconf is cleared, cross-process lock file is closed and removed.
4) Child finally gets scheduled to run the apr_proc_mutex_child_init for fcntl(). Oops, apr_file_open fails since step #3 above removed the file. Child errors out (ENOENT is returned from apr_file_open()) and dies.
5) Parent notices that child has died, errors out and dies completely.
sounds very possible
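For what it's worth, the core of steps 3 and 4 can be shown in isolation. This is only a sketch using plain POSIX calls rather than APR, so it says nothing about whether httpd actually hits this ordering; the function name and the temporary path are made up for illustration. It creates a file, removes it the way the parent's cleanup would, and then tries to open it without creating it, the way a late-scheduled child would.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper, not httpd or APR code.
 * Returns the errno seen when the "child" reopens the lock file after the
 * "parent" has removed it, 0 if the open unexpectedly succeeds, or -1 if
 * the setup itself fails. */
int reopen_after_parent_cleanup(void)
{
    char path[] = "/tmp/lockfile-sketch-XXXXXX"; /* illustrative path only */
    int fd = mkstemp(path);                      /* parent creates the lock file */
    if (fd < 0)
        return -1;

    /* Step 3: parent cleanup closes and removes the file. */
    close(fd);
    if (unlink(path) != 0)
        return -1;

    /* Step 4: child opens the existing file by name, without O_CREAT. */
    fd = open(path, O_WRONLY);
    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno;
}
```

On a POSIX system the return value should be ENOENT, which matches the error described in step 4.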
hopefully it is sane for the parent not to exit when a prior-generation child reports APEXIT_CHILDFATAL; but it looks like prefork checks for APEXIT_CHILDFATAL before checking whether the child belongs to the current generation
In any case, can anyone else confirm that this race condition exists, and maybe suggest a way to synchronize a parent's shutdown with the starting up of an old-generation child? (E.g., the parent shouldn't remove the lockfile until all children are successfully started.)
it shouldn't be bad to remove the lockfile at the point where that is done now, and certainly that new child of the old generation should exit ASAP anyway since it has the old config; I suspect that if the parent ignores "fatal" exits of such children we'd be okay
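A rough sketch of that idea, to make the ordering of the two checks concrete. This is not the prefork code: the function name, the parameters, and the SKETCH_CHILDFATAL constant are all stand-ins invented for this example (the real APEXIT_CHILDFATAL value lives in the httpd headers), and it assumes the parent can tell which generation the dead child belonged to.

```c
/* Stand-in for APEXIT_CHILDFATAL; the value here is an assumption made
 * only for this sketch. */
#define SKETCH_CHILDFATAL 0xf

/* Hypothetical decision helper, not httpd code.
 * Returns 1 if the parent should treat the child's exit as fatal for the
 * whole server, 0 if it should carry on. The generation check comes first,
 * so a "fatal" exit from an old-generation child is ignored. */
int parent_should_exit(int child_exit_code,
                       int child_generation,
                       int current_generation)
{
    if (child_generation != current_generation) {
        /* Old-generation child: it was going away anyway. */
        return 0;
    }
    return child_exit_code == SKETCH_CHILDFATAL;
}
```

With this ordering, a child from the previous generation that dies in child init because the lockfile is already gone would not take the parent down, while a current-generation child reporting the same code still would.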
no guesses from me on whether this race condition is what causes the problem
