On Fri, 13 Aug 2004 14:51:23 +0100, Joe Orton <[EMAIL PROTECTED]> wrote:
> The 2.0 ap_reclaim_child_processes logic seems to be broken - it never
> resets the waittime variable as it did in 1.3; so the parent will wait
> for up to 23 minutes (sic) in total for a stuck child process. (SIGSTOP
> a child and strace the parent to see for yourself)
>
> This updates the logic to be a little more sane:
>
> - at t + 16, 82, 344 ms, just waitpid()
> - at t + 425, 688, 1736 ms, waitpid() else SIGTERM the child
> - at t + 1.74 secs, waitpid() else SIGKILL the child
> - at t + 1.75, 1.82 secs, just waitpid()
> - at t + 2.08 secs, waitpid() else log "this child won't die"
>
> Any comments?

Here is my take on what is wrong with the current code:

1) It starts complaining a bit too soon. Some third-party modules have
rather complicated child exit strategies. Whether that is good or bad
(bad ;) ), it results in disturbing messages that wouldn't have appeared
if we were a little more patient (2-3 seconds). Also, I suspect that the
use of a threaded MPM affects how quickly the children exit now on Unix.

2) It should never check for exited processes less often than every 1-2
seconds, even if it doesn't complain to the error log that often.

Like you say, the current code can wait a VERY long time for child
processes to exit. In practice, I see that it can wait a VERY long time
even after the last child has exited. I agree that it should never wait
so long, though I think around 15 or so seconds total is reasonable.
Exiting before the children are gone doesn't let Apache start up any
more quickly; it just prevents potentially useful timing information
from getting logged to the error log.

--/--

I wouldn't complain to the error log at all until it has been 2 seconds,
and then I'd still wait around for 10-15 more. But it has to check every
second so that it finds out soon after all children have exited and
doesn't sleep needlessly.
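
To make the timing above concrete, here is a rough sketch in C. It is
not a patch against ap_reclaim_child_processes, and every name in it
(reclaim_decide, reclaim_children, QUIET_SECS, TOTAL_SECS) is made up
for illustration. It assumes a 2 second quiet period and a 15 second
total budget, polls once per second with a non-blocking waitpid(), and
leaves out the question of when to send SIGTERM or SIGKILL.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical values taken from the numbers discussed above. */
#define QUIET_SECS  2   /* say nothing in the error log before this */
#define TOTAL_SECS 15   /* stop waiting altogether after this */

enum reclaim_action {
    RECLAIM_DONE,       /* no children left: return right away */
    RECLAIM_WAIT,       /* children left, still in the quiet period */
    RECLAIM_COMPLAIN,   /* children left, worth logging */
    RECLAIM_GIVE_UP     /* total budget used up */
};

/* Pure decision step: what to do after a poll at elapsed_secs. */
enum reclaim_action reclaim_decide(int elapsed_secs, int children_left)
{
    assert(children_left >= 0);
    if (children_left == 0)
        return RECLAIM_DONE;
    if (elapsed_secs >= TOTAL_SECS)
        return RECLAIM_GIVE_UP;
    if (elapsed_secs >= QUIET_SECS)
        return RECLAIM_COMPLAIN;
    return RECLAIM_WAIT;
}

/* Reap children that have already exited; return how many remain. */
static int reap_exited(int children_left)
{
    while (children_left > 0) {
        pid_t pid = waitpid(-1, NULL, WNOHANG);
        if (pid > 0)
            children_left--;            /* reaped one child */
        else if (pid == -1 && errno == ECHILD)
            children_left = 0;          /* kernel says none are left */
        else
            break;                      /* the rest are still running */
    }
    return children_left;
}

/* Poll once per second; returns the number of children never reaped. */
int reclaim_children(int children_left)
{
    int elapsed;
    int complained = 0;

    for (elapsed = 0; ; elapsed++) {
        children_left = reap_exited(children_left);
        switch (reclaim_decide(elapsed, children_left)) {
        case RECLAIM_DONE:
            return 0;
        case RECLAIM_GIVE_UP:
            fprintf(stderr, "%d child(ren) still alive after %d secs\n",
                    children_left, elapsed);
            return children_left;
        case RECLAIM_COMPLAIN:
            if (!complained) {          /* check often, log rarely */
                fprintf(stderr, "still waiting for %d child(ren)\n",
                        children_left);
                complained = 1;
            }
            break;
        case RECLAIM_WAIT:
            break;
        }
        sleep(1);
    }
}
```

Because the check for "no children left" happens before any sleep, the
parent returns within about a second of the last child exiting, however
much of the budget remains.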
