I'd like to comment further... Not only is a disturbing message sent to the error log, but a SIGTERM is also sent to the child process. If I understand correctly, the SIGTERM will likely interrupt any properly implemented child process shutdown, and the child process will exit ungracefully. If it's acceptable to wait longer, then the kill call should also be postponed to give modules a chance to clean up gracefully. If any module has complex IPC or mutexes in use, graceful shutdown is important, especially if MaxRequestsPerChild is in use on a server with heavy load.

-Noah

-----Original Message-----
From: Jeff Trawick [mailto:[EMAIL PROTECTED]
Sent: Friday, August 13, 2004 10:27 AM
To: [EMAIL PROTECTED]
Subject: Re: [PATCH] fix child reclaim timing

On Fri, 13 Aug 2004 14:51:23 +0100, Joe Orton <[EMAIL PROTECTED]> wrote:
> The 2.0 ap_reclaim_child_processes logic seems to be broken - it never
> resets the waittime variable as it did in 1.3; so the parent will wait
> for up to 23 minutes (sic) in total for a stuck child process. (SIGSTOP
> a child and strace the parent to see for yourself)
>
> This updates the logic to be a little more sane:
>
> - at t + 16, 82, 344 ms, just waitpid()
> - at t + 425, 688, 1736 ms, waitpid() else SIGTERM the child
> - at t + 1.74 secs, waitpid() else SIGKILL the child
> - at t + 1.75, 1.82 secs, just waitpid()
> - at t + 2.08 secs, waitpid() else log "this child won't die"
>
> Any comments?

Here is my take on what is wrong with current code:

1) It starts complaining a bit too soon. Some third-party modules have rather complicated child exit strategies. Whether or not that is good or bad (bad ;) ), it results in disturbing messages that wouldn't have appeared if we were a little more patient (2-3 seconds). Also, I suspect that the use of threaded MPM affects how quickly the children are exiting now on Unix.

2) It should never check for exited processes less often than every 1-2 seconds, even if it doesn't complain to the error log that often. Like you say, current code can wait a VERY long time for child processes to exit. In practice, I see that it can wait a VERY long time even after the last child has exited.

I'll agree that it should never wait so long, though I think around 15 or so seconds total is reasonable. Exiting before children are gone doesn't let Apache start up any more quickly; it just prevents potentially-useful information about timing from getting logged to the error log.

--/--

I wouldn't complain to the error log at all until it has been 2 seconds, and then I'd still wait around for 10-15 more. But it has to check every second so it finds out soon after all children have exited and doesn't sleep needlessly.
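
The timing suggested above (stay quiet for 2 seconds, poll every second, stop after about 15 seconds total) can be sketched in plain POSIX C. This is only an illustration under stated assumptions, not the patch under discussion: `step_for`, `reclaim_children`, and the three thresholds are hypothetical names and values, and the sketch calls waitpid() and kill() directly rather than going through the APR calls httpd uses. Following the comment at the top of this message, SIGTERM is postponed along with the first log message.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical thresholds, loosely based on the numbers in this thread. */
enum {
    POLL_SECS    = 1,   /* look for exited children once a second */
    QUIET_SECS   = 2,   /* no log message and no SIGTERM before this */
    GIVE_UP_SECS = 15   /* total wait before SIGKILL and a final message */
};

typedef enum { RECLAIM_WAIT, RECLAIM_TERM, RECLAIM_KILL } reclaim_step;

/* Decide what to do with a still-running child after `elapsed` seconds. */
static reclaim_step step_for(int elapsed)
{
    if (elapsed >= GIVE_UP_SECS)
        return RECLAIM_KILL;
    /* Only the first poll at or after QUIET_SECS sends SIGTERM. */
    if (elapsed >= QUIET_SECS && elapsed < QUIET_SECS + POLL_SECS)
        return RECLAIM_TERM;
    return RECLAIM_WAIT;
}

/* Poll `n` children until all are gone or GIVE_UP_SECS have passed.
 * Entries in `pids` are set to 0 once reaped. Returns how many
 * children were still running when we gave up. */
static int reclaim_children(pid_t *pids, int n)
{
    int elapsed, i, alive = 0;

    assert(n >= 0);
    for (elapsed = 0; ; elapsed += POLL_SECS) {
        reclaim_step step = step_for(elapsed);

        alive = 0;
        for (i = 0; i < n; i++) {
            if (pids[i] <= 0)
                continue;                  /* already reaped */
            if (waitpid(pids[i], NULL, WNOHANG) != 0) {
                pids[i] = 0;               /* exited, or not our child */
                continue;
            }
            alive++;
            if (step == RECLAIM_TERM) {
                fprintf(stderr, "child %ld still running, sending SIGTERM\n",
                        (long)pids[i]);
                kill(pids[i], SIGTERM);
            }
            else if (step == RECLAIM_KILL) {
                fprintf(stderr, "child %ld won't exit, sending SIGKILL\n",
                        (long)pids[i]);
                kill(pids[i], SIGKILL);
            }
        }
        if (alive == 0 || step == RECLAIM_KILL)
            return alive;                  /* return as soon as all are gone */
        sleep(POLL_SECS);                  /* may return early on a signal */
    }
}
```

Because the loop returns as soon as the last child is reaped, the parent never sleeps past the point where all children are gone. The sketch keeps time by counting polls, so `elapsed` is only approximate if sleep() is interrupted, and a fuller version would waitpid() once more after the SIGKILL to reap the child.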
