The 2.0 ap_reclaim_child_processes logic seems to be broken: it never
resets the waittime variable as 1.3 did, so the parent will wait
for up to 23 minutes (really) in total for a stuck child process. (SIGSTOP
a child and strace the parent to see for yourself.)
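For anyone who wants to check the 23 minute figure, here is a rough sketch of the arithmetic. It assumes the loop sleeps for waittime and then multiplies it by 4 on each of the 9 passes; the factor is inferred from the 16ms/82ms/344ms comments in the code. total_sleep_usecs is a made-up helper for illustration, not anything in httpd.

```c
#include <assert.h>

/* Hypothetical helper, not part of httpd: total time in microseconds
 * slept by a loop that starts at 16 * 1024 usecs and quadruples the
 * delay after every sleep, never resetting it (assumed behaviour). */
long long total_sleep_usecs(int tries)
{
    long long waittime = 16 * 1024;   /* initial delay, in usecs */
    long long total = 0;
    int i;

    assert(tries >= 0);
    for (i = 0; i < tries; ++i) {
        total += waittime;            /* stands in for apr_sleep(waittime) */
        waittime *= 4;                /* exponential backoff, no reset */
    }
    return total;
}
```

Under those assumptions total_sleep_usecs(9) comes to 1431650304 usecs, which is a little under 24 minutes.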
This updates the logic to be a little more sane:
- at t + 16, 82, 344 ms, just waitpid()
- at t + 360, 425, 688 ms, waitpid() else SIGTERM the child
- at t + 1.74 secs, waitpid() else SIGKILL the child
- at t + 1.75, 1.82 secs, just waitpid()
- at t + 2.08 secs, waitpid() else log "this child won't die"
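The timings in the list above can be reproduced with a small sketch. It assumes the loop starts at tries = 1 (the non-terminate case), sleeps for waittime and then multiplies it by 4 on every pass, and resets waittime to 16 * 1024 in cases 3 and 7 as the patch does. elapsed_usecs_at_try is a hypothetical helper for illustration only.

```c
#include <assert.h>

/* Hypothetical helper, not part of httpd: elapsed time in microseconds
 * when the loop reaches its switch statement on pass `tries` (1..10).
 * Assumes sleep-then-quadruple on each pass, with the patch's resets. */
long long elapsed_usecs_at_try(int tries)
{
    long long waittime = 16 * 1024;   /* initial delay, in usecs */
    long long elapsed = 0;
    int t;

    assert(tries >= 1 && tries <= 10);
    for (t = 1; t <= tries; ++t) {
        elapsed += waittime;          /* stands in for apr_sleep(waittime) */
        waittime *= 4;                /* exponential backoff */
        if (t == 3 || t == 7) {
            waittime = 16 * 1024;     /* reset added in cases 3 and 7 */
        }
    }
    return elapsed;
}
```

With these assumptions, tries 1 to 10 land at roughly 16, 82, 344, 360, 426, 688, 1737, 1753, 1819 and 2081 ms, so the parent gives up after about two seconds.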
Any comments?
Index: mpm_common.c
===================================================================
RCS file: /home/cvs/httpd-2.0/server/mpm_common.c,v
retrieving revision 1.120
diff -u -r1.120 mpm_common.c
--- mpm_common.c 15 Mar 2004 23:08:41 -0000 1.120
+++ mpm_common.c 13 Aug 2004 13:42:47 -0000
@@ -70,7 +70,7 @@
ap_mpm_query(AP_MPMQ_MAX_DAEMON_USED, &max_daemons);
- for (tries = terminate ? 4 : 1; tries <= 9; ++tries) {
+ for (tries = terminate ? 4 : 1; tries <= 10; ++tries) {
/* don't want to hold up progress any more than
* necessary, but we need to allow children a few moments to exit.
* Set delay with an exponential backoff.
@@ -98,13 +98,15 @@
switch (tries) {
case 1: /* 16ms */
case 2: /* 82ms */
+ break;
+
case 3: /* 344ms */
- case 4: /* 16ms */
+ waittime = 16 * 1024;
break;
-
- case 5: /* 82ms */
- case 6: /* 344ms */
- case 7: /* 1.4sec */
+
+ case 4: /* 360ms */
+ case 5: /* 425ms */
+ case 6: /* 688ms */
/* ok, now it's being annoying */
ap_log_error(APLOG_MARK, APLOG_WARNING,
0, ap_server_conf,
@@ -114,7 +116,7 @@
kill(pid, SIGTERM);
break;
- case 8: /* 6 sec */
+ case 7: /* 1.74 sec */
/* die child scum */
ap_log_error(APLOG_MARK, APLOG_ERR,
0, ap_server_conf,
@@ -132,9 +134,14 @@
*/
kill_thread(pid);
#endif
+ waittime = 16 * 1024;
+ break;
+
+ case 8: /* 1.75 sec */
+ case 9: /* 1.82 sec */
break;
- case 9: /* 14 sec */
+ case 10: /* 2.08 secs */
/* gave it our best shot, but alas... If this really
* is a child we are trying to kill and it really hasn't
* exited, we will likely fail to bind to the port