Got a strange problem report from an Apache 1.3.12 customer. One of the
local IBMers is able to reproduce it on his Solaris 2.6 box, so I've
seen it with my own eyes and had some hands-on time with it.
With 14 piped rotatelogs directives in the config file, "apachectl stop"
doesn't produce the desired result most of the time. With 12 such
directives, it works fine.
The httpd parent process is hung, waiting for otherchild logic to clean
up the piped logs processes. "ps -ef | grep rotatelogs" shows that some
of the /bin/sh processes and their associated rotatelogs processes have
gone down cleanly, but other pairs are still there.
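
To make the leftover pairs easier to spot, here is a rough sketch that
matches each surviving rotatelogs process to its shell parent from saved
"ps -ef" text. The function name pair_leftovers and the sample listing
are made up for illustration, and it assumes the usual eight-column
"ps -ef" layout with no space inside the STIME field.

```python
def pair_leftovers(ps_output):
    """Return (shell_pid, rotatelogs_pid) pairs found in `ps -ef` text.

    Assumes columns UID PID PPID C STIME TTY TIME CMD (hypothetical helper).
    """
    procs = {}
    for line in ps_output.splitlines():
        fields = line.split(None, 7)
        # Skip the header row and anything that is not a full process line.
        if len(fields) < 8 or not fields[1].isdigit() or not fields[2].isdigit():
            continue
        procs[int(fields[1])] = (int(fields[2]), fields[7])

    pairs = []
    for pid, (ppid, cmd) in sorted(procs.items()):
        parent = procs.get(ppid)
        if parent is None or "rotatelogs" not in cmd:
            continue
        # The parent counts as a shell if its first word is sh or .../sh.
        parent_prog = parent[1].split()[0]
        if parent_prog == "sh" or parent_prog.endswith("/sh"):
            pairs.append((ppid, pid))
    return pairs


# Made-up sample listing, not output from the customer's machine.
SAMPLE = """\
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root  1000     1  0 10:00:00 ?        0:00 /usr/local/apache/bin/httpd
    root  1001  1000  0 10:00:00 ?        0:00 /bin/sh -c /usr/local/apache/bin/rotatelogs /logs/a_log 86400
    root  1002  1001  0 10:00:00 ?        0:00 /usr/local/apache/bin/rotatelogs /logs/a_log 86400
"""

if __name__ == "__main__":
    for shell_pid, child_pid in pair_leftovers(SAMPLE):
        print("shell", shell_pid, "-> rotatelogs", child_pid)
```

Any pair this reports after "apachectl stop" is one the parent is still
waiting on.
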
truss shows the parent is periodically sending SIGTERM to the /bin/sh
processes (which are registered via otherchild), then there's a 4x
exponential backoff to an ungodly long time interval (over 15 minutes,
IIRC). So far, working as coded, from the Apache point of view.
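
For a feel of how fast a 4x backoff gets out of hand, here is a small
sketch of the arithmetic. The starting delay (16384 microseconds) and
the number of tries (9) are assumptions chosen so that the last interval
lands above 15 minutes, in line with what truss showed; I have not
checked them against the 1.3.12 source.

```python
def backoff_schedule(first_wait_us=16384, factor=4, tries=9):
    """Return the wait before each try, in seconds.

    All three defaults are assumed values for illustration only.
    """
    waits = []
    wait_us = first_wait_us
    for _ in range(tries):
        waits.append(wait_us / 1e6)
        wait_us *= factor  # each wait is `factor` times the previous one
    return waits


if __name__ == "__main__":
    for attempt, seconds in enumerate(backoff_schedule(), start=1):
        print("try %d: wait %.3f s (%.2f min)" % (attempt, seconds, seconds / 60))
```

With those assumed values the final wait comes to roughly 1074 seconds,
a bit under 18 minutes, which is why the parent looks hung rather than
merely slow.
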
But further truss runs and experiments with "kill" from the console seem
to indicate that the /bin/sh processes are not doing anything with the
SIGTERM, or possibly it's getting lost in kernel-land before it gets to
our rotatelogs processes. "kill <rotatelogs_pid>" works as expected -
that process and its shell parent both go away quickly; "kill
<shell_pid>" is no help.
If that's not odd enough, the customer says that sometimes if he uses
truss (on the httpd parent, I believe), the problem mysteriously goes
away. It sure didn't go away for me.
Help! Has anybody seen this kind of thing before? Any other debugging
ideas?
Greg (who would love to hear that it's fixed in Solaris x.y)