Brian Paine wrote:
> Can you grab a stack trace from each of the httpd processes (e.g., with pstack, or by scripting gdb) to see what they're doing when the load drops to zero?

Another idea:
ps axo pid,ppid,wchan,comm | grep httpd
...when it's being bad (i.e. when the load is 0 and it shouldn't be).
The wchan field shows the first n characters of the name of the kernel function the process is blocked in, or "-" if it's not blocked in the kernel. ppid will tell you which process is the parent, which probably isn't the culprit.
It sounds like a process is occasionally grabbing the accept mutex and hanging on to it for an unreasonable amount of time. It could be blocked in accept or poll (normal when the server is idle; abnormal if you have unprocessed ESTABLISHED httpd connections) or looping (unlikely if the load is zero).
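To tell the "idle" case from the "stuck" case, one rough check is to count ESTABLISHED connections on the server port that have data sitting unread in the receive queue. This is only a sketch: it assumes Linux net-tools style `netstat -tn` output (Recv-Q in column 2, local address in column 4, state in column 6) and port 80, and the function name waiting_established is made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumes Linux net-tools "netstat -tn" column layout:
#   Proto Recv-Q Send-Q Local-Address Foreign-Address State
# waiting_established is a hypothetical helper name.

# Reads netstat output on stdin and prints the number of ESTABLISHED
# connections on the given local port with unread bytes (Recv-Q > 0).
waiting_established() {
    port=$1
    awk -v port="$port" '
        $6 == "ESTABLISHED" && $4 ~ (":" port "$") && $2 > 0 { n++ }
        END { print n + 0 }'
}

# Example use (port 80 is an assumption):
#   netstat -tn | waiting_established 80
```

If that number stays above zero while the load is 0, requests are arriving and nobody is reading them, which fits the stuck-mutex theory better than a genuinely idle server.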
In any case, I suspect the wchan field for the guilty process will be different from that of the others. If that's true, then its value should give us a hint. If you're quick (or write a little Perl script), maybe you could attach to that process with gdb before it breaks loose and see what the stack looks like.
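Here is an untested sketch of that "little script" idea, in plain sh rather than Perl. It assumes a procps-style ps, that the server binary is named httpd, and that gdb is on the PATH (attaching will probably need root). The names odd_wchan and dump_odd_stacks are made up for illustration.

```shell
#!/bin/sh
# Sketch only. odd_wchan and dump_odd_stacks are hypothetical helper names.

# Reads "pid ppid wchan comm" lines on stdin. Skips any process that is
# the parent of another listed process, then prints "pid wchan" for each
# remaining process whose wchan is not the most common value.
odd_wchan() {
    awk '
        { pid[NR] = $1; w[NR] = $3; isparent[$2] = 1 }
        END {
            max = 0
            for (i = 1; i <= NR; i++)
                if (!(pid[i] in isparent)) count[w[i]]++
            for (k in count)
                if (count[k] > max) { max = count[k]; common = k }
            for (i = 1; i <= NR; i++)
                if (!(pid[i] in isparent) && w[i] != common)
                    print pid[i], w[i]
        }'
}

# Snapshots the httpd processes and dumps a backtrace for each odd one out.
# gdb -batch exits (and detaches) once the bt command has run.
dump_odd_stacks() {
    ps axo pid,ppid,wchan,comm | awk '$4 == "httpd"' | odd_wchan |
    while read -r pid wchan; do
        echo "== pid $pid blocked in $wchan =="
        gdb -p "$pid" -batch -ex bt 2>&1
    done
}
```

Run dump_odd_stacks (by hand or from a loop) while the load is stuck at 0. If two wchan values are equally common, the choice of "normal" is arbitrary, so eyeball the raw ps output as well.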
Greg
