I'm trying to debug a problem where apparently the accept mutex went bad on a z/OS system running the worker MPM. I'm guessing that some memory we use for the semaphore got clobbered, but I don't have proof yet. The error log looks like:
[Mon Sep 07 08:01:59 2009] [emerg] (121)EDC5121I Invalid argument.: apr_proc_mutex_unlock failed. Attempting to shutdown process gracefully.
[Mon Sep 07 08:02:01 2009] [emerg] (121)EDC5121I Invalid argument.: apr_proc_mutex_lock failed. Attempting to shutdown process gracefully.
[Mon Sep 07 08:02:02 2009] [emerg] (121)EDC5121I Invalid argument.: apr_proc_mutex_lock failed. Attempting to shutdown process gracefully.
[Mon Sep 07 08:02:02 2009] [emerg] (121)EDC5121I Invalid argument.: apr_proc_mutex_lock failed. Attempting to shutdown process gracefully.
[...]

The rest of the error log is filled with lock failures. Looking at the timestamps, you can see that perform_idle_server_maintenance went into exponential expansion, maxing out at about 24 lock failures per second. Unfortunately the fork()s were faster than z/OS could terminate the processes that had detected the mutex problem, so after forking 978 httpd children, the system ran out of real memory and had to be IPLed.

One of my colleagues asked why ServerLimit 64 didn't stop the fork bomb. Good question. The reason is that the error path calls signal_threads(), which causes the child to exit gracefully. The listener thread sets ps->quiescing on the way out, which allows the "squatting" logic in perform_idle_server_maintenance to take over the scoreboard slot before the previous process has completely exited, bypassing the ServerLimit throttle. (There's a sketch of the squatting check at the end of this note.)

This raises several ideas for improvement:

* Should we do clean_child_exit(APEXIT_CHILDSICK or APEXIT_CHILDFATAL) for this error? We have a previous fix that detects accept mutex failures during restarts and tones down the error messages, and I don't recall seeing any false error messages since that fix went in. We could also use requests_this_child to detect whether this process has ever successfully served a request, and only do the clean_child_exit if it hasn't. (Rough sketch below.)

* Should we yank the squatting logic? I think it is doing us more harm than good. IIRC it was put in to make the server respond faster when the workload is spiky. A more robust solution may be to set MinSpareThreads and MaxSpareThreads farther apart and enforce ServerLimit unconditionally. Disclaimer: I created ps->quiescing, so I was an accomplice.

* Does it make sense to fork more than MaxSpareThreads worth of child processes at a time? MaxSpareThreads was 75 in this case, but we tried to fork at least 600 threads' worth of child processes (same as MaxClients) in one pass of perform_idle_server_maintenance. (Also sketched below.)

This applies to worker and event; some of it may also apply to prefork. I'd appreciate thoughts and suggestions before committing anything.

Thanks,
Greg
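P.S. For anyone who hasn't stared at this code recently, here is a simplified sketch of the squatting check in perform_idle_server_maintenance. This is paraphrased from memory rather than copied from worker.c, so don't trust the details, but the shape is right:

    /* the parent scans the scoreboard looking for a slot that can
     * hold a new child
     */
    process_score *ps = &ap_scoreboard_image->parent[i];

    if (any_dead_threads
        && totally_free_length < idle_spawn_rate
        && (!ps->pid             /* no process is using the slot...   */
            || ps->quiescing)) { /* ...or the one that is, is leaving */
        /* "squatting": claim the slot for a new child before the old
         * child has finished exiting.  The quiescing child no longer
         * counts against ServerLimit, so a child that exits
         * gracefully on every mutex failure lets us fork without
         * bound.
         */
        free_slots[free_length++] = i;
    }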
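For the first bullet, this is roughly what I have in mind for the listener thread's lock-failure path. have_served_request is a placeholder for whatever requests_this_child bookkeeping we settle on, not an existing variable:

    rv = SAFE_ACCEPT(apr_proc_mutex_lock(accept_mutex));
    if (rv != APR_SUCCESS) {
        ap_log_error(APLOG_MARK, APLOG_EMERG, rv, ap_server_conf,
                     "apr_proc_mutex_lock failed. Attempting to "
                     "shutdown process gracefully.");
        if (!have_served_request) {
            /* this child never served a request; assume the mutex is
             * hosed and tell the parent to stop respawning into the
             * same problem
             */
            clean_child_exit(APEXIT_CHILDSICK);
        }
        signal_threads(ST_GRACEFUL);
    }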
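And for the third bullet, one possibility is to cap each maintenance pass at the spare-thread deficit, so a single pass never forks more children than it would take to get back to MaxSpareThreads worth of idle threads. wanted_children is a new local I invented for the sketch; the rest follows the existing code:

    if (idle_thread_count < min_spare_threads) {
        /* round the spare-thread deficit up to whole children */
        int wanted_children = (max_spare_threads - idle_thread_count
                               + ap_threads_per_child - 1)
                              / ap_threads_per_child;

        if (free_length > wanted_children) {
            free_length = wanted_children;
        }
        for (i = 0; i < free_length; ++i) {
            make_child(ap_server_conf, free_slots[i]);
        }
    }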