I'm trying to debug a problem where apparently the accept mutex went bad on a z/OS system running the worker MPM. I'm guessing that some memory we use for the semaphore got clobbered, but I don't have proof yet. The error log looks like:
[Mon Sep 07 08:01:59 2009] [emerg] (121)EDC5121I Invalid argument.: apr_proc_mutex_unlock failed. Attempting to shutdown process gracefully.
[Mon Sep 07 08:02:01 2009] [emerg] (121)EDC5121I Invalid argument.: apr_proc_mutex_lock failed. Attempting to shutdown process gracefully.
[Mon Sep 07 08:02:02 2009] [emerg] (121)EDC5121I Invalid argument.: apr_proc_mutex_lock failed. Attempting to shutdown process gracefully.
[Mon Sep 07 08:02:02 2009] [emerg] (121)EDC5121I Invalid argument.: apr_proc_mutex_lock failed. Attempting to shutdown process gracefully.
[...]

The rest of the error log is filled with lock failures. Looking at the timestamps, you can see that perform_idle_server_maintenance went into exponential expansion, maxing out at about 24 lock failures per second. Unfortunately the fork()s were faster than z/OS could terminate the processes that had detected the mutex problem, so after forking 978 httpd children, the system ran out of real memory and had to be IPLed.

One of my colleagues asked why ServerLimit 64 didn't stop the fork bomb. Good question. The reason is that the error path calls signal_threads(), which causes the child to exit gracefully. The listener thread sets ps->quiescing on the way out, which allows the "squatting" logic in perform_idle_server_maintenance to take over the scoreboard slot before the previous process has completely exited, bypassing the ServerLimit throttle. (There's a sketch of the squatting check at the end of this note.)

This raises several ideas for improvement:

* Should we do clean_child_exit(APEXIT_CHILDSICK or APEXIT_CHILDFATAL) for this error? We have a previous fix that detects accept mutex failures during restarts and tones down the error messages, and I don't recall seeing any false error messages since that fix went in. We could also use requests_this_child to detect whether this process has ever successfully served a request, and only do the clean_child_exit if it hasn't. (Rough sketch below.)

* Should we yank the squatting logic? I think it is doing us more harm than good. IIRC it was put in to make the server respond faster when the workload is spiky. A more robust solution may be to set MinSpareThreads and MaxSpareThreads farther apart and enforce ServerLimit unconditionally. Disclaimer: I created ps->quiescing, so I was an accomplice.

* Does it make sense to fork more than MaxSpareThreads worth of child processes at a time? MaxSpareThreads was 75 in this case, but we tried to fork at least 600 threads' worth of child processes (same as MaxClients) in one pass of perform_idle_server_maintenance. (Also sketched below.)

This applies to worker and event; some of it may also apply to prefork. I'd appreciate thoughts and suggestions before committing anything.

Thanks,
Greg
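P.S. For anyone who hasn't stared at this code recently, here is a simplified sketch of the squatting check in perform_idle_server_maintenance. This is paraphrased from memory rather than copied from worker.c, so don't trust the details, but the shape is right:

    /* the parent scans the scoreboard looking for a slot that can
     * hold a new child
     */
    process_score *ps = &ap_scoreboard_image->parent[i];

    if (any_dead_threads
        && totally_free_length < idle_spawn_rate
        && (!ps->pid             /* no process is using the slot...   */
            || ps->quiescing)) { /* ...or the one that is, is leaving */
        /* "squatting": claim the slot for a new child before the old
         * child has finished exiting.  The quiescing child no longer
         * counts against ServerLimit, so a child that exits
         * gracefully on every mutex failure lets us fork without
         * bound.
         */
        free_slots[free_length++] = i;
    }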
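For the first bullet, this is roughly what I have in mind for the listener thread's lock-failure path. have_served_request is a placeholder for whatever requests_this_child bookkeeping we settle on, not an existing variable:

    rv = SAFE_ACCEPT(apr_proc_mutex_lock(accept_mutex));
    if (rv != APR_SUCCESS) {
        ap_log_error(APLOG_MARK, APLOG_EMERG, rv, ap_server_conf,
                     "apr_proc_mutex_lock failed. Attempting to "
                     "shutdown process gracefully.");
        if (!have_served_request) {
            /* this child never served a request; assume the mutex is
             * hosed and tell the parent to stop respawning into the
             * same problem
             */
            clean_child_exit(APEXIT_CHILDSICK);
        }
        signal_threads(ST_GRACEFUL);
    }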
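And for the third bullet, one possibility is to cap each maintenance pass at the spare-thread deficit, so a single pass never forks more children than it would take to get back to MaxSpareThreads worth of idle threads. wanted_children is a new local I invented for the sketch; the rest follows the existing code:

    if (idle_thread_count < min_spare_threads) {
        /* round the spare-thread deficit up to whole children */
        int wanted_children = (max_spare_threads - idle_thread_count
                               + ap_threads_per_child - 1)
                              / ap_threads_per_child;

        if (free_length > wanted_children) {
            free_length = wanted_children;
        }
        for (i = 0; i < free_length; ++i) {
            make_child(ap_server_conf, free_slots[i]);
        }
    }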