Hi all,
since around 2.4.26 I notice occasional hangs during release testing
only on Solaris 10 Sparc when using MPM prefork.
Those hangs were always observed during proxy testing. The situation was
in most cases (all?), that only one running httpd process was left, so
it would accept the incoming request but could not proxy them and each
request ran into the proxy timeout (I think 1 minute in the test). The
process was in fact an httpd child process. Parent process and other
children were gone.
I now narrowed it down some more. I am not sure, this is the only
situation, but at least this is one I do observe most often:
- no problems with event or worker MPM
- no problems on REHL and SLES (Linux)
- even on Solaris, running without mod_watchdog (and therefore without
mod_heartbeat, mod_heartmonitor, mod_lbmethod_heartbeat and
mod_proxy_hcheck) it seems the hang does not occur.
- as soon as I enable mod_watchdog only (but not the above modules that
would use it), the hangs start to happen every now and then.
The pattern seems to be (from the error log):
[Sat Jan 26 21:36:01.855207 2019] [core:trace4] [pid 7850]
mpm_common.c(536): mpm child 7853 (gen 0/slot 0) started
[Sat Jan 26 21:36:01.875826 2019] [mpm_prefork:notice] [pid 7850]
AH00163: Apache/2.4.38 (Unix) OpenSSL/1.0.2 configured -- resuming
normal operations
[Sat Jan 26 21:36:01.875975 2019] [mpm_prefork:info] [pid 7850] AH00164:
Server built: Jan 18 2019 17:54:50
[Sat Jan 26 21:36:01.876208 2019] [core:notice] [pid 7850] AH00094:
Command line: '/path/to/bin/httpd -d /path/to/t -f
/path/to/t/conf/httpd.conf -D APACHE2 -D APACHE2_4'
[Sat Jan 26 21:36:01.876340 2019] [core:debug] [pid 7850] log.c(1571):
AH02639: Using SO_REUSEPORT: no (1)
[Sat Jan 26 21:36:01.876417 2019] [mpm_prefork:debug] [pid 7850]
prefork.c(923): AH00165: Accept mutex: pthread (default: pthread)
[Sat Jan 26 21:36:01.875753 2019] [core:trace4] [pid 7850]
mpm_common.c(536): mpm child 7854 (gen 0/slot 1) started
...
[Sat Jan 26 21:36:18.969742 2019] [core:trace4] [pid 7850]
mpm_common.c(536): mpm child 7889 (gen 0/slot 2) started
...
SHOULD BE FINE UNTIL HERE
...
[Sat Jan 26 21:38:29.016961 2019] [mpm_prefork:error] [pid 7850]
AH00161: server reached MaxRequestWorkers setting, consider raising the
MaxRequestWorkers setting
...
UNCLEAR, WHETHER THAT IS UNEXPECTED, BUT THEN THE REAL ERROR
...
[Sat Jan 26 21:42:48.213005 2019] [mpm_prefork:emerg] [pid 7889]
(45)Deadlock situation detected/avoided: AH00144: couldn't grab the
accept mutex
[Sat Jan 26 21:42:48.490884 2019] [mpm_prefork:emerg] [pid 7853]
(45)Deadlock situation detected/avoided: AH00144: couldn't grab the
accept mutex
...
[Sat Jan 26 21:42:49.129401 2019] [core:alert] [pid 7850] AH00050: Child
7853 returned a Fatal error... Apache is exiting!
[Sat Jan 26 21:42:49.129519 2019] [:emerg] [pid 7850] AH02818: MPM run
failed, exiting
bin/httpd -V
Server version: Apache/2.4.38 (Unix)
Server built: Jan 18 2019 17:54:50
Server's Module Magic Number: 20120211:83
Server loaded: APR 1.6.5, APR-UTIL 1.6.1
Compiled using: APR 1.6.5, APR-UTIL 1.6.1
Architecture: 32-bit
Server MPM: prefork
threaded: no
forked: yes (variable process count)
Server compiled with....
-D APR_HAS_SENDFILE
-D APR_HAS_MMAP
-D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
-D APR_USE_PROC_PTHREAD_SERIALIZE
-D APR_USE_PTHREAD_SERIALIZE
-D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
-D APR_HAS_OTHER_CHILD
-D AP_HAVE_RELIABLE_PIPED_LOGS
-D DYNAMIC_MODULE_LIMIT=256
-D HTTPD_ROOT="..."
-D SUEXEC_BIN=".../bin/suexec"
-D DEFAULT_PIDLOG="logs/httpd.pid"
-D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
-D DEFAULT_ERRORLOG="logs/error_log"
-D AP_TYPES_CONFIG_FILE="conf/mime.types"
-D SERVER_CONFIG_FILE="conf/httpd.conf"
I remember Solaris having a very bad deadlock detection but that goes
back to using fcntl locks. Here we are using pthread locks. Since this
is APR 1.6, my old observation about pthread timed lock problems on
Solaris 10 also does not apply here.
I wonder what is special about prefork and mod_watchdog.
In addition, what I find strange is, that the accept mutex error lead to
child crashes, but not to a start of new child processes. Instead one of
the two child crashes (the second one) seems to be noticed by the parent:
[core:alert] [pid 7850] AH00050: Child 7853 returned a Fatal error...
Apache is exiting!
[:emerg] [pid 7850] AH02818: MPM run failed, exiting
and in fact the parent goes away but the third and last child process
stays and continues to handle requests. That doesn't seem right.
I also see complains about reaching MaxRequestWorkers in the prefork
tests on Solaris. I haven't verified it, but it seems they run with a
fixed limit of three child processes. On the one hand, I would only
expect two of them to be actually used at any time (request plus proxied
request), since the tests are not executed in parallel. So the message
is a bit strange. On the other hand the limit three (if I am right)
seems very limited.
Any hints, ideas, comments about prefork plus mod_watchdog, the
incomplete shutdown after AH00050 plus AH02818, the limit of 3 child
processes and the MaxRequestWorkers warning?
Questions, questions, ... ;)
Regards,
Rainer