Only about the part: "the limit of 3 child processes and the MaxRequestWorkers warning". That happens also on Linux. The sizing is part of the test framework and is insufficient, because apart from one child for a request plus another one for a proxied request on the backend, more child processes can be busy while sitting in KeepAliveTimeout from previous proxied requests (we are using persistent backend connections). The timeout is 5 seconds.

I added a little headroom to the prefork sizing in r1852306 to https://svn.apache.org/repos/asf/perl/Apache-Test/trunk.

The real problem, pthread mutex failures on Solaris with prefork and mod_watchdog is still open.

Regards,

Rainer

Am 27.01.2019 um 14:40 schrieb Rainer Jung:
Hi all,

since around 2.4.26 I notice occasional hangs during release testing only on Solaris 10 Sparc when using MPM prefork.

Those hangs were always observed during proxy testing. The situation was in most cases (all?), that only one running httpd process was left, so it would accept the incoming request but could not proxy them and each request ran into the proxy timeout (I think 1 minute in the test). The process was in fact an httpd child process. Parent process and other children were gone.

I now narrowed it down some more. I am not sure, this is the only situation, but at least this is one I do observe most often:

- no problems with event or worker MPM

- no problems on REHL and SLES (Linux)

- even on Solaris, running without mod_watchdog (and therefore without mod_heartbeat, mod_heartmonitor, mod_lbmethod_heartbeat and mod_proxy_hcheck) it seems the hang does not occur.

- as soon as I enable mod_watchdog only (but not the above modules that would use it), the hangs start to happen every now and then.

The pattern seems to be (from the error log):

[Sat Jan 26 21:36:01.855207 2019] [core:trace4] [pid 7850] mpm_common.c(536): mpm child 7853 (gen 0/slot 0) started [Sat Jan 26 21:36:01.875826 2019] [mpm_prefork:notice] [pid 7850] AH00163: Apache/2.4.38 (Unix) OpenSSL/1.0.2 configured -- resuming normal operations [Sat Jan 26 21:36:01.875975 2019] [mpm_prefork:info] [pid 7850] AH00164: Server built: Jan 18 2019 17:54:50 [Sat Jan 26 21:36:01.876208 2019] [core:notice] [pid 7850] AH00094: Command line: '/path/to/bin/httpd -d /path/to/t -f /path/to/t/conf/httpd.conf -D APACHE2 -D APACHE2_4' [Sat Jan 26 21:36:01.876340 2019] [core:debug] [pid 7850] log.c(1571): AH02639: Using SO_REUSEPORT: no (1) [Sat Jan 26 21:36:01.876417 2019] [mpm_prefork:debug] [pid 7850] prefork.c(923): AH00165: Accept mutex: pthread (default: pthread) [Sat Jan 26 21:36:01.875753 2019] [core:trace4] [pid 7850] mpm_common.c(536): mpm child 7854 (gen 0/slot 1) started
...
[Sat Jan 26 21:36:18.969742 2019] [core:trace4] [pid 7850] mpm_common.c(536): mpm child 7889 (gen 0/slot 2) started
...
SHOULD BE FINE UNTIL HERE
...
[Sat Jan 26 21:38:29.016961 2019] [mpm_prefork:error] [pid 7850] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
...
UNCLEAR, WHETHER THAT IS UNEXPECTED, BUT THEN THE REAL ERROR
...
[Sat Jan 26 21:42:48.213005 2019] [mpm_prefork:emerg] [pid 7889] (45)Deadlock situation detected/avoided: AH00144: couldn't grab the accept mutex [Sat Jan 26 21:42:48.490884 2019] [mpm_prefork:emerg] [pid 7853] (45)Deadlock situation detected/avoided: AH00144: couldn't grab the accept mutex
...
[Sat Jan 26 21:42:49.129401 2019] [core:alert] [pid 7850] AH00050: Child 7853 returned a Fatal error... Apache is exiting! [Sat Jan 26 21:42:49.129519 2019] [:emerg] [pid 7850] AH02818: MPM run failed, exiting

bin/httpd -V
Server version: Apache/2.4.38 (Unix)
Server built:   Jan 18 2019 17:54:50
Server's Module Magic Number: 20120211:83
Server loaded:  APR 1.6.5, APR-UTIL 1.6.1
Compiled using: APR 1.6.5, APR-UTIL 1.6.1
Architecture:   32-bit
Server MPM:     prefork
   threaded:     no
     forked:     yes (variable process count)
Server compiled with....
  -D APR_HAS_SENDFILE
  -D APR_HAS_MMAP
  -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
  -D APR_USE_PROC_PTHREAD_SERIALIZE
  -D APR_USE_PTHREAD_SERIALIZE
  -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
  -D APR_HAS_OTHER_CHILD
  -D AP_HAVE_RELIABLE_PIPED_LOGS
  -D DYNAMIC_MODULE_LIMIT=256
  -D HTTPD_ROOT="..."
  -D SUEXEC_BIN=".../bin/suexec"
  -D DEFAULT_PIDLOG="logs/httpd.pid"
  -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
  -D DEFAULT_ERRORLOG="logs/error_log"
  -D AP_TYPES_CONFIG_FILE="conf/mime.types"
  -D SERVER_CONFIG_FILE="conf/httpd.conf"

I remember Solaris having a very bad deadlock detection but that goes back to using fcntl locks. Here we are using pthread locks. Since this is APR 1.6, my old observation about pthread timed lock problems on Solaris 10 also does not apply here.

I wonder what is special about prefork and mod_watchdog.

In addition, what I find strange is, that the accept mutex error lead to child crashes, but not to a start of new child processes. Instead one of the two child crashes (the second one) seems to be noticed by the parent:

[core:alert] [pid 7850] AH00050: Child 7853 returned a Fatal error... Apache is exiting!
[:emerg] [pid 7850] AH02818: MPM run failed, exiting

and in fact the parent goes away but the third and last child process stays and continues to handle requests. That doesn't seem right.

I also see complains about reaching MaxRequestWorkers in the prefork tests on Solaris. I haven't verified it, but it seems they run with a fixed limit of three child processes. On the one hand, I would only expect two of them to be actually used at any time (request plus proxied request), since the tests are not executed in parallel. So the message is a bit strange. On the other hand the limit three (if I am right) seems very limited.

Any hints, ideas, comments about prefork plus mod_watchdog, the incomplete shutdown after AH00050 plus AH02818, the limit of 3 child processes and the MaxRequestWorkers warning?

Questions, questions, ... ;)

Regards,

Rainer

Reply via email to