Re: Solaris, prefork, accept mutex and mod_ext_filter (Was: Prefork MPM and mod_watchdog)

Rainer Jung Mon, 30 Mar 2020 02:38:09 -0700

Am 30.03.2020 um 11:28 schrieb Joe Orton:

On Mon, Mar 30, 2020 at 12:31:05AM +0200, Rainer Jung wrote:

I can now see the same problem on Linux (eg. RHEL 7, SLES 12 and SLES 15)
when doing testing for 2.4.43. I think it is not a regression and for me it
is not a showstopper, but something that would be nice to fix. Symptom is


[Sat Mar 28 15:53:21.121461 2020] [mpm_prefork:debug] [pid 4574]
prefork.c(919): AH00165: Accept mutex: pthread (default: pthread)
[Sat Mar 28 15:58:19.251858 2020] [mpm_prefork:emerg] [pid 6509] (22)Invalid
argument: AH00144: couldn't grab the accept mutex
[Sat Mar 28 15:58:20.624995 2020] [mpm_prefork:emerg] [pid 4902] (22)Invalid
argument: AH00146: couldn't release the accept mutex
[Sat Mar 28 15:58:21.410552 2020] [:emerg] [pid 4574] AH02818: MPM run
failed, exiting

happening during t/modules/ext_filter.t and as a result all httpd processes
are gone.

Although it happens only rarely, I think it does not happen when using APR
1.6, but only for 1.7 (1.7.0 and also for latest head). The accept mutex is
pthread based. There are changes in APR 1.7 for those.


This is interesting, I saw some similar mutex state snafu (Fedora 31)
but assumed it was some kind of PEBKAC issue and moved on.  You have
only seen this with prefork?

Yes, only with prefork. If the root cause would also apply formulti-threaded MPMs, then the escalation might be different, ie. notleading to all child processes exiting. I can not rule out this.

May be completely unrelated but since I added a grep for child segfaults
in error_log that has triggered with prefork at least once:

https://travis-ci.org/github/apache/httpd/jobs/665874906

That looks more like my rarely observed prefork shutdown crashes whenmod_watchdog shuts down after some other deinitialization that happenedbefore.


Regards,

Rainer

Regards, Joe

Since I only observe it for mod_ext_filter, there should be some dependency
with the forked perl process.

I didn't yet have the opportunity to check, how much of the follwing
description for Solaris also holds for Linux.

Regards,

Rainer

Am 03.02.2019 um 13:30 schrieb Rainer Jung:

I can now frequently reproduce running t/modules/ext_filter.t only. I
stripped the reproducer down to the part of t/modules/ext_filter.t where
it runs

POST "/apache/extfilter/out-limit/modules/cgi/perl_echo.pl", content =>
"foo and bar\n"

10 times in a row. If I increase the iteration count slightly to 20, the
problem happens nearly every time. I also replaced perl_echo.pl and
eval-cmd.pl by small C programs doing the echo and the s/foo/bar/ and
can still reproduce.

This test involved mod_ext_filter and LimitRequestBody.

It seems I can not reproduce, if I shorten the POST body, so that it no
longer hits the LimitRequestBody 6 configured for
/apache/extfilter/out-limit/.

What happens, but I do not understand is:

- the first few requests are handled by two children in an alternating
way. I can see the accept mutex being passed between these two children
using lwp_mutex_timedlock(ADDRESS, 0x00000000) and
lwp_mutex_unlock(ADDRESS) using truss (Solaris variant of strace). So
Solaris seems to implement the pthread mutexes via these lwp mutexes.

- after a few requests alternating between the two children, something
strange happens:

    - child A returns from port_getn() for 22 (probably part of the
accept() impl)
    - child A calls accept()
    - child A unlocks the accept mutex using lwp_mutex_unlock()
    - child B locks the accept mutex using lwp_mutex_timedlock()
    - child B calls port_associate for 22 (probably part of accept() impl)
    - child A handles the request
! - child A also calls port_associate for 22 (no sign of lock acquire)
! - child A returns from port_getn() for 22
    - child A calls accept() and starts to handle the connection
! - child B also returns from port_getn() for 22
! - child B gets EAGAIN for the accept()
    - child B calls port_associate for 22
    - child A handles the request
! - child A gets EDEADLK for pthread_mutex_lock() (this is from the
error_log; there's no system call for this instance of
pthread_mutex_lock() in the truss output). EDEADLK for
pthread_mutex_lock() means that this process already has that lock.
    - child B returns from port_getn() for 22
    - child B calls accept() and starts to handle the connection
    - child A exits (now B is the only child left)
    - child B proceeds request handling
    - child B does all further port_associate(), port_getn(), and
accept() plus connection and request handling. No more
lwp_mutex_timedlock() or lwp_mutex_unlock() for B, maybe due to some
optimization when only one process for a lock is left.


It is quite possible, that there is an underlying lwp_mutex or portfs
bug here. But it is strange, that this only comes up when used with
mod_ext_filter in combination with LimitRequestBody.

There was the fix

https://bz.apache.org/bugzilla/show_bug.cgi?id=60375

but I don't see, how this would influence the above list.

I can try to further narrow down (using static content to eliminate one
forked child; using LimitRequestBody without mod_exit-filter etc.), but
maybe someone already has an idea?

Regards,

Rainer

Re: Solaris, prefork, accept mutex and mod_ext_filter (Was: Prefork MPM and mod_watchdog)

Reply via email to