Hi!
We've been having recurring problems with mpm_event: old
workers/processes sometimes stick around after server reloads
triggered by certificate updates (most often because we use Let's
Encrypt certs) and, less often, config changes. At least, that's what
we think is happening.
I seem to remember fixes on this theme some time ago, but it doesn't
seem to be fully resolved yet, at least not in our setup.
I currently see this on an Ubuntu 18.04 machine, from server-status
(raw cut & paste, so the formatting isn't the best):
--------------------8<---------------------------
Server Version: Apache/2.4.46 (Unix) OpenSSL/1.1.1
Server MPM: event
Server Built: Oct 21 2020 13:48:44
Current Time: Tuesday, 30-Mar-2021 16:13:23 CEST
Restart Time: Sunday, 21-Feb-2021 12:15:56 CET
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 37 days 2 hours 57 minutes 26 seconds
Server load: 0.58 1.45 1.51
Total accesses: 266154607 - Total Traffic: 51438.2 GB - Total Duration: 24778100508
CPU Usage: u64484.4 s43175.7 cu226122 cs121786 - 14.2% CPU load
83 requests/sec - 16.4 MB/second - 202.7 kB/request - 93.0966 ms/request
13 requests currently being processed, 762 idle workers
Slot PID Stopping Connections(total accepting) Threads(busy idle) Async(writing keep-alive closing)
0 970153 no (old gen) 2 yes 1 63 0 0 0
1 360080 no (old gen) 0 yes 0 64 0 0 0
2 770373 no (old gen) 0 yes 0 64 0 0 0
3 810318 no (old gen) 0 yes 0 64 0 0 0
4 921494 no 0 yes 0 64 0 0 0
5 970233 no (old gen) 0 yes 0 64 0 0 0
7 612077 no (old gen) 0 yes 0 64 0 0 0
8 49423 no (old gen) 0 yes 0 64 0 0 0
9 49521 no (old gen) 1 yes 0 64 0 0 0
10 955271 no (old gen) 0 yes 0 64 0 0 0
13 955426 no (old gen) 2 yes 0 64 0 0 0
14 154811 no 0 yes 0 64 0 0 0
15 558125 no (old gen) 3 yes 3 61 0 1 0
16 558205 no (old gen) 0 yes 0 64 0 0 0
17 603555 no (old gen) 2 yes 3 61 0 0 0
18 558451 no (old gen) 0 yes 0 64 0 0 0
19 587269 no (old gen) 0 yes 0 64 0 0 0
22 955577 no (old gen) 0 yes 0 64 0 0 0
24 538389 no 0 yes 0 64 0 0 0
26 538401 no 0 yes 0 64 0 0 0
28 538435 no 0 yes 0 64 0 0 0
36 538979 no 0 yes 0 64 0 0 0
51 540034 no 0 yes 0 64 0 0 0
60 540326 no 0 yes 0 64 0 0 0
62 540379 no 7 yes 6 58 0 0 0
66 540457 no 0 yes 0 64 0 0 0
73 540666 no 0 yes 0 64 0 0 0
75 540721 no 0 yes 0 64 0 0 0
Sum 28 0 17 13 1779 0 1 0
--------------------8<---------------------------
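For what it's worth, a quick way to pick out the lingering slots from a
table like the one above is to filter for rows marked "(old gen)" that
still show "yes" under accepting. A minimal sketch (the two sample rows
are copied from the table above; in practice you'd pipe in the slot rows
from the server-status page instead of the here-doc):

```shell
# Print PIDs of "old gen" slots that are still accepting connections.
# In these rows, field 4 starts the "(old gen)" marker and field 7 is
# the "accepting" column; plain "no" rows (current generation) have
# their columns shifted and won't match.
awk '$4 == "(old" && $7 == "yes" { print "lingering old-gen PID:", $2 }' <<'EOF'
0 970153 no (old gen) 2 yes 1 63 0 0 0
4 921494 no 0 yes 0 64 0 0 0
EOF
```

Here only 970153 is flagged; 921494 is a current-generation process and
is expected to keep accepting.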
So, what strikes me as very odd is that we have a bunch of PIDs that
are marked as "old gen", but are not stopping (and thus still
accepting new connections). Shouldn't "old gen" processes by default
stop accepting new connections?
Things become very unfun when the old processes sometimes serve
connections while holding on to an expired Let's Encrypt certificate.
Murphy ensures that our tests never hit the old pids, but users always
do...
The start times of the "old gen" processes vary a bit:
--------------------8<---------------------------
970153 Mar28
360080 Mar22
770373 Mar19
810318 Mar28
970233 Mar28
612077 Mar28
49423 Mar29
49521 Mar29
955271 Mar28
955426 Mar28
558125 Mar20
558205 Mar20
603555 Mar20
558451 Mar20
587269 Mar28
955577 Mar28
--------------------8<---------------------------
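(For reference, the start times above came from something along these
lines; "apache2" is the binary name on Ubuntu, adjust to "httpd" or
whatever yours is called:)

```shell
# Show pid, full start timestamp and command name for every running
# apache2 process; -C selects processes by command name.
ps -o pid=,lstart=,comm= -C apache2
```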
Although I don't know if you can draw any conclusions from that...
In any case, I'm a bit out of my depth trying to figure out where this
problem might originate. And reproducing it is hard; we're only seeing
it occasionally...
Ideas?
/Nikke
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | ni...@acc.umu.se
---------------------------------------------------------------------------
The greatest productive force is selfishness!
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=