Hi!

We've been having recurring problems with mpm_event where old workers/processes sometimes stick around after server reloads triggered by cert updates (most often Let's Encrypt renewals) and, more seldom, config changes. At least, that's what we think is happening.
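(For reference, the reload itself is nothing exotic; the renewal hook boils down to a graceful restart. The below is just an illustrative sketch, not our exact scripts:)

--------------------8<---------------------------
# Illustrative only -- the renewal hook amounts to a graceful reload:
#   certbot renew --deploy-hook 'apachectl -k graceful'
# A graceful restart tells the parent to re-read config/certs and spawn a new
# generation of workers; the old generation is supposed to finish its current
# connections and then exit.
apachectl -k graceful
--------------------8<---------------------------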

I seem to remember fixes in this area some time ago, but it doesn't seem to be completely solved yet, at least not in our setup.

I currently see this on an Ubuntu 18.04 machine, from server-status (raw cut&paste, so the formatting isn't the best):

--------------------8<---------------------------
Server Version: Apache/2.4.46 (Unix) OpenSSL/1.1.1
Server MPM: event
Server Built: Oct 21 2020 13:48:44

Current Time: Tuesday, 30-Mar-2021 16:13:23 CEST
Restart Time: Sunday, 21-Feb-2021 12:15:56 CET
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 37 days 2 hours 57 minutes 26 seconds
Server load: 0.58 1.45 1.51
Total accesses: 266154607 - Total Traffic: 51438.2 GB - Total Duration: 24778100508
CPU Usage: u64484.4 s43175.7 cu226122 cs121786 - 14.2% CPU load
83 requests/sec - 16.4 MB/second - 202.7 kB/request - 93.0966 ms/request
13 requests currently being processed, 762 idle workers

Slot    PID     Stopping        Connections        Threads       Async connections
                                total  accepting   busy   idle   writing  keep-alive  closing
0       970153  no (old gen)    2       yes     1       63      0       0       0
1       360080  no (old gen)    0       yes     0       64      0       0       0
2       770373  no (old gen)    0       yes     0       64      0       0       0
3       810318  no (old gen)    0       yes     0       64      0       0       0
4       921494  no      0       yes     0       64      0       0       0
5       970233  no (old gen)    0       yes     0       64      0       0       0
7       612077  no (old gen)    0       yes     0       64      0       0       0
8       49423   no (old gen)    0       yes     0       64      0       0       0
9       49521   no (old gen)    1       yes     0       64      0       0       0
10      955271  no (old gen)    0       yes     0       64      0       0       0
13      955426  no (old gen)    2       yes     0       64      0       0       0
14      154811  no      0       yes     0       64      0       0       0
15      558125  no (old gen)    3       yes     3       61      0       1       0
16      558205  no (old gen)    0       yes     0       64      0       0       0
17      603555  no (old gen)    2       yes     3       61      0       0       0
18      558451  no (old gen)    0       yes     0       64      0       0       0
19      587269  no (old gen)    0       yes     0       64      0       0       0
22      955577  no (old gen)    0       yes     0       64      0       0       0
24      538389  no      0       yes     0       64      0       0       0
26      538401  no      0       yes     0       64      0       0       0
28      538435  no      0       yes     0       64      0       0       0
36      538979  no      0       yes     0       64      0       0       0
51      540034  no      0       yes     0       64      0       0       0
60      540326  no      0       yes     0       64      0       0       0
62      540379  no      7       yes     6       58      0       0       0
66      540457  no      0       yes     0       64      0       0       0
73      540666  no      0       yes     0       64      0       0       0
75      540721  no      0       yes     0       64      0       0       0
Sum     28      0       17              13      1779    0       1       0
--------------------8<---------------------------

So, what strikes me as very odd is that we have a bunch of PIDs that are marked as "old gen", but are not stopping (and thus still accepting new connections). Shouldn't "old gen" processes by default stop accepting new connections?
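(One way to cross-check, nothing specific to our setup, would be the usual ss listener check to see which PIDs still hold the HTTPS listening socket:)

--------------------8<---------------------------
# As root: list which processes hold the port-443 listener. With mpm_event the
# children inherit the listening sockets from the parent, so "old gen" PIDs
# appearing here would be consistent with them still accepting connections.
ss -tlpn '( sport = :443 )'
--------------------8<---------------------------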

Things become very unfun when the old processes occasionally serve connections while holding on to an expired Let's Encrypt certificate. Murphy ensures that our tests never hit the old PIDs, but users always do...
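(The way this tends to surface is via a plain certificate-expiry check; the hostname below is just a placeholder:)

--------------------8<---------------------------
# Repeated handshakes against the vhost; most land on new-generation workers
# and show the renewed cert, but one hitting an "old gen" worker reports the
# stale notAfter date. www.example.org is a placeholder.
echo | openssl s_client -connect www.example.org:443 -servername www.example.org 2>/dev/null \
    | openssl x509 -noout -enddate
--------------------8<---------------------------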

The start times of the "old gen" processes vary a bit:

--------------------8<---------------------------
970153 Mar28
360080 Mar22
770373 Mar19
810318 Mar28
970233 Mar28
612077 Mar28
 49423 Mar29
 49521 Mar29
955271 Mar28
955426 Mar28
558125 Mar20
558205 Mar20
603555 Mar20
558451 Mar20
587269 Mar28
955577 Mar28
--------------------8<---------------------------
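(That's just ps start times for the PIDs above, roughly along these lines; PID list abbreviated:)

--------------------8<---------------------------
# Start times of the "old gen" PIDs from server-status (list shortened here):
ps -o pid=,start_time= -p 970153,360080,770373,810318
--------------------8<---------------------------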

Although I don't know if you can draw any conclusions from that...

In any case, I'm a bit out of my depth trying to figure out where this problem might originate. And reproducing it is hard; we're only seeing it occasionally...

Ideas?


/Nikke
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     ni...@acc.umu.se
---------------------------------------------------------------------------
 The greatest productive force is selfishness!
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
