https://bz.apache.org/bugzilla/show_bug.cgi?id=63975

            Bug ID: 63975
           Summary: Some processes never terminate after graceful restart
           Product: Apache httpd-2
           Version: 2.4.38
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: mpm_worker
          Assignee: [email protected]
          Reporter: [email protected]
  Target Milestone: ---

Hello,

we have encountered a problem which manifests itself mostly on busy webservers.
The system accumulates inactive (sleeping) Apache processes over time and
eventually becomes unable to serve requests as no more workers can be spawned
by the main Apache process. We are not seeing this behaviour on older servers,
so I would guess this was introduced between Apache versions 2.4.25 (as found
in Debian Stretch) and 2.4.38 (Debian Buster).

The specific error appearing in the log file when the server fails is
"2019-11-27_20:06:09 (11)Resource temporarily unavailable: AH03142:
apr_thread_create: unable to create worker thread". There is no mention of
which specific resource was exhausted; my guess would be some
process/thread-related limit, because at that point the server usually runs
hundreds of Apache processes and thousands or tens of thousands of threads
(despite ServerLimit, which is set to 64).
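
The process and thread counts mentioned above can be tallied directly from
/proc. This is only a minimal sketch, assuming Linux and that the processes
are named apache2 (the Debian binary name); the function names and the
proc_root parameter are illustrative. pthread_create(3) documents EAGAIN as
covering limits such as RLIMIT_NPROC, /proc/sys/kernel/threads-max and
/proc/sys/kernel/pid_max, so those would be the values to compare against.

```python
import os


def parse_status(text):
    """Return (name, thread_count) from the contents of /proc/<pid>/status."""
    name, threads = None, 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "Threads":
            threads = int(value.strip())
    return name, threads


def count_processes_and_threads(process_name="apache2", proc_root="/proc"):
    """Count processes called process_name and the threads they own."""
    processes = threads = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, nthreads = parse_status(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if name == process_name:
            processes += 1
            threads += nthreads
    return processes, threads
```

Calling count_processes_and_threads() on an affected host returns a
(processes, threads) pair that can be logged periodically to see how fast the
old processes pile up after each graceful restart.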

Similar behaviour can be observed with the event MPM: it seems to honour
ServerLimit, but most servers are in the "Stopping: yes (old gen)" state and
eventually the error log gets flooded with "AH03490: scoreboard is full, not at
MaxRequestWorkers.Increase ServerLimit." and no requests are served.

The only other suspicious or unusual error we found in the log is
"2019-11-19_18:16:37 AH00291: long lost child came home!". I am not sure
whether it is relevant, though.

I tried to analyze what an inactive process is doing. According to ls -l on
its /proc/pid/fd directory, it has one socket open:

lrwx------ 1 root root 64 Nov 28 11:33 24 -> 'socket:[350469138]'

However, there is no network connection associated with that socket, according
to lsof -n | grep 350469138:

apache2   46425                 www-data   24u   sock   0,9   0t0  350469138 protocol: TCP
apache2   46425 46448 apache2   www-data   24u   sock   0,9   0t0  350469138 protocol: TCP
apache2   46425 46453 apache2   www-data   24u   sock   0,9   0t0  350469138 protocol: TCP

(lsof would show the source and destination IP/port for an open connection.)
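
The same check can be made without lsof by looking the socket inode up in
/proc/net/tcp, which lists the kernel's TCP sockets together with their
addresses. What follows is a rough sketch, assuming Linux, an IPv4 socket and
a little-endian host such as x86-64 (an IPv6 socket would need /proc/net/tcp6
and a different address decoder); the function names are illustrative. A
result of None would be consistent with what lsof reported.

```python
import socket
import struct


def decode_ipv4(hex_addr):
    """Turn '0100007F:0050' from /proc/net/tcp into ('127.0.0.1', 80)."""
    ip_hex, port_hex = hex_addr.split(":")
    # Assumption: little-endian host, so the address bytes appear reversed.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def find_socket(inode, table_text):
    """Return (local, remote, state_hex) for inode, or None if not listed."""
    for line in table_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        # The inode is the tenth whitespace-separated column.
        if len(fields) > 9 and fields[9] == str(inode):
            return decode_ipv4(fields[1]), decode_ipv4(fields[2]), fields[3]
    return None


def lookup(inode, path="/proc/net/tcp"):
    """Look inode up in the live table at path."""
    with open(path) as fh:
        return find_socket(inode, fh.read())
```

For the process above, lookup(350469138) would be the call to try; the state
column of a match is a hex code (0A is LISTEN, 01 is ESTABLISHED).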

According to their /proc/pid/syscall, the process and its threads are mostly
waiting in futex(), sometimes in read() or epoll_wait(). According to strace,
there is no activity in the process or its threads.
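
For anyone who wants to repeat this inspection on many processes, the
per-thread view can be summarized with a short script. This is a sketch under
stated assumptions: Linux on x86-64 (the syscall numbers in the table are
specific to that architecture), enough privilege to read the syscall files
(typically root), and illustrative function names. The first field of each
file is the syscall number, "-1" means blocked outside a syscall, and
"running" means the thread is not blocked, as described in proc(5).

```python
import os
from collections import Counter

# x86-64 syscall numbers for the calls named in this report.
SYSCALL_NAMES = {0: "read", 202: "futex", 232: "epoll_wait"}


def parse_syscall_line(line):
    """Map one /proc/<pid>/task/<tid>/syscall line to a readable label."""
    fields = line.split()
    if not fields:
        return "unknown"
    if fields[0] == "running":
        return "running"
    if fields[0] == "-1":
        return "blocked (not in a syscall)"
    try:
        number = int(fields[0])
    except ValueError:
        return "unknown"
    return SYSCALL_NAMES.get(number, "syscall %d" % number)


def summarize_threads(pid, proc_root="/proc"):
    """Count how many threads of pid sit in each syscall."""
    counts = Counter()
    task_dir = os.path.join(proc_root, str(pid), "task")
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "syscall")) as fh:
                counts[parse_syscall_line(fh.read())] += 1
        except OSError:
            counts["unreadable"] += 1  # thread exited or permission denied
    return counts
```

summarize_threads(46425) would be the call for the process shown in the lsof
output; a process dominated by futex entries matches the description above.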

Processes stuck in this state do not respond to any attempt to terminate
them via a TERM or INT signal; they need to be KILLed. However, they are not
completely stuck: they terminate when the Apache main process is terminated
(which solves the issue but introduces a disruption of service, albeit a short
one).

Would it be possible - if the bug proves impossible to isolate - to hard-cap
the time a process takes during its graceful termination? The process I
analyzed was over 12 hours old and I suspect that no client was waiting for any
data from it for most of that time.

Thanks for looking into this.
