Ninja edit: The restart problem occurs 100% of the time under load. With reduced concurrency (threads=1), it was reproducible only about 1 time in 10. Rate-limiting traffic before it hits the daemons also helped somewhat, though restarts still failed maybe 80% of the time.
On Friday, November 18, 2016 at 7:13:35 AM UTC-5, [email protected] wrote:
>
> Thanks Graham.
>
> They look pretty normal:
>
> USER      PID %CPU %MEM    VSZ   RSS TTY STAT START TIME COMMAND
> root     2673  0.0  0.0  61272  3276 ?   Ss   04:09 0:01 /usr/sbin/httpd.worker
> apache   1201  0.1  0.0 739448  8668 ?   Sl   06:32 0:01 /usr/sbin/httpd.worker
> svcuser 12840  0.0  0.0 455436 22876 ?   Sl   05:19 0:03 daemon-display-name
> svcuser 23339  0.0  0.0 237320  5392 ?   Sl   Nov17 0:00 daemon-display-name  <-- orphan
>
> Note that we do *not* see the pids of our daemon workers in the apache log
> when it shuts down. We only see the pids of non-modwsgi workers, which
> handle server-status et al. So for the output above, only pid 1201 would
> show up with shutdown problems in the httpd log.
>
> This issue has been around for a while. We have observed it here and there
> in the past, but recently it has amplified and is causing resource
> exhaustion, so we're trying to answer 'why now' in addition to 'why'.
>
> Appreciate the help.
>
> On Thursday, November 17, 2016 at 11:20:12 PM UTC-5, Graham Dumpleton wrote:
>>
>> > On 18 Nov 2016, at 2:39 PM, [email protected] wrote:
>> >
>> > Hello,
>> >
>> > We are having an issue using Apache/2.2.15 (Unix) mod_wsgi/3.3
>> > Python/2.7.3 worker MPM/daemon mode, where apache restarts cause daemon
>> > processes to become orphaned (adopt ppid 1 and continue to run app code
>> > but not take http requests).
>> >
>> > Each time the error occurs, we will see something like:
>> > [Thu Nov 17 22:15:00 2016] [warn] child process 23371 still did not exit, sending a SIGTERM
>> > [Thu Nov 17 22:15:02 2016] [warn] child process 23371 still did not exit, sending a SIGTERM
>> > [Thu Nov 17 22:15:04 2016] [warn] child process 23371 still did not exit, sending a SIGTERM
>> > [Thu Nov 17 22:15:06 2016] [error] child process 23371 still did not exit, sending a SIGKILL
>> >
>> > .. where pid 23371 was an httpd worker.
>> >
>> > This causes me to assume that the root worker (initial process spawned
>> > by httpd and owned by root) sends (TERM, TERM, TERM, KILL) to the
>> > worker(s), which then attempt to kill the daemon processes but can't for
>> > some reason, and that causes the worker to not respond to its parent's
>> > requests to die. However, this does not make sense to me, because that
>> > worker runs as the low-privilege apache user, which does not have the
>> > ability to kill our daemon processes (which have a different uid/gid).
>> > We have tried permutations of different users and privileges and
>> > nothing helps.
>> >
>> > We can easily send a TERM to any of the daemon processes manually
>> > (orphaned or not), and they die cleanly in well under the 3 second
>> > window that apache uses. They die, and mod_wsgi emits something to the
>> > httpd log saying they were aborted. It just doesn't happen when httpd
>> > tries to do it.
>> >
>> > We are using C modules, and we have enabled WSGIApplicationGroup
>> > %{GLOBAL}, and as far as we can tell our permissions and vhost
>> > configuration are right. The application works well at runtime.
>> >
>> > In order to continue to debug this, we were hoping to find out exactly
>> > how the daemons are signaled that they should exit. Tracing the daemon
>> > processes with sysdig shows nothing about them getting any signals from
>> > httpd to terminate.
>> >
>> > Any ideas or tips on how to put the pieces together?
>>
>> The signals to shut down should be sent by the Apache root process, which
>> runs as root. There is no way the daemon processes should be able to
>> ignore the SIGKILL. The only way the processes should be able to hang
>> around is if they became zombie processes because they were hung on some
>> resource such as an NFS mount. They will not actually be running in this
>> case, only occupying a slot in the process table and nothing more.
>>
>> Really need to see the output of ‘ps auxwww’ so can see the pids, the
>> relationship to other httpd processes, the process state, and whether it
>> is a zombie (Z).
>>
>> Overall not much can do to help as you are on an ancient Apache/mod_wsgi
>> version. From memory have seen some complaints of something similar
>> before, but they all revolved around the use of Apache 2.2.12-2.2.16.
>> Never seen anything similar since. So have always suspected some strange
>> issue with Apache around that version.
>>
>> Graham
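
For anyone chasing the same symptom: `ps auxwww` does not print the parent pid, so it cannot by itself show which daemon process has been re-parented to pid 1. Below is a minimal sketch, not anything shipped with mod_wsgi, that classifies processes from the text produced by `ps -eo pid,ppid,stat,user,comm`. The helper names (`parse_ps`, `find_suspects`) are hypothetical. The sample text reuses the pids, users and states from the output earlier in the thread, but its PPID column is an assumption made for illustration; the thread only says that orphans adopt ppid 1.

```python
from typing import List, NamedTuple, Tuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    stat: str
    user: str
    command: str


def parse_ps(text: str) -> List[Proc]:
    """Parse the output of `ps -eo pid,ppid,stat,user,comm`."""
    procs = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 4)            # command may contain spaces
        if len(fields) < 5:
            continue                            # ignore malformed rows
        pid, ppid, stat, user, command = fields
        procs.append(Proc(int(pid), int(ppid), stat, user, command.strip()))
    return procs


def find_suspects(procs: List[Proc], daemon_name: str) -> Tuple[List[Proc], List[Proc]]:
    """Return (orphans, zombies).

    Orphans: processes matching daemon_name whose parent is pid 1.
    Zombies: any process whose state starts with 'Z'.
    """
    orphans = [p for p in procs if p.ppid == 1 and daemon_name in p.command]
    zombies = [p for p in procs if p.stat.startswith("Z")]
    return orphans, zombies


# Illustrative input only: the PPID values are assumed, not taken from the thread.
SAMPLE = """\
  PID  PPID STAT USER     COMMAND
 2673     1 Ss   root     httpd.worker
 1201  2673 Sl   apache   httpd.worker
12840  2673 Sl   svcuser  daemon-display-name
23339     1 Sl   svcuser  daemon-display-name
"""

orphans, zombies = find_suspects(parse_ps(SAMPLE), "daemon-display-name")
print("orphans:", [p.pid for p in orphans])
print("zombies:", [p.pid for p in zombies])
```

On a live host the same functions could be fed the real `ps -eo pid,ppid,stat,user,comm` output instead of `SAMPLE`. A daemon that shows up as an orphan in state `Sl` rather than `Z` is still running, which matches what was reported above rather than the zombie case.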
