Ninja edit: while the restart problem occurs 100% of the time under load,
with reduced concurrency (threads=1) it reproduced only about 1 in 10
restarts. Rate-limiting traffic before it reaches the daemons also helped
somewhat, bringing the failure rate down to roughly 80%.
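
To measure how often a restart leaves orphans behind (for example, the roughly 1-in-10 rate with threads=1), a check like the following could be run after each restart. It is a minimal Python sketch that assumes Linux with /proc. `svcuser` and `daemon-display-name` are the placeholder names from the ps output below, and `parse_proc_stat` and `find_orphans` are hypothetical helpers. It also reports each orphan's state, so a zombie (Z) or a process blocked in uninterruptible sleep (D) would show up.

```python
import os
import pwd

# Placeholder names copied from the ps output in this thread; substitute the
# real daemon user and WSGIDaemonProcess display-name (assumption).
DAEMON_USER = "svcuser"
DISPLAY_NAME = "daemon-display-name"


def parse_proc_stat(text):
    """Parse /proc/<pid>/stat contents into (pid, comm, state, ppid).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of on whitespace.
    """
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    fields = text[close_paren + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def find_orphans(user=DAEMON_USER, name=DISPLAY_NAME, proc="/proc"):
    """Return sorted (pid, state) pairs for processes owned by `user` (a user
    name or numeric uid) whose argv[0] starts with `name` and whose parent is
    PID 1, i.e. processes that were re-parented to init.

    State 'S' or 'R' means the process is still alive; 'Z' is a zombie and
    'D' is uninterruptible sleep (e.g. blocked on NFS).
    """
    uid = user if isinstance(user, int) else pwd.getpwnam(user).pw_uid
    orphans = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip /proc/self, /proc/meminfo, etc.
        base = os.path.join(proc, entry)
        try:
            if os.stat(base).st_uid != uid:
                continue
            with open(os.path.join(base, "cmdline"), "rb") as f:
                argv0 = f.read().split(b"\0")[0].decode(errors="replace")
            with open(os.path.join(base, "stat")) as f:
                pid, _comm, state, ppid = parse_proc_stat(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the process exited while we were scanning
        if ppid == 1 and argv0.startswith(name):
            orphans.append((pid, state))
    return sorted(orphans)
```

A result such as `[(23339, 'S')]` would correspond to the orphan in the ps output below (ps shows `Sl`: sleeping and multithreaded), that is, a live process rather than a zombie.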



On Friday, November 18, 2016 at 7:13:35 AM UTC-5, [email protected] 
wrote:
>
> Thanks Graham.
>
> They look pretty normal:
>
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root      2673  0.0  0.0  61272  3276 ?        Ss   04:09   0:01 /usr/sbin/httpd.worker
> apache    1201  0.1  0.0 739448  8668 ?        Sl   06:32   0:01 /usr/sbin/httpd.worker
> svcuser  12840  0.0  0.0 455436 22876 ?        Sl   05:19   0:03 daemon-display-name
> svcuser  23339  0.0  0.0 237320  5392 ?        Sl   Nov17   0:00 daemon-display-name  <-- orphan
>
> Note that we do *not* see the PIDs of our daemon processes in the Apache
> log when it shuts down. We only see the PIDs of the non-mod_wsgi worker
> processes that handle server-status and the like. So for the output above,
> the httpd log would show shutdown problems only for PID 1201.
>
> This issue has been around for a while; we have observed it occasionally in
> the past, but recently it has become much more frequent and is causing
> resource exhaustion, so we're trying to answer 'why now?' in addition to
> 'why?'.
>
>
> Appreciate the help.
>
>
> On Thursday, November 17, 2016 at 11:20:12 PM UTC-5, Graham Dumpleton 
> wrote:
>>
>>
>> > On 18 Nov 2016, at 2:39 PM, [email protected] wrote: 
>> > 
>> > Hello, 
>> > 
>> > We are having an issue with Apache/2.2.15 (Unix), mod_wsgi/3.3 and
>> > Python/2.7.3, using the worker MPM with mod_wsgi in daemon mode: Apache
>> > restarts cause daemon processes to become orphaned (they are re-parented
>> > to PID 1 and continue to run application code, but no longer accept HTTP
>> > requests).
>> > 
>> > Each time the error occurs, we see something like:
>> > [Thu Nov 17 22:15:00 2016] [warn] child process 23371 still did not exit, sending a SIGTERM
>> > [Thu Nov 17 22:15:02 2016] [warn] child process 23371 still did not exit, sending a SIGTERM
>> > [Thu Nov 17 22:15:04 2016] [warn] child process 23371 still did not exit, sending a SIGTERM
>> > [Thu Nov 17 22:15:06 2016] [error] child process 23371 still did not exit, sending a SIGKILL
>> >
>> > ...where PID 23371 was an httpd worker process.
>> > 
>> > This leads me to assume that the root process (the initial httpd process,
>> > owned by root) sends TERM, TERM, TERM, then KILL to the worker(s), which
>> > in turn try to kill the daemon processes but for some reason cannot, and
>> > so fail to respond to their parent's requests to exit. However, this does
>> > not make sense to me, because the worker runs as the low-privilege apache
>> > user, which cannot signal our daemon processes (they run under a different
>> > uid/gid). We have tried various combinations of users and privileges, and
>> > nothing helps.
>> > 
>> > We can easily send a TERM to any of the daemon processes manually
>> > (orphaned or not), and they exit cleanly in well under the 3-second window
>> > that Apache uses; mod_wsgi then writes a message to the httpd log saying
>> > they were aborted. It just doesn't happen when httpd tries to do it.
>> > 
>> > Our application uses C extension modules, and we have set
>> > WSGIApplicationGroup %{GLOBAL}; as far as we can tell, our permissions and
>> > vhost configuration are correct. The application works well at runtime.
>> > 
>> > To continue debugging this, we were hoping to find out exactly how the
>> > daemon processes are signalled to exit. Tracing them with sysdig shows no
>> > sign of them receiving any termination signal from httpd.
>> > 
>> > Any ideas or tips on how to put the pieces together? 
>>
>> The signals to shut down should be sent by the Apache root process, which
>> runs as root. There is no way the daemon processes should be able to ignore
>> the SIGKILL. The only ways the processes should be able to hang around are
>> if they are blocked on some resource such as an NFS mount (uninterruptible
>> sleep, state D), or if they have become zombie processes (state Z). A zombie
>> is not actually running; it only occupies a slot in the process table and
>> nothing more.
>>
>> I really need to see the output of ‘ps auxwww’ so I can see the PIDs, their
>> relationship to the other httpd processes, the process state, and whether
>> any are zombies (Z).
>>
>> Overall, there is not much I can do to help, as you are on an ancient
>> Apache/mod_wsgi version. From memory, I have seen complaints about
>> something similar before, but they all involved the use of Apache
>> 2.2.12-2.2.16. I have never seen anything similar since, so I have always
>> suspected some strange issue with Apache around those versions.
>>
>> Graham 
>>
>>
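
To compare how the daemon processes react when signalled outside of Apache, the TERM, TERM, TERM, KILL sequence from the error log can be imitated by hand. The following is a minimal Python sketch that assumes Linux, because it reads /proc to treat zombies as exited. `is_alive` and `terminate_with_escalation` are hypothetical helpers, and the 2-second default interval is taken from the log timestamps. This is not how Apache or mod_wsgi actually implement shutdown.

```python
import os
import signal
import time


def is_alive(pid):
    """True if `pid` exists and is not a zombie (Linux-only: reads /proc)."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        pass  # the process exists but is owned by a user we may not signal
    try:
        with open(f"/proc/{pid}/stat") as f:
            state = f.read().rsplit(")", 1)[1].split()[0]
    except (FileNotFoundError, ProcessLookupError):
        return False
    return state != "Z"  # a zombie has exited and only awaits reaping


def terminate_with_escalation(pid, term_attempts=3, interval=2.0, poll=0.05):
    """Send SIGTERM up to `term_attempts` times, `interval` seconds apart,
    then SIGKILL, mirroring the TERM, TERM, TERM, KILL pattern in the httpd
    error log. Returns the name of the last signal sent before the process
    was seen to exit, or None if it had already exited.

    os.kill raises PermissionError if the caller may not signal `pid`
    (e.g. the apache user signalling a process with a different uid).
    """
    last_sent = None
    for sig in [signal.SIGTERM] * term_attempts + [signal.SIGKILL]:
        if not is_alive(pid):
            return last_sent
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return last_sent  # exited between the check and the kill
        last_sent = sig.name
        deadline = time.monotonic() + interval
        while time.monotonic() < deadline:
            if not is_alive(pid):
                return last_sent
            time.sleep(poll)
    raise TimeoutError(f"pid {pid} survived SIGKILL (possibly blocked in state D)")
```

For example, running `terminate_with_escalation(23339)` as root against the orphaned daemon from the ps output would show whether it exits on the first SIGTERM, as the manual tests described above suggest, or only after SIGKILL.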

-- 
You received this message because you are subscribed to the Google Groups 
"modwsgi" group.
Visit this group at https://groups.google.com/group/modwsgi.