Okay, here is the history of this issue. I have seen reports of it probably less than a dozen times that I recollect. If more, it would not be many more.
Although it is mod_wsgi that triggers the shutdown of the daemon process, it is process management code within the Apache Runtime Library that deals with the consequences of the child processing disappearing. The only thing mod_wsgi provides is some simple callback code to be notified when a process has died and depending on the circumstances restart it. The only case one shouldn't be restarted to replace one which has died is if Apache is being restarted or shutdown. These two recent reports are the first time I have seen the issue brought up for a long time. Previously, based on the fact that reports of the problem went away, it was suspected that it was some Apache issue that was fixed around Apache version 2.2.16/2.2.17 due to a change in the version of the Apache Runtime Library being used at the time. Michael, in your case you say you are using Apache 2.2.3 and Greg you said you were using stock Debian version. Debian has a history of not keeping up with the latest versions of software, so depending on what Debian version you are using, you could also be using an old version of Apache. So that is about all I can tell you at the moment. That is, that I would suspect that it is going to be due to an older version of Apache being used. The issue just hasn't been seen since about Apache 2.2.16/2.2.17. Sorry about the slow replies. Life is confusing right now and finding time and motivation to keep on top of the few mod_wsgi questions that do come up is proving challenging. Graham On 4 January 2013 10:06, Michael Bayer <[email protected]> wrote: > We're seeing behavior that also might be this same issue. This is again > a low-use server with a maximum-requests of 1000, which is set that way > since we were observing slow memory leaks in one application. We're on > Apache 2.2.3, I'm not seeing here what version of mod_wsgi we're on (unless > there's some way to divine that from the .so file). > > The symptom is, the server serves its last request, and the next one is a > 500 server error. The app server doesn't seem to play any role at that > point and the error log has the error "Premature end of script headers: > ". It's not clear to me without asking the admins some more if the server > no longer writes any more logs after that, as well as whether or not the > daemon process is still present in the ps listing. > > > > > On Tuesday, October 2, 2012 1:40:16 AM UTC-4, Graham Dumpleton wrote: > >> Sorry for the slow reply, been on long holiday. >> >> Why are you using maximum-requests and inactivity-timeout? >> >> I would always recommended against using these options if there is no >> real requirement as unnecessary restarts should always be avoided >> because the startup costs of loading up fat Python web applications >> can significantly impact server performance if your server is under >> constant load. >> >> IOW, the last thing you want to happen if you are under loading, >> especially a load spike, is to restart the application due to hitting >> maximum-requests. >> >> Anyway, I don't recollect specifically a case where daemon process >> wouldn't come back after being killed off. Are you absolutely sure, >> from running 'ps' and checking process IDs that they didn't exist in >> any state in the process table? Are the process perhaps still there >> and haven't actually shutdown properly? >> >> The message: >> >> [Fri Sep 21 09:10:23 2012] [info] mod_wsgi (pid=3845): Aborting process >> 'XXusername-omittedXX'. >> >> indicates there certainly wasn't a clean shutdown, possibly because of >> a threading dead lock or hung request thread. However, when you get >> that message there is a C level call to exit() immediately after so >> the process should die. >> >> Technically calling exit() doesn't absolutely guarantee shutdown as a >> C level atexit() handler could block but neither Apache or mod_wsgi >> should be setting up any of those. >> >> The problem with maximum-requests and inactivity-timeout is that they >> are a self initiated restart and a graceful one at that. There is >> therefore no absolute failsafe if exit() doesn't work as there would >> if it was an Apache restart. In the case of an Apache restart, the >> Apache parent process will send a SIGKILL to the child processes if >> they haven't exited after 3 seconds. For a self initiated restart, the >> parent process doesn't know it is restarting so if something prevents >> the process from properly dying, then it will not be notified and >> replace the process. >> >> In summary, validate for sure whether the daemon process exist still >> or not in the process table and in what state. >> >> If they do still exist, try and use gdb as described in: >> >> http://code.google.com/p/**modwsgi/wiki/**DebuggingTechniques#Debugging_* >> *Crashes_With_GDB<http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Debugging_Crashes_With_GDB> >> >> to work out where they are blocked. >> >> That all said, don't recollect seeing this specific issue with >> maximum-requests, have seen similar outcome of processes not handling >> requests in the past. Generally this was always because of application >> level code blocking exhausting threads. There were some unexplained >> cases, but this seemed to vanish with no changes to mod_wsgi. There >> was suspicion though at the time that for some reason the distro >> Apache was causing problems. From memory this was for Apache in patch >> version range around 15-17 of 2.2 and you are using 16. Is there a >> reason why you can't upgrade to a newer Apache version than what you >> are using? >> >> Graham >> >> On 24 September 2012 10:19, Greg Lowe <[email protected]> wrote: >> > I've recently run into some stability issues with three django sites >> that we >> > host. These are running on Apache and Mod_wsgi, and are low traffic >> sites. >> > I’m more concerned with reliability than maximising throughput. >> > >> > I’ve included a summary of what has been observed so far below. I would >> > greatly appreciate any advice on how to diagnose and/or fix this >> problem. >> > >> > The sites are configured to use daemon processes with a >> maximum-requests >> > setting. Apache is running in prefork mode. i.e. WSGIDaemonProcess >> > usernamehere user=usernamehere group=usernamehere processes=2 >> threads=25 >> > maximum-requests=100 inactivity-timeout=60 display-name=%{GROUP} >> > >> > We’ve experienced a number of outages where both of the daemon >> processes >> > disappear, and are not recreated by mod_wsgi. Restarting Apache causes >> them >> > to be restarted correctly. >> > >> > Occasionally I’ve seen the sites being served from only a single daemon >> > process although it’s configured to use two. An apache restart fixes >> this. >> > >> > Our server admin has managed to reproduce the problem on our test >> > environment by running apache bench at around 100 requests per second >> for >> > ~70k requests. (He’s also done this with INFO level logging turned on - >> see >> > the log snippets at the end of the email.) >> > >> > These are the typical errors seen in the error log from an outage: >> > >> > Many of these: Script timed out before returning headers: django.wsgi >> > >> > Many of these: Premature end of script headers: django.wsgi >> > >> > Occasionally one of these: (4)Interrupted system call: mod_wsgi >> (pid=19990): >> > Unable to connect to WSGI daemon process 'usernamehere’' on >> > '/var/run/apache2/wsgi.25099.**19.3.sock' after multiple attempts. >> > >> > Occasionally one of these: (2)No such file or directory: mod_wsgi >> > (pid=10877): Unable to connect to WSGI daemon process 'usernamehere' on >> > '/var/run/apache2/wsgi.5194.**16.3.sock' after multiple attempts. >> > >> > >> > One of the sites has been running for about a year, the others only a >> few >> > months. These problems first started a few weeks ago. >> > >> > The sites were migrated from a server running an older version of >> Debian a >> > little over a month ago. There was a few weeks in between the migration >> and >> > the problems appearing. So I’m fairly confident that we can attribute >> these >> > problems to the newer version of mod_wsgi, but due to the timing I >> can’t be >> > certain. >> > >> > Old server versions: >> > >> > Debian: Lenny >> > >> > Apache: apache2 2.2.9-10+lenny8, apache-mpm-prefork 2.2.9-10+lenny8 >> > >> > mod_wsgi: libapache2-mod-wsgi 2.5-1~lenny1 >> > >> > Django: 1.3.1 >> > >> > >> > Current server versions: >> > >> > Debian: Squeeze >> > >> > Apache: apache2 2.2.16-6+squeeze7, apache2-mpm-prefork >> 2.2.16-6+squeeze7 >> > >> > Mod_wsgi: libapache2-mod-wsgi 3.3-2 >> > >> > Django: 1.3.1 >> > >> > >> > Our system admin has reproduced the problem on our test environment >> with >> > INFO level logging turned on. Here are his comments: >> > >> > I set the apache log level for the test site and ab'd it to death. >> > >> > The pertinent log entries are: >> > >> > [Fri Sep 21 09:10:22 2012] [info] Initial (No.1) HTTPS request received >> > for child 70 (server XXXXXomittedXXXXXX:443) >> > [Fri Sep 21 09:10:22 2012] [info] mod_wsgi (pid=3845): Maximum requests >> > reached 'XXusername-omittedXX'. >> > [Fri Sep 21 09:10:22 2012] [info] mod_wsgi (pid=3845): Shutdown >> > requested 'XXusername-omittedXX'. >> > [Fri Sep 21 09:10:22 2012] [info] Initial (No.1) HTTPS request received >> > for child 27 (server XXXXXomittedXXXXXX:443) >> > .... >> > [Fri Sep 21 09:10:23 2012] [info] Initial (No.1) HTTPS request received >> > for child 28 (server XXXXXomittedXXXXXX:443) >> > [Fri Sep 21 09:10:23 2012] [info] mod_wsgi (pid=3845): Aborting process >> > 'XXusername-omittedXX'. >> > [Fri Sep 21 09:10:23 2012] [error] [client 192.168.146.19] Premature >> end >> > of script headers: django.wsgi >> > [Fri Sep 21 09:10:23 2012] [info] Initial (No.1) HTTPS request received >> > for child 85 (server XXXXXomittedXXXXXX:443) >> > .... >> > [Fri Sep 21 09:10:23 2012] [info] Initial (No.1) HTTPS request received >> > for child 46 (server XXXXXomittedXXXXXX:443) >> > [Fri Sep 21 09:15:20 2012] [error] [client 192.168.146.19] Script timed >> > out before returning headers: django.wsgi >> > [Fri Sep 21 09:15:20 2012] [error] [client 192.168.146.19] Script timed >> > out before returning headers: django.wsgi >> > >> > So it seems that there is an unclean shutdown after wsgi's max requests >> > count is reached. >> > >> > A normal shutdown looks like this: >> > >> > [Fri Sep 21 09:10:17 2012] [info] Initial (No.1) HTTPS request received >> > for child 73 (server XXXXXomittedXXXXXX:443) >> > [Fri Sep 21 09:10:17 2012] [info] mod_wsgi (pid=3691): Maximum requests >> > reached 'XXusername-omittedXX'. >> > [Fri Sep 21 09:10:17 2012] [info] mod_wsgi (pid=3691): Shutdown >> > requested 'XXusername-omittedXX'. >> > [Fri Sep 21 09:10:17 2012] [info] Initial (No.1) HTTPS request received >> > for child 79 (server XXXXXomittedXXXXXX:443) >> > .... >> > [Fri Sep 21 09:10:18 2012] [info] Initial (No.1) HTTPS request received >> > for child 27 (server XXXXXomittedXXXXXX:443) >> > [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Stopping process >> > 'XXusername-omittedXX'. >> > [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Destroying >> > interpreters. >> > [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Destroy >> > interpreter 'XXXXXomittedXXXXXX|'. >> > [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Cleanup >> > interpreter ''. >> > [Fri Sep 21 09:10:18 2012] [info] Initial (No.1) HTTPS request received >> > for child 61 (server XXXXXomittedXXXXXX:443) >> > .... >> > >> > [Fri Sep 21 09:10:18 2012] [info] Initial (No.1) HTTPS request received >> > for child 19 (server XXXXXomittedXXXXXX:443) >> > [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Terminating >> Python. >> > [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Python has >> shutdown. >> > [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3845): Attach >> > interpreter ''. >> > [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3845): Create >> > interpreter 'XXXXXomittedXXXXXX|'. >> > [Fri Sep 21 09:10:18 2012] [info] [client 192.168.146.19] mod_wsgi >> > (pid=3845, process='XXusername-omittedXX'**, >> > application='**XXXXXomittedXXXXXX|'): Loading WSGI script >> > '/home/XXusername-omittedXX/**scripts/django.wsgi'. >> > [Fri Sep 21 09:10:19 2012] [info] Initial (No.1) HTTPS request received >> > for child 85 (server XXXXXomittedXXXXXX:443) >> > >> > >> > Thanks for making all the way through this long post ;) >> > >> > -- >> > You received this message because you are subscribed to the Google >> Groups >> > "modwsgi" group. >> > To view this discussion on the web visit >> > https://groups.google.com/d/**msg/modwsgi/-/NSg2A39f7DYJ<https://groups.google.com/d/msg/modwsgi/-/NSg2A39f7DYJ>. >> >> > To post to this group, send email to [email protected]. >> > To unsubscribe from this group, send email to >> > modwsgi+u...@**googlegroups.com. >> > For more options, visit this group at >> > http://groups.google.com/**group/modwsgi?hl=en<http://groups.google.com/group/modwsgi?hl=en>. >> >> > -- > You received this message because you are subscribed to the Google Groups > "modwsgi" group. > To view this discussion on the web visit > https://groups.google.com/d/msg/modwsgi/-/Lc4lrAOyyvwJ. > > To post to this group, send email to [email protected]. > To unsubscribe from this group, send email to > [email protected]. > For more options, visit this group at > http://groups.google.com/group/modwsgi?hl=en. > -- You received this message because you are subscribed to the Google Groups "modwsgi" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/modwsgi?hl=en.
