Oh. I see. Thanks.

One more question.
If I set graceful-timeout on the mod_wsgi daemon processes, will mod_wsgi obey 
that timeout setting when Apache does a graceful restart?


On Friday, October 31, 2014 4:16:55 PM UTC+8, Graham Dumpleton wrote:
>
>
>
> On 31/10/2014, at 6:53 PM, Kelvin Wong <[email protected]> wrote:
>
> You mean that the OS triggered the Apache graceful restart? If so, why would 
> the OS do that?
>
>
> Yes, and I just told you why: as part of a log rotation mechanism.
>
> Try googling 'log file rotation' and see what mechanism is used on your 
> specific Linux system.
>
> One link about one such solution which is often used is:
>
> http://www.cyberciti.biz/faq/how-do-i-rotate-log-files/
>
> Apache itself has a way of handling log rotation, but it is likely ignored by 
> Linux distros, which use their own preferred mechanism.
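>
> For illustration, a typical logrotate snippet on a Debian/Ubuntu system looks 
> something like this (illustrative only; the exact paths and options vary by 
> distro, so check your own /etc/logrotate.d/apache2). The postrotate step is 
> what triggers the graceful restart in the middle of the night:
>
>     /var/log/apache2/*.log {
>         weekly
>         rotate 52
>         compress
>         delaycompress
>         postrotate
>             /etc/init.d/apache2 reload > /dev/null
>         endscript
>     }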
>
> Graham
>
>
> On Friday, October 31, 2014 3:45:35 PM UTC+8, Graham Dumpleton wrote:
>>
>> Your operating system could be signalling Apache during the middle of the 
>> night so that the log files can be rotated.
>>
>> Graham
>>
>> On 31/10/2014, at 6:36 PM, Kelvin Wong <[email protected]> wrote:
>>
>> I checked the memory of the overall Apache processes; it goes up and down a 
>> little. It seems the "maximum-requests" option is doing its job.
>>
>> I found that when I do "sudo service apache reload", Apache does a graceful 
>> restart, as shown in the main Apache error log.
>>
>>
>> [Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27816:tid 
>> 140052910765824] (11)Resource temporarily unavailable: [client 
>> xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon 
>> process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.
>> Just before the situation above happened, Apache was also doing a graceful 
>> restart. But that should not have been triggered by anyone since, as I 
>> mentioned, it happened during sleep time.
>> Is there any reason Apache would do a graceful restart by itself?
>>
>> Thanks.
>>
>>
>> On Friday, October 31, 2014 2:55:32 PM UTC+8, Graham Dumpleton wrote:
>>
>> BTW, in your NR agent configuration file find the 
>> error_collector.ignore_errors setting and change it to:
>>
>> error_collector.ignore_errors = django.http.request:UnreadablePostError
>>
>> These indicate the HTTP client connection dropping before all content had 
>> been sent.
>>
>> Generally you want to ignore these and eliminate them from your error 
>> count as they do not represent an error in your application but are the 
>> result of a client problem.
>>
>> Adding that configuration setting will cause them to be ignored and not 
>> recorded as an error.
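>>
>> For example, in the agent config file (commonly newrelic.ini; assuming the 
>> standard layout, where agent settings live in the [newrelic] section):
>>
>>     [newrelic]
>>     error_collector.ignore_errors = django.http.request:UnreadablePostError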
>>
>> Graham
>>
>> On 31/10/2014, at 5:41 PM, Graham Dumpleton <[email protected]> 
>> wrote:
>>
>>
>> On 31/10/2014, at 5:04 PM, Kelvin Wong <[email protected]> wrote:
>>
>> Thanks for the details!
>>
>> I now have a better picture of how it works.
>> I applied all the recommendations you suggested. I will keep monitoring the 
>> processes.
>>
>> But what about the graceful restart signalled by Apache? Why did it happen?
>>
>>
>> As far as the message:
>>
>> [Tue Oct 28 02:20:40.722140 2014] [wsgi:error] [pid 24158:tid 
>> 140182690981632] (2)No such file or directory: [client 
>> 24.171.250.159:60769] mod_wsgi (pid=24158): Unable to connect to WSGI daemon 
>> process 'snaptee-production-api-ssl' on 
>> '/var/run/apache2/wsgi.30188.7.3.sock'.
>>
>> goes, as I said I believe that could only be the result of an Apache 
>> graceful restart.
>>
>> My recollection is that a 'reload' under a Linux system using its process 
>> management tools triggers an Apache graceful restart, whereas a 'restart' 
>> does a normal Apache restart.
>>
>> I believe you said you used 'reload' and so that would be the source of 
>> that.
>>
>> The whole reason I have been trying to sort out which processes actually had 
>> memory issues was to avoid MaxConnectionsPerChild and to ensure that it 
>> wasn't a source of the problem.
>>
>> Now that we have you on a better configuration, let's see whether it happens 
>> again, rather than worrying too much about what happened in the past. Just 
>> note when you happen to do a 'reload', and understand that if errors are 
>> seen at that time, that was likely the cause.
>>
>> Once things settle down for a while, we can then look at your 
>> processes/threads configuration and capacity utilisation to work out whether 
>> a poor configuration there is causing problems. We can also see whether you 
>> are getting stuck requests that eat into capacity, which could result in 
>> some of the errors you were seeing when any sort of restart was done.
>>
>> So slowly, slowly. Resolve one issue at a time. Changing too many things at 
>> once will only confuse matters, and you will not get a sense of which 
>> changes are actually making things better.
>>
>> Graham
>>
>>
>> Thanks again.
>>
>>
>> On Friday, October 31, 2014 12:29:12 PM UTC+8, Graham Dumpleton wrote:
>>
>> So let's look at just memory initially.
>>
>> When we talk about Apache processes, there are actually going to be three 
>> different classes of processes.
>>
>> These are:
>>
>> 1. The Apache parent process.
>>
>> This runs as root. Because it runs as root, and not as the Apache user 
>> (www-data) that the Apache child worker processes run as, it can be seen 
>> separately in the NR server monitoring.
>>
>> There is no memory increase here. It is actually very important that the 
>> Apache parent process doesn't increase in size over time with restarts, as 
>> when it forks child processes they would inherit that increased memory size. 
>> One of the reasons that using mod_python is a bad idea is that it used to 
>> initialise the Python interpreter in the Apache parent process; when 
>> restarted, the Python interpreter would leak memory, and so the Apache 
>> parent process would grow in size on each restart.
>>
>> Anyway, you are good here and nothing to worry about.
>>
>> 2. The Apache child worker processes.
>>
>> These run as the Apache user which on your system is www-data.
>>
>> The Apache child worker processes are what accept all inbound HTTP 
>> requests. Static files will be served up by these processes.
>>
>> In your case, where you are using mod_wsgi daemon processes, the Python web 
>> application does not run in the Apache child worker processes. The Apache 
>> child worker processes still accept the requests initially, though, and then 
>> proxy them through to the daemon process running the WSGI application.
>>
>> 3. The mod_wsgi daemon processes.
>>
>> By default these also run as the Apache user which on your system is 
>> www-data.
>>
>> This is where your WSGI application is running.
>>
>> Now both the Apache child worker processes and the mod_wsgi daemon 
>> processes are forked children of the Apache parent process. They will 
>> therefore all show as being the process 'apache2' on your system.
>>
>> Because both (2) and (3) are running as the Apache user, NR server 
>> monitoring will not allow you to distinguish between them.
>>
>> Thus when you look at the 'apache2' process in the server monitoring, you 
>> aren't looking at just the Apache child worker processes; you are also 
>> looking at the mod_wsgi daemon processes.
>>
>> The only way you could use NR server monitoring to distinguish between the 
>> two is if you used the 'user' option to WSGIDaemonProcess and ran the 
>> mod_wsgi daemon processes as a different user again. Then you could 
>> distinguish them.
>>
>> Although you can't distinguish them using NR server monitoring, you can on 
>> Linux if you ensure you are naming the processes. That is, when using the 
>> WSGIDaemonProcess directive, supply the display-name=%{GROUP} option. This 
>> will allow you to see the mod_wsgi daemon processes as distinct when using 
>> 'ps' or 'htop'.
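>>
>> A minimal sketch, reusing the directive from your own configuration 
>> ('site-1' is your existing daemon process group name):
>>
>>     WSGIDaemonProcess site-1 display-name=%{GROUP} user=www-data threads=25
>>
>> With that set, the daemon processes should then show up in 'ps' or 'htop' 
>> under a distinct name rather than as 'apache2'.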
>>
>> Let's instead now look at what the mod_wsgi plugin tells you:
>>
>> Over the 24 hours matching the last charts we get nothing conclusive. For a 
>> shorter period of one hour we get:
>>
>> This shows an increase over time, but you have to be a little bit careful 
>> in interpreting that. If you hover over the memory chart and look at the 
>> popup you will see:
>>
>> This shows that the processes actually range in size. There is a range of 
>> sizes because the process churn chart shows that processes are being 
>> restarted at different times.
>>
>> Important though is to look at the max value. This is the largest size of 
>> any process. If one moves along the chart for that period one sees that the 
>> maximum is always about 12.9-13.0 MB.
>>
>> This indicates to me that the Apache child worker process size has reached a 
>> plateau, and that it is the normal behaviour of Apache child processes to 
>> grow over time up to a plateau, as I believe I noted previously.
>>
>> So for the Apache child worker processes this isn't ringing any alarm 
>> bells for me.
>>
>> If you have already removed MaxConnectionsPerChild, and that maximum 
>> continues along without increasing much more over time, then I don't see a 
>> need to set MaxConnectionsPerChild.
>>
>> The next chart to look at is the memory chart from monitoring the actual 
>> Python web application.
>>
>> Over the last 6 hours this shows:
>>
>> I am not 100% sure if you can get to this chart on the Lite subscription 
>> level you are on but I can still see it.
>>
>> If it is available on the Lite subscription, go to the bottom of the 
>> overview dashboard and select the chart icon for the server breakout.
>>
>> So as far as what process is growing in memory size, it is very much the 
>> mod_wsgi daemon processes.
>>
>> The problem therefore is going to be an issue with caching in your 
>> application, construction of object cycles that cannot be broken by the 
>> garbage collector, or, in the worst case, the garbage collector not running 
>> at all because it has deadlocked.
>>
>> As a short term remedy to at least keep the memory growth under control 
>> for the mod_wsgi daemon processes, you have two choices.
>>
>> The first is to use the 'inactivity-timeout' option to WSGIDaemonProcess 
>> directive.
>>
>> The behaviour of this is slightly different between mod_wsgi 3.X and 4.X.
>>
>> Under 4.X what will happen when inactivity-timeout is set is that if no 
>> requests at all are received by a daemon process group, and there are no 
>> active ones, after the specified period, then the process will be restarted.
>>
>> You run up to 150 requests per minute across about 7 process instances.
>>
>>
>> In practice this means that any one process is not even handling one request 
>> per second.
>>
>> Even so, if you set this to 30 seconds, a process is never going to be idle 
>> for long enough to be restarted. We therefore can't use this option.
>>
>> On mod_wsgi 3.X the difference was that the reading of request content 
>> and writing of response content was monitored and if there was no activity 
>> on those after the timeout period, then the process would be restarted. 
>> With the way it worked, it meant if there was a long running request, or 
>> stuck requests, they would be interrupted. In mod_wsgi 4.X, you should be 
>> using request-timeout instead if the intent is to be able to interrupt long 
>> running or stuck requests.
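>>
>> As a sketch, under mod_wsgi 4.X that would look something like the following 
>> (the 60 second value is an illustrative choice, not a recommendation, and 
>> 'site-1' is your existing group name):
>>
>>     WSGIDaemonProcess site-1 user=www-data threads=25 request-timeout=60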
>>
>> The second option you can use instead of inactivity-timeout is the 
>> maximum-requests option to WSGIDaemonProcess.
>>
>> This also has changed a bit in mod_wsgi 4.X from 3.X.
>>
>> Under 3.X once that many requests were processed by a process, the 
>> process would be restarted. In restarting though, it followed the typical 
>> shutdown sequence for mod_wsgi daemon processes.
>>
>> That is, no new requests would be accepted by that process. Currently 
>> running requests would be given a grace period to complete (default 2.5 
>> seconds). If they did not complete in that time, the process would be more 
>> brutally shut down.
>>
>> The result of this was that long running requests could be interrupted, 
>> resulting in some of those errors you saw before.
>>
>> In mod_wsgi 4.X this option has the same default behaviour, but there is 
>> a new way of having the restart be more graceful.
>>
>> What you instead use is both maximum-requests and graceful-timeout to 
>> WSGIDaemonProcess at the same time.
>>
>> When you also have graceful-timeout option specified, what will happen 
>> when the maximum requests is reached is that a new initial grace period is 
>> entered based on what graceful-timeout is set to.
>>
>> During this initial grace period, new requests will still be accepted, 
>> but if at any point during that time there are no active requests, the 
>> process will be restarted immediately.
>>
>> If for some reason the initial grace period expires, then the normal 
>> restart starts. That is, stop accepting requests, provide a further 2.5 
>> seconds grace and then brutally restart.
>>
>> So this new graceful-timeout option in conjunction with maximum-requests 
>> means that a process can be restarted but with a better ability to do so 
>> without interrupting any requests.
>>
>> In your case because there is less than 1 request per second on average 
>> across all processes, there should be a lot of opportunity for a restart to 
>> occur in the gaps in between those infrequent requests.
>>
>> My recommendations at this point are:
>>
>> 1. Set MaxConnectionsPerChild to 0 so that Apache child worker processes 
>> aren't being recycled unnecessarily.
>>
>> Monitor the mod_wsgi plugin chart 'Average Memory Used (bytes/process)' to 
>> ensure that the Apache child worker processes aren't growing in size in an 
>> unbounded way above the typical max you are seeing now. They should simply 
>> plateau after a while. Keep an eye on the max in the popup rather than the 
>> average, as the latter can be deceiving: Apache child worker processes can 
>> still be restarted by Apache because of its dynamic scaling feature, thereby 
>> confusing the results you see.
>>
>> 2. Add to WSGIDaemonProcess the maximum-requests and graceful-timeout 
>> option.
>>
>> As you are running only about 20 requests per minute per process, set 
>> maximum-requests to about 250 initially, which would be about every 10 
>> minutes or so at that rate. If the processes are getting recycled too often 
>> and the startup cost is too much, then increase it. At the same time, keep 
>> an eye on that memory chart in the server breakout on the main dashboard to 
>> ensure memory isn't growing too much before the process gets recycled.
>>
>> At the same time, make sure you set graceful-timeout. A safe figure for this 
>> initially would be 30. That is the number of seconds to wait for an 
>> opportunity to restart while there are no active requests, before a more 
>> brutal restart is forced.
>>
>> 3. If you haven't done it already and you are definitely only using 
>> mod_wsgi daemon mode for running your WSGI application, ensure you are 
>> setting WSGIRestrictEmbedded to On at global scope in the Apache 
>> configuration outside of any VirtualHost.
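>>
>> Putting those three recommendations together, the configuration changes 
>> would look something like this (the group name 'site-1' and the option 
>> values come from the discussion above; adapt them to your own setup):
>>
>>     # Global scope, outside any VirtualHost
>>     WSGIRestrictEmbedded On
>>
>>     <IfModule mpm_worker_module>
>>         MaxConnectionsPerChild 0
>>     </IfModule>
>>
>>     # Inside the VirtualHost for the site
>>     WSGIDaemonProcess site-1 display-name=%{GROUP} user=www-data threads=25 \
>>         maximum-requests=250 graceful-timeout=30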
>>
>> So, as an initial remedial action only, to combat memory growth, do that for 
>> now. This will give you some breathing space so you can work out why your 
>> web application's memory usage is growing continually.
>>
>> When done and when it has been running for 6-12 hours, come back and we 
>> can review how the memory growth charts are looking. I can also get to some 
>> other charts you can't when on Lite subscription, which will show details 
>> about the mod_wsgi daemon process restarts and we can see how far apart 
>> they are coming and whether we need to adjust maximum-requests.
>>
>> As I say though, this is just to get you in a better position while you 
>> work out your memory leakage problems.
>>
>> A next step I would like to do is have you install an extension to the NR 
>> Python agent which allows the activities of the garbage collector to be 
>> monitored to make sure it is running and whether uncollectable objects are 
>> accumulating. You would not normally be able to use this with a Lite 
>> account but I will be able to see the data. Before I can give you that 
>> extension to try I need to wait on some internal approvals to release it, 
>> which are already in train due to us needing to provide it to some other 
>> users who are seeing memory growth issues which we are trying to help them 
>> with.
>>
>> For interests sake, the charts it can produce are:
>>
>> [image: custom-dashboard.jpg]
>> Anyway, work slowly through what I have said above and make sure you 
>> understand it. Get those few configuration changes made and then we will 
>> look at the next steps.
>>
>> Graham
>>
>> On 31/10/2014, at 1:12 PM, Kelvin Wong <[email protected]> wrote:
>>
>> I'm sorry to throw so many questions at you. I just want to provide more 
>> information to make it easier to find the cause.
>> Also, I rushed to you because this situation happened during sleep time, so 
>> I'm a little stressed about it.
>>
>> Back to your question.
>> After following your instructions to set up the mod_wsgi plugin to monitor 
>> the mod_wsgi processes in New Relic, I took a look at the data.
>> It seemed normal to me. Before the 503 situation, "Average Memory Used 
>> (bytes/process)" and "Process Count" did not change much.
>> But looking at the data from "Server", the Apache processes are still 
>> accumulating memory.
>> So, can we conclude that the Apache child worker processes are growing in 
>> memory?
>>
>> Also, as I posted previously, in the main Apache error log:
>> [Wed Oct 29 12:56:26.727197 2014] [mpm_worker:error] [pid 1322:tid 
>> 139958218430336] AH00287: server is within MinSpareThreads of 
>> MaxRequestWorkers, consider raising the MaxRequestWorkers setting
>> [Wed Oct 29 12:56:30.730902 2014] [mpm_worker:error] [pid 1322:tid 
>> 139958218430336] AH00286: server reached MaxRequestWorkers setting, 
>> consider raising the MaxRequestWorkers setting
>>
>> I changed the config accordingly, but the 503 situation still happened.
>> <IfModule mpm_worker_module>
>>         StartServers                 2
>>         MinSpareThreads             25
>>         MaxSpareThreads             75
>>         ThreadLimit                 64
>>         ThreadsPerChild             25
>>         MaxRequestWorkers          400
>>         MaxConnectionsPerChild       0
>> </IfModule>
>>
>>
>> Thanks.
>>
>>
>> On Friday, October 31, 2014 3:30:58 AM UTC+8, Graham Dumpleton wrote:
>>
>> Before I can help you, you need to start answering my questions. If you 
>> pepper me with email after email on different things and don't answer the 
>> questions I ask, then we will get nowhere. When I ask questions, it is 
>> because that information is likely key to my being able to understand the 
>> problem and help you.
>>
>> The main question I put to you was about how you are determining which 
>> processes are growing in memory.
>>
>> It needs to be properly determined whether the processes which were 
>> growing in memory were Apache child worker processes or mod_wsgi daemon 
>> processes.
>>
>> The reason for this is that if they are in fact mod_wsgi daemon processes, 
>> then you are possibly in part causing the problem yourself by setting 
>> MaxConnectionsPerChild. That setting has nothing to do with the mod_wsgi 
>> daemon processes, and by setting it you are creating the conditions which 
>> cause the sorts of errors you are seeing.
>>
>> So, which processes are growing in memory?
>>
>> Graham 
>>
>> On 31/10/2014, at 12:40 AM, Kelvin Wong <[email protected]> wrote:
>>
>> I found in the main Apache log that, when it is occurring, Apache is doing a 
>> graceful restart.
>> How can I avoid Apache graceful restarts?
>>
>> Thanks.
>>
>>
>> On Tuesday, October 28, 2014 1:30:06 PM UTC+8, Graham Dumpleton wrote:
>>
>> Would suggest upgrading to mod_wsgi 4.3.0 if you can as the error 
>> messages when there are communication problems between Apache child worker 
>> process and mod_wsgi daemon process have been improved.
>>
>> More comments below.
>>
>> On 28 October 2014 15:43, Kelvin Wong <[email protected]> wrote:
>>
>> Hi Graham and everyone else
>>
>> I'm running multiple sites on Django 1.6.7, Apache/2.4.7 (Ubuntu 14.04), 
>> OpenSSL/1.0.1f, mod_wsgi/4.2.5, Python/2.7.6, Server MPM: worker.
>> I found that the server started returning 504 and then 503, and the 
>> following errors showed up.
>> I researched some issues related to this, and even added "WSGISocketPrefix 
>> /var/run/apache2/wsgi", but the issue still occurred.
>> I have no idea why it happened. Can anyone give some directions on this 
>> issue?
>>
>> Thanks!
>>
>> apache error log
>> [Sun Oct 26 07:34:34.732934 2014] [wsgi:error] [pid 29268:tid 
>> 140053011478272] [client xx.xxx.xxx.xxx:xxxxx] Timeout when reading response 
>> headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
>> [Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27816:tid 
>> 140052910765824] (11)Resource temporarily unavailable: [client 
>> xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon 
>> process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.
>>
>>
>> This one can occur when the mod_wsgi daemon process crashes. There should 
>> be a segmentation fault error message or similar in the main Apache error 
>> log (not VirtualHost specific log).
>>
>> It can also occur if there are incomplete requests still running when a 
>> mod_wsgi daemon process is shut down on being restarted, due to the WSGI 
>> script file being touched or Apache being restarted. In the latter case, the 
>> mod_wsgi daemon process would have had to have been killed off by Apache 
>> before the Apache child worker process which was proxying to it had been. 
>> This can especially be the case if an Apache graceful restart was being 
>> done. 
>>  
>>
>> occasionally
>> [Tue Oct 28 02:20:40.722140 2014] [wsgi:error] [pid 24158:tid 
>> 140182690981632] (2)No such file or directory: [client 
>> 24.171.250.159:60769] mod_wsgi (pid=24158): Unable to connect to WSGI daemon 
>> process 'snaptee-production-api-ssl' on 
>> '/var/run/apache2/wsgi.30188.7.3.sock'.
>>
>>
>> This can also be due to an Apache graceful restart being done while there 
>> were keep-alive connections being handled from an HTTP client. In an Apache 
>> graceful restart, because of how Apache handles the mod_wsgi daemon 
>> processes, they don't have a graceful shutdown in the same way as the Apache 
>> child worker processes.
>>
>> So what happens is the following:
>>
>> 1. Apache graceful restart is triggered.
>> 2. Apache parent process sends SIGUSR1 to Apache child worker process to 
>> signal graceful shutdown.
>> 3. Apache parent process sends SIGINT to mod_wsgi daemon processes to 
>> signal shutdown.
>> 4. The mod_wsgi daemon processes complete their requests and restart. In the 
>> next incarnation of the mod_wsgi daemon processes after an Apache restart, 
>> they expect a different path for the proxy socket, with the number at the 
>> end increasing based on the Apache generation number.
>> 5. Because the Apache child worker process was in graceful restart mode, it 
>> operates on the understanding that it can keep handling requests on a 
>> keep-alive socket connection from an HTTP client until there are no more. It 
>> therefore takes the next request on the same connection and tries to connect 
>> to the mod_wsgi daemon process, but using the proxy socket name as it was 
>> before. That name has changed for the next Apache configuration generation 
>> and no longer exists, so the connection fails.
>>
>> The name of the proxy socket changes across Apache restarts because 
>> otherwise you could have Apache child worker processes under an old 
>> configuration sending requests to a mod_wsgi daemon process using the new 
>> configuration, which could cause problems, including security issues. There 
>> are therefore specific protections in place to ensure that only Apache child 
>> worker processes and mod_wsgi daemon processes created against the same 
>> Apache configuration generation can talk to each other.
>>  
>>
>> wsgi config for that site
>> WSGIDaemonProcess site-1 display-name=site-1 user=www-data threads=25 \
>>     python-path=/home/ubuntu/site-1/django:/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages
>>
>> ...
>
>  -- 
> You received this message because you are subscribed to the Google Groups 
> "modwsgi" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/modwsgi.
> For more options, visit https://groups.google.com/d/optout.
>
>

