Re: [modwsgi] mod_wsgi returning 503 service unavailable

Graham Dumpleton Sun, 02 Nov 2014 19:00:28 -0800

And when is 'cron' configured to run logrotate? Does it correspond to when you 
are seeing the issue?


On 03/11/2014, at 1:56 PM, Kelvin Wong <[email protected]> wrote:

> Yes, the memory usage is under control now and I increased maximum-requests 
> by 4 times.
> 
> For the error, 
> Timeout when reading response headers from daemon process 'site-1': 
> /home/ubuntu/site-1/apache/wsgi.py
> came out as sudden, that's why I capture this message only. I also found main 
> Apache error log nothing special.
> 
> Now, I turned the log level to 'info' in order to obtain more messages.
> 
> About the log file rotation, I only found the ubuntu default rotation one 
> which is using logrotate.
> /var/log/apache2/*.log {
>         weekly
>         missingok
>         rotate 52
>         compress
>         delaycompress
>         notifempty
>         create 640 root adm
>         sharedscripts
>         postrotate
>                 if /etc/init.d/apache2 status > /dev/null ; then \
>                     /etc/init.d/apache2 reload > /dev/null; \
>                 fi;
>         endscript
>         prerotate
>                 if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
>                         run-parts /etc/logrotate.d/httpd-prerotate; \
>                 fi; \
>         endscript
> }
> It will do Apache reload on postrotate. I research a little bit, the reload 
> is needed for Apache handling file.
> Some suggested "copytruncate" on the config to avoid reload. Will you also 
> suggest that?
> 
> 
> Thanks.
> 
> 
> On Monday, November 3, 2014 8:00:10 AM UTC+8, Graham Dumpleton wrote:
> 
> On 02/11/2014, at 2:18 AM, Kelvin Wong <[email protected]> wrote:
> 
> 
> I have some updates on the issue as it happened again. This time I captured 
> some data from New Relic.
> 
> From the Requests.png, I found that there is an spike aroung 22:23 and check 
> the Apache log. I found there are many as following.
> Timeout when reading response headers from daemon process 'site-1': 
> /home/ubuntu/site-1/apache/wsgi.py
> 
> Spike also found on the Workers.png during that time period. The graph is 
> showing most of the worker is doing write?
> After a few minute, the following log appear again, and start returning 503.
> (11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi 
> (pid=27816): Unable to connect to WSGI daemon process 'site-1' on 
> '/var/run/apache2/wsgi.17227.2.3.sock'.
> 
> Can you tell what happened from those graphs?
> 
> What I need to see is more of the log file around those events.
> 
> Ensure you have LogLevel set to 'info' in the Apache configuration as 
> mod_wsgi will then log more messages about why processes restarted.
> 
> Also ensure you are checking for message around the time in the global Apache 
> error log and not just in any virtual host specific error log, as some 
> messages, such as evidence of a restart or process crashes will only appear 
> in the main Apache error log.
> 
> With memory growth under control:
> 
> 
> 
> turning to look at daemon process restarts:
> 
> 
> 
> they are obviously more frequent. Zooming in, each process is restarting 
> about every 5 minutes, although varies over time.
> 
> I would suggest you can increase maximum-requests, possibly as high as 4 
> times what you are using.
> 
> When enabling 'info' logging that will reduce the amount of noise in the 
> error logs which may hide the specific anomalous event you are trying to 
> track down.
> 
> BTW. Did you look into whether any log file rotation is set up or not and 
> what did you find?
> 
> Depending on the system you might be able to find whether something is 
> running by looking at one cron jobs are running for root.
> 
>     sudo crontab -l
> 
> Graham
> 
> 
> Thanks.
> 
> 
> On Friday, October 31, 2014 5:08:27 PM UTC+8, Graham Dumpleton wrote:
> 
> On 31/10/2014, at 7:26 PM, Kelvin Wong <[email protected]> wrote:
> 
> Oh. I see. Thanks.
> 
> One more question.
> As I set the graceful-timeout on mod_wsgi process, then when Apache do 
> graceful restart, will mod_wsgi obey that timeout setting?
> 
> No. It is impossible. There is simply no way for the mod_wsgi daemon 
> processes to act in coordination with the Apache child worker processes on an 
> Apache graceful restart.
> 
> This is because the only way that mod_wsgi can create its distinct child 
> processes and have them managed by Apache at all is using a mechanism that in 
> the Apache runtime library is referred to as a 'other processes'.
> 
> It doesn't matter whether Apache as a whole is signalled for a normal 
> shutdown (SIGHUP) or a graceful restart (SIGUSR1), Apache will in turn always 
> send a SIGINT on to 'other processes' such as those created by mod_wsgi. If 
> those 'other processes' don't shutdown immediately it will signal them again 
> at one second intervals until 3 seconds have elapsed, at which point it will 
> send them a SIGKILL. So Apache only supports a brutal shutdown of that sort 
> for 'other processes'.
> 
> For its own child worker processes, to even support a graceful restart is a 
> non trivial thing. It isn't just a matter of sending the child worker 
> processes a SIGUSR1. Instead there is an elaborate scheme involving what is 
> called a pipe of death between the Apache parent and the child worker 
> processes, as well as shared state via shared memory between the Apache 
> parent process so the child worker processes now what sort of shutdown has 
> been triggered. That sort of mechanism to support a graceful shutdown for 
> 'other processes' simply does exist.
> 
> So it is how it is and there is no way to change things. In practice it 
> hasn't caused any issues and web requests being interrupted is always a risk 
> that a web application has to deal with anyway.
> 
> Graham
> 
> On Friday, October 31, 2014 4:16:55 PM UTC+8, Graham Dumpleton wrote:
> 
> 
> On 31/10/2014, at 6:53 PM, Kelvin Wong <[email protected]> wrote:
> 
> You means that the OS triggered Apache graceful restart? If so, why the OS do 
> that?
> 
> Yes and I just told you why, as part of a log rotation mechanism.
> 
> Try googling 'log file rotation' and see what mechanism is used on your 
> specific Linux system.
> 
> One link about one such solution which is often used is:
> 
> http://www.cyberciti.biz/faq/how-do-i-rotate-log-files/
> 
> Apache does itself have a way of handling log rotation itself, but likely 
> ignored by Linux distros which use their own preferred mechanism.
> 
> Graham
> 
> 
> On Friday, October 31, 2014 3:45:35 PM UTC+8, Graham Dumpleton wrote:
> Your operating system could be signalling Apache during the middle of the 
> night so that the log files can be rotated.
> 
> Graham
> 
> On 31/10/2014, at 6:36 PM, Kelvin Wong <[email protected]> wrote:
> 
> I checked the memory of overall Apache process, it goes up and down a little. 
> It Seemed the "maximum-requests" is doing its job.
> 
> I found that when I do "sudo service apache reload", Apache will do graceful 
> restart as it shown in main Apache error log.
> 
> 
> [Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27816:tid 
> 140052910765824] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.
> When before above situation happened, Apache also doing graceful restart.
> But this should not be triggered by anyone as I mentioned it happened during 
> sleep time.
> Is there any reasons that Apache do graceful restart by itself?
> 
> Thanks.
> 
> 
> On Friday, October 31, 2014 2:55:32 PM UTC+8, Graham Dumpleton wrote:
> BTW, in your NR agent configuration file find the 
> error_collector.ignore_errors setting and change it to:
> 
> error_collector.ignore_errors = django.http.request:UnreadablePostError
> 
> These indicate the HTTP client connection dropping before all content had 
> been sent.
> 
> Generally you want to ignore these and eliminate them from your error count 
> as they do not represent an error in your application but are the result of a 
> client problem.
> 
> Adding that configuration setting will cause them to be ignored and not 
> recorded as an error.
> 
> Graham
> 
> On 31/10/2014, at 5:41 PM, Graham Dumpleton <[email protected]> wrote:
> 
> 
> On 31/10/2014, at 5:04 PM, Kelvin Wong <[email protected]> wrote:
> 
> Thanks for the details!
> 
> I now have a better picture on how it works.
> I employed all the recommendations you suggested. I will keep monitoring the 
> processes.
> 
> But how about the graceful restart signaled by Apache? Why it happened?
> 
> As far as the message:
> 
> [Tue Oct 28 02:20:40.722140 2014] [wsgi:error] [pid 24158:tid 
> 140182690981632] (2)No such file or directory: [client 24.171.250.159:60769] 
> mod_wsgi (pid=24158): Unable to connect to WSGI daemon process 
> 'snaptee-production-api-ssl' on'/var/run/apache2/wsgi.30188.7.3.sock'.
> 
> goes, as I said I believe that could only be the result of an Apache graceful 
> restart.
> 
> My recollection is that a 'reload' under a Linux system using its process 
> management tools triggers an Apache graceful restart where as a 'restart' 
> does a normal Apache restart.
> 
> I believe you said you used 'reload' and so that would be the source of that.
> 
> The whole reason though I have been trying to sort out what processes 
> actually had memory issues though was to avoid MaxConnectionsPerChild as well 
> to ensure that that wasn't a source of the problem as well.
> 
> Now that we have you on a better configuration lets see now whether it 
> happens again, rather than worrying too much about what happened in the past. 
> Just note when you happen to do a 'reload', and so understand that if seen at 
> that time that that was likely the cause.
> 
> Once things settle down for a while then can look at your processes/threads 
> configuration and capacity utilisation and so work out whether a poor 
> configuration there is causing problems. Can also see whether you are getting 
> stuck requests as well which are eating into capacity and which could result 
> in some of the errors you were seeing when any sort of restart was done.
> 
> So slowly slowly. Resolve one issue at a time. Changing too many things at 
> once will only confuse and you will not get a sense of what changes are 
> actually making things better otherwise.
> 
> Graham
> 
> 
> Thanks again.
> 
> 
> On Friday, October 31, 2014 12:29:12 PM UTC+8, Graham Dumpleton wrote:
> So lets look at just memory initially.
> 
> When we talk about Apache processes, there are actually going to be three 
> different classes of processes.
> 
> These are:
> 
> 1. The Apache parent process.
> 
> This runs as root. Because it runs as root and not the Apache user (www-data) 
> that the Apache child worker processes run as, then it can be seen separately 
> on the NR server monitoring.
> 
> 
> There is no memory increase here. It is actually very important that the 
> Apache parent process doesn't increase in size over time with restarts as 
> when it forks child process they would inherit that increasing memory size. 
> One of the reason that using mod_python is a bad idea is that it used to 
> initialise the Python interpreter in the Apache parent process and when 
> restarted, the Python interpreter would leak memory and so the Apache parent 
> process would grow in size on each restart.
> 
> Anyway, you are good here and nothing to worry about.
> 
> 2. The Apache child worker processes.
> 
> These run as the Apache user which on your system is www-data.
> 
> The Apache child worker processes are what accepts all inbound HTTP requests. 
> Static files will be served up by these processes.
> 
> In your case where you are using mod_wsgi daemon processes, the Python web 
> application does not run in the Apache child worker processes. The Apache 
> child worker process still accept the requests initially though and would 
> then proxy the request through to the daemon process running the WSGI 
> application.
> 
> 3. The mod_wsgi daemon processes.
> 
> By default these also run as the Apache user which on your system is www-data.
> 
> This is where your WSGI application is running.
> 
> Now both the Apache child worker processes and the mod_wsgi daemon processes 
> are forked children of the Apache parent process. They will therefore all 
> show as being the process 'apache2' on your system.
> 
> Because both (1) and (2) are running as the Apache user, NR server monitoring 
> will not allow you to distinguish between them.
> 
> Thus when you get:
> 
> 
> you aren't looking at just the Apache child worker processes and are also 
> looking at the mod_wsgi daemon processes.
> 
> The only way you could use NR server monitoring to tell between the two is if 
> you used the 'user' option to WSGIDaemonProcess and ran the mod_wsgi daemon 
> processes as a different user again. Then you could distinguish.
> 
> Although you can't distinguish using NR server monitoring, you can on Linux 
> if you ensure you are naming the processes. That is, when using 
> WSGIDaemonProcess directive, supply the display-name=%{GROUP} option. This 
> will allow you to see the mod_wsgi daemon processes as distinct if using 'ps' 
> or 'htop'.
> 
> Lets instead now look at what the mod_wsgi plugin tells you:
> 
> Over the 24 hours matching last charts we get:
> 
> 
> so nothing conclusive. For a shorter period of one hour we get:
> 
> 
> This shows an increase over time, but you have to be a little bit careful in 
> interpreting that. If you hover over the memory chart and look at the popup 
> you will see:
> 
> 
> This shows that the processes actually range in size. There is a range of 
> sizes because the process churn chart shows that processes are being 
> restarted at different times.
> 
> Important though is to look at the max value. This is the largest size of any 
> process. If one moves along the chart for that period one sees that the 
> maximum is always about 12.9-13.0 MB.
> 
> This indicates to me that the Apache child worker process size has reached a 
> plateau and that this is the normal behaviour of Apache child process to grow 
> over time up to a plateau which I believe I noted previously.
> 
> So for the Apache child worker processes this isn't ringing any alarm bells 
> for me.
> 
> If you have already removed MaxConnectionsPerChild and that maximum continues 
> along and not increasing too much more over time, then I wouldn't be seeing a 
> need to be setting MaxConnectionsPerChild.
> 
> Next chart to look at is the memory chart where monitoring the actual Python 
> web application.
> 
> Over the last 6 hours this shows:
> 
> 
> I am not 100% sure if you can get to this chart on the Lite subscription 
> level you are on but I can still see it.
> 
> If it is available on Lite subscription, go to the bottom of the overview 
> dashboard and select on the chart icon for the server breakout.
> 
> 
> So as far as what process is growing in memory size, it is very much the 
> mod_wsgi daemon processes.
> 
> The problem therefore is going to be an issue with caching in your 
> application, construction of object cycles that cannot be broken by the 
> garbage collector, or in worst case the garbage collector isn't even running 
> due to it dead locking.
> 
> As a short term remedy to at least keep the memory growth under control for 
> the mod_wsgi daemon processes, you have two choices.
> 
> The first is to use the 'inactivity-timeout' option to WSGIDaemonProcess 
> directive.
> 
> The behaviour of this is slightly different between mod_wsgi 3.X and 4.X.
> 
> Under 4.X what will happen when inactivity-timeout is set is that if no 
> requests at all are received by a daemon process group, and there are no 
> active ones, after the specified period, then the process will be restarted.
> 
> You run up to 150 requests per minute across about 7 process instances.
> 
> 
> 
> In practice this means that any one process is not even handling one request 
> per second.
> 
> Even so, what it means is that if you set this to be 30 seconds, then a 
> process is never going to be idle long enough to be restarted. Can't 
> therefore use this.
> 
> On mod_wsgi 3.X the difference was that the reading of request content and 
> writing of response content was monitored and if there was no activity on 
> those after the timeout period, then the process would be restarted. With the 
> way it worked, it meant if there was a long running request, or stuck 
> requests, they would be interrupted. In mod_wsgi 4.X, you should be using 
> request-timeout instead if the intent is to be able to interrupt long running 
> or stuck requests.
> 
> The second option you can use instead of inactivity-timeout is the 
> maximum-requests option to WSGIDaemonProcess.
> 
> This also has changed a bit in mod_wsgi 4.X from 3.X.
> 
> Under 3.X once that many requests were processed by a process, the process 
> would be restarted. In restarting though, it followed the typical shutdown 
> sequence for mod_wsgi daemon processes.
> 
> That is, no new requests would be accepted on that process. Currently running 
> requests would be given a grace period to complete (default 2.5 seconds). If 
> they did not complete in that time, the process would be more brutally 
> shutdown.
> 
> The result of this was that long running requests could be interrupted, 
> resulting in some of those errors you saw before.
> 
> In mod_wsgi 4.X this option has the same default behaviour, but there is a 
> new way of having the restart be more graceful.
> 
> What you instead use is both maximum-requests and graceful-timeout to 
> WSGIDaemonProcess at the same time.
> 
> When you also have graceful-timeout option specified, what will happen when 
> the maximum requests is reached is that a new initial grace period is entered 
> based on what graceful-timeout is set to.
> 
> During this initial grace period, new requests will still be accepted, but if 
> at any point during that time there are no active requests, the process will 
> be restarted immediately.
> 
> If for some reason the initial grace period expires, then the normal restart 
> starts. That is, stop accepting requests, provide a further 2.5 seconds grace 
> and then brutally restart.
> 
> So this new graceful-timeout option in conjunction with maximum-requests 
> means that a process can be restarted but with a better ability to do so 
> without interrupting any requests.
> 
> In your case because there is less than 1 request per second on average 
> across all processes, there should be a lot of opportunity for a restart to 
> occur in the gaps in between those infrequent requests.
> 
> My recommendations at this point are:
> 
> 1. Set MaxConnectionsPerChild to 0 so that Apache child worker processes 
> aren't being recycled unnecessarily.
> 
> Monitor the mod_wsgi plugin chart 'Average Memory Useds (bytes/process)' to 
> ensure that the Apache child worker processes aren't growing in size in an 
> unbounded way above the typical max you are seeing now. They should simply 
> plateau after a while. Keep an eye on max in the popup rather than the 
> average as the latter can be deceiving as Apache child worker processes can 
> still be restarted by Apache because of its dynamic scaling feature, thereby 
> confusing the results you see.
> 
> 2. Add to WSGIDaemonProcess the maximum-requests and graceful-timeout option.
> 
> As you are running only about 20 requests per minute for process, set 
> maximum-requests at about 250 initially, which would be about every 10 
> minutes or so at that rate. If the processes are getting recycled too often 
> and the startup cost is too much, then increase it. At the same time though 
> keep an eye on that memory chart in the server breakout on the main dashboard 
> to ensure memory not growing too much before the process gets recycled.
> 
> At the same time, make sure you set graceful-timeout. A safe figure for this 
> initially would be 30. That is the number of seconds to wait to see if can 
> restart when no requests, before a more brutal restart.
> 
> 3. If you haven't done it already and you are definitely only using mod_wsgi 
> daemon mode for running your WSGI application, ensure you are setting 
> WSGIRestrictEmbedded to On at global scope in the Apache configuration 
> outside of any VirtualHost.
> 
> So as an initial remedial action only, to combat memory growth, do that for 
> now. This will give you some breathing space so you can work out why your web 
> application is growing memory usage continually.
> 
> When done and when it has been running for 6-12 hours, come back and we can 
> review how the memory growth charts are looking. I can also get to some other 
> charts you can't when on Lite subscription, which will show details about the 
> mod_wsgi daemon process restarts and we can see how far apart they are coming 
> and whether we need to adjust maximum-requests.
> 
> As I say though, this is just to get you in a better position while you work 
> out your memory leakage problems.
> 
> A next step I would like to do is have you install an extension to the NR 
> Python agent which allows the activities of the garbage collector to be 
> monitored to make sure it is running and whether uncollectable objects are 
> accumulating. You would not normally be able to use this with a Lite account 
> but I will be able to see the data. Before I can give you that extension to 
> try I need to wait on some internal approvals to release it, which are 
> already in train due to us needing to provide it to some other users who are 
> seeing memory growth issues which we are trying to help them with.
> 
> For interests sake, the charts it can produce are:
> 
> 
> Anyway, work slowy through what I have said above and understand it. Get 
> those few configuration changes made and then we will look at the next steps.
> 
> Graham
> 
> On 31/10/2014, at 1:12 PM, Kelvin Wong <[email protected]> wrote:
> 
> I'm sorry to throw too much questions to you. I just want to provide more 
> information to make it easier to find the cause.
> Also, I rushed to you because this situation happened during sleep time. So 
> I'm little stressed about it.
> 
> Back to your question.
> After following your instructions to setup mod_wsgi plugin to monitor the 
> mod_wsgi processes in New Relic, I take a look on the data.
> I found that it seemed normal to me. Before the 503 situation, "Average 
> Memory Used (bytes/process)" and "Process Count" do not have great changes.
> But looking data from "Server", the Apache processes still accumulating.
> So, can we conclude that Apache child worker processes growing in memory?
> 
> Also, as I post previously, the main Apache error log, 
> [Wed Oct 29 12:56:26.727197 2014] [mpm_worker:error] [pid 1322:tid 
> 139958218430336] AH00287: server is within MinSpareThreads of 
> MaxRequestWorkers, consider raising the MaxRequestWorkers setting
> [Wed Oct 29 12:56:30.730902 2014] [mpm_worker:error] [pid 1322:tid 
> 139958218430336] AH00286: server reached MaxRequestWorkers setting, consider 
> raising the MaxRequestWorkers setting
> 
> I changed the config accordingly, the 503 situation still happened.
> <IfModule mpm_worker_module>
>         StartServers                 2
>         MinSpareThreads             25
>         MaxSpareThreads             75
>         ThreadLimit                 64
>         ThreadsPerChild             25
>         MaxRequestWorkers          400
>         MaxConnectionsPerChild       0
> </IfModule>
> 
> 
> Thanks.
> 
> 
> On Friday, October 31, 2014 3:30:58 AM UTC+8, Graham Dumpleton wrote:
> Before I can help you, you need to start answering my questions. If you 
> pepper me with email after email on different things and don't answer the 
> questions I ask you, then we will get no where. When I ask questions it is 
> because that information is likely key to me being able to understand the 
> problem and help you.
> 
> The main question I put to you was about how you are determining which 
> processes are growing in memory.
> 
> It needs to be properly determined whether the processes which were growing 
> in memory were Apache child worker processes or mod_wsgi daemon processes.
> 
> The reason for this is that if they are in fact mod_wsgi daemon processes, 
> then you are possibly in part causing the problem yourself by setting 
> MaxConnectionsPerChild. That setting has got nothing to do with the mod_wsgi 
> daemon processes and by setting it, you are creating the conditions which 
> cause the sorts of errors you are seeing.
> 
> So, which processes are growing in memory?
> 
> Graham 
> 
> On 31/10/2014, at 12:40 AM, Kelvin Wong <[email protected]> wrote:
> 
> I found that on the main Apache log, when it is occurring, Apache is doing 
> graceful restart.
> How can I avoid Apache graceful restart?
> 
> Thanks.
> 
> 
> On Tuesday, October 28, 2014 1:30:06 PM UTC+8, Graham Dumpleton wrote:
> Would suggest upgrading to mod_wsgi 4.3.0 if you can as the error messages 
> when there are communication problems between Apache child worker process and 
> mod_wsgi daemon process have been improved.
> 
> More comments below.
> 
> On 28 October 2014 15:43, Kelvin Wong <[email protected]> wrote:
> Hi Graham and everyone else
> 
> I'm running multiple site on Django 1.6.7, Apache/2.4.7 (Ubuntu 14.04), 
> OpenSSL/1.0.1f, mod_wsgi/4.2.5, Python/2.7.6, Server MPM: worker.
> I found that the server start returning 504 and then 503, and the following 
> error shown up.
> I researched some issues related with it, even added "WSGISocketPrefix 
> /var/run/apache2/wsgi", but the issue still occured.
> I have no idea why it happened. Can anyone give some directions on this issue?
> 
> Thanks!
> 
> apache error log
> [Sun Oct 26 07:34:34.732934 2014] [wsgi:error] [pid 29268:tid 
> 140053011478272] [client xx.xxx.xxx.xxx:xxxxx] Timeout when reading response 
> headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
> [Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27
> ...
> 
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "modwsgi" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/modwsgi.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"modwsgi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/modwsgi.
For more options, visit https://groups.google.com/d/optout.

Re: [modwsgi] mod_wsgi returning 503 service unavailable

Reply via email to