I tried some load testing on the server to see if the problem is caused by overloading.
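(A minimal sketch of the kind of load test meant here, assuming the ApacheBench tool that ships with Apache; the URL, request count, and concurrency below are placeholders, not the values actually used:)

    # Hypothetical load test: 10000 requests, 100 concurrently, against one endpoint.
    ab -n 10000 -c 100 http://localhost/some-endpoint/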
I found that some additional log messages appear when the server is overloaded:

2014-11-03 07:47:01.591 [Mon Nov 03 07:46:53.795864 2014] [wsgi:error] [pid 3724:tid 140372768429824] [client xx.xx.xx.xx:26561] Truncated or oversized response headers received from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.

Since I did not see those additional messages during the original incidents, I concluded that the situation is not caused by overloading the server.

Do you have any suggestions on how I might reproduce the situation? Thanks.

On Monday, November 3, 2014 11:09:33 AM UTC+8, Kelvin Wong wrote:

> I found no cron jobs running for root or my user.

> According to the logrotate config, the logs rotate weekly, but the error situation seems to happen nightly.

> So I don't think that is the cause.

> On Monday, November 3, 2014 11:00:10 AM UTC+8, Graham Dumpleton wrote:

>> And when is 'cron' configured to run logrotate? Does it correspond to when you are seeing the issue?

>> On 03/11/2014, at 1:56 PM, Kelvin Wong <[email protected]> wrote:

>> Yes, the memory usage is under control now and I increased maximum-requests by 4 times.

>> For the error:
>> Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
>> it came out suddenly, which is why I captured only this message. I also found nothing special in the main Apache error log.

>> Now I have turned the log level to 'info' in order to obtain more messages.

>> About log file rotation, I only found the Ubuntu default rotation, which uses logrotate:

>> /var/log/apache2/*.log {
>>     weekly
>>     missingok
>>     rotate 52
>>     compress
>>     delaycompress
>>     notifempty
>>     create 640 root adm
>>     sharedscripts
>>     postrotate
>>         if /etc/init.d/apache2 status > /dev/null ; then \
>>             /etc/init.d/apache2 reload > /dev/null; \
>>         fi;
>>     endscript
>>     prerotate
>>         if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
>>             run-parts /etc/logrotate.d/httpd-prerotate; \
>>         fi; \
>>     endscript
>> }

>> It does an Apache reload in postrotate. I researched a little; the reload is needed so that Apache reopens its log files. Some suggested using "copytruncate" in the config to avoid the reload (a sketch of that variant appears below). Would you also suggest that?

>> Thanks.

>> On Monday, November 3, 2014 8:00:10 AM UTC+8, Graham Dumpleton wrote:

>> On 02/11/2014, at 2:18 AM, Kelvin Wong <[email protected]> wrote:

>> I have some updates on the issue, as it happened again. This time I captured some data from New Relic.

>> From Requests.png, I found that there is a spike around 22:23, so I checked the Apache log. I found many entries like the following:
>> Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py

>> A spike is also visible in Workers.png during that time period. The graph shows most of the workers doing writes?

>> After a few minutes, the following log entry appeared again, and the server started returning 503:
>> (11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.

>> Can you tell what happened from those graphs?

>> What I need to see is more of the log file around those events.
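(A minimal sketch of the "copytruncate" variant asked about above, assuming the same Ubuntu log paths; copytruncate copies and then truncates the live log file in place, so no Apache reload is needed, at the cost of possibly losing a few lines written during the copy:)

    /var/log/apache2/*.log {
        weekly
        missingok
        rotate 52
        compress
        delaycompress
        notifempty
        copytruncate
    }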
>> Ensure you have LogLevel set to 'info' in the Apache configuration, as mod_wsgi will then log more messages about why processes restarted.

>> Also ensure you are checking for messages around that time in the global Apache error log and not just in any virtual-host-specific error log, as some messages, such as evidence of a restart or of process crashes, will only appear in the main Apache error log.

>> With memory growth under control:

>> [chart]

>> turning to look at daemon process restarts:

>> [chart]

>> they are obviously more frequent. Zooming in, each process is restarting about every 5 minutes, although that varies over time.

>> I would suggest you increase maximum-requests, possibly as high as 4 times what you are using.

>> With 'info' logging enabled, that will also reduce the amount of noise in the error logs which may otherwise hide the specific anomalous event you are trying to track down.

>> BTW, did you look into whether any log file rotation is set up, and what did you find?

>> Depending on the system, you might be able to find out whether something is running by looking at what cron jobs are set up for root:

>> sudo crontab -l

>> Graham

>> Thanks.

>> On Friday, October 31, 2014 5:08:27 PM UTC+8, Graham Dumpleton wrote:

>> On 31/10/2014, at 7:26 PM, Kelvin Wong <[email protected]> wrote:

>> Oh. I see. Thanks.

>> One more question. As I have set graceful-timeout on the mod_wsgi processes, when Apache does a graceful restart, will mod_wsgi obey that timeout setting?

>> No. It is impossible. There is simply no way for the mod_wsgi daemon processes to act in coordination with the Apache child worker processes on an Apache graceful restart.

>> This is because the only way that mod_wsgi can create its distinct child processes and have them managed by Apache at all is by using a mechanism that the Apache runtime library refers to as 'other processes'.

>> It doesn't matter whether Apache as a whole is signalled for a normal restart (SIGHUP) or a graceful restart (SIGUSR1); Apache will in turn always send a SIGINT on to 'other processes' such as those created by mod_wsgi. If those 'other processes' don't shut down immediately, it will signal them again at one-second intervals until 3 seconds have elapsed, at which point it will send them a SIGKILL. So Apache only supports a brutal shutdown of that sort for 'other processes'.

>> For its own child worker processes, even supporting a graceful restart is a non-trivial thing. It isn't just a matter of sending the child worker processes a SIGUSR1. Instead there is an elaborate scheme involving what is called a 'pipe of death' between the Apache parent and the child worker processes, as well as state shared via shared memory with the Apache parent process, so that the child worker processes know what sort of shutdown has been triggered. That sort of mechanism to support a graceful shutdown for 'other processes' simply does not exist.

>> So it is how it is, and there is no way to change things. In practice it hasn't caused any issues, and web requests being interrupted is always a risk that a web application has to deal with anyway.

>> Graham

>> On Friday, October 31, 2014 4:16:55 PM UTC+8, Graham Dumpleton wrote:

>> On 31/10/2014, at 6:53 PM, Kelvin Wong <[email protected]> wrote:

>> You mean that the OS triggered the Apache graceful restart? If so, why would the OS do that?
>> Yes, and I just told you why: as part of a log rotation mechanism.

>> Try googling 'log file rotation' and see what mechanism is used on your specific Linux system.

>> One link about one such solution which is often used is:

>> http://www.cyberciti.biz/faq/how-do-i-rotate-log-files/

>> Apache does itself have a way of handling log rotation, but it is likely ignored by Linux distros, which use their own preferred mechanism.

>> Graham

>> On Friday, October 31, 2014 3:45:35 PM UTC+8, Graham Dumpleton wrote:

>> Your operating system could be signalling Apache during the middle of the night so that the log files can be rotated.

>> Graham

>> On 31/10/2014, at 6:36 PM, Kelvin Wong <[email protected]> wrote:

>> I checked the memory of the Apache processes overall; it goes up and down a little. It seemed "maximum-requests" is doing its job.

>> I found that when I do "sudo service apache2 reload", Apache does a graceful restart, as shown in the main Apache error log.

>> [Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27816:tid 140052910765824] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.

>> Before the above situation happened, Apache was also doing a graceful restart. But that one should not have been triggered by anyone, as I mentioned it happened during sleep hours. Is there any reason Apache would do a graceful restart by itself?

>> Thanks.

>> On Friday, October 31, 2014 2:55:32 PM UTC+8, Graham Dumpleton wrote:

>> BTW, in your NR agent configuration file find the error_collector.ignore_errors setting and change it to:

>> error_collector.ignore_errors = django.http.request:UnreadablePostError

>> These errors indicate the HTTP client connection dropping before all content had been sent.

>> Generally you want to ignore these and eliminate them from your error count, as they do not represent an error in your application but are the result of a client problem.

>> Adding that configuration setting will cause them to be ignored and not recorded as an error (a sketch of where the setting lives appears below).

>> Graham

>> On 31/10/2014, at 5:41 PM, Graham Dumpleton <[email protected]> wrote:

>> On 31/10/2014, at 5:04 PM, Kelvin Wong <[email protected]> wrote:

>> Thanks for the details!

>> I now have a better picture of how it works. I employed all the recommendations you suggested. I will keep monitoring the processes.

>> But how about the graceful restart signalled by Apache? Why did it happen?

>> As far as the message:

>> [Tue Oct 28 02:20:40.722140 2014] [wsgi:error] [pid 24158:tid 140182690981632] (2)No such file or directory: [client 24.171.250.159:60769] mod_wsgi (pid=24158): Unable to connect to WSGI daemon process 'snaptee-production-api-ssl' on '/var/run/apache2/wsgi.30188.7.3.sock'.

>> goes, as I said, I believe that could only be the result of an Apache graceful restart.

>> My recollection is that a 'reload' under a Linux system using its process management tools triggers an Apache graceful restart, whereas a 'restart' does a normal Apache restart.

>> I believe you said you used 'reload', and so that would be the source of that.

>> The whole reason I have been trying to sort out which processes actually had memory issues, though, was also to avoid MaxConnectionsPerChild, to ensure that it wasn't a source of the problem as well.
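(A minimal sketch of where that setting lives, assuming the standard newrelic.ini layout generated by the agent; only the relevant lines are shown:)

    # newrelic.ini (hypothetical excerpt; the [newrelic] section already
    # exists in a generated agent config file)
    [newrelic]
    # Don't record client disconnects as application errors.
    error_collector.ignore_errors = django.http.request:UnreadablePostError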
>> Now that we have you on a better configuration, let's see whether it happens again, rather than worrying too much about what happened in the past. Just note when you happen to do a 'reload', and understand that if the error is seen at that time, the reload was likely the cause.

>> Once things settle down for a while, we can then look at your processes/threads configuration and capacity utilisation, and work out whether a poor configuration there is causing problems. We can also see whether you are getting stuck requests which are eating into capacity and which could result in some of the errors you were seeing when any sort of restart was done.

>> So slowly, slowly. Resolve one issue at a time. Changing too many things at once will only confuse matters, and you will not get a sense of which changes are actually making things better.

>> Graham

>> Thanks again.

>> On Friday, October 31, 2014 12:29:12 PM UTC+8, Graham Dumpleton wrote:

>> So let's look at just memory initially.

>> When we talk about Apache processes, there are actually three different classes of processes. These are:

>> 1. The Apache parent process.

>> This runs as root. Because it runs as root and not as the Apache user (www-data) that the Apache child worker processes run as, it can be seen separately in the NR server monitoring.

>> There is no memory increase here. It is actually very important that the Apache parent process doesn't increase in size over time with restarts, as when it forks child processes they would inherit that increased memory size. One of the reasons that using mod_python is a bad idea is that it used to initialise the Python interpreter in the Apache parent process; when restarted, the Python interpreter would leak memory and so the Apache parent process would grow in size on each restart.

>> Anyway, you are good here and there is nothing to worry about.

>> 2. The Apache child worker processes.

>> These run as the Apache user, which on your system is www-data.

>> The Apache child worker processes are what accept all inbound HTTP requests. Static files will be served up by these processes.

>> In your case, where you are using mod_wsgi daemon processes, the Python web application does not run in the Apache child worker processes. The Apache child worker processes still accept the requests initially, though, and then proxy each request through to the daemon process running the WSGI application.

>> 3. The mod_wsgi daemon processes.

>> By default these also run as the Apache user, which on your system is www-data.

>> This is where your WSGI application is running.

>> Now, both the Apache child worker processes and the mod_wsgi daemon processes are forked children of the Apache parent process. They will therefore all show as the process 'apache2' on your system.

>> Because both (2) and (3) are running as the Apache user, NR server monitoring will not allow you to distinguish between them.

>> Thus when you look at:

>> [chart]

>> you aren't looking at just the Apache child worker processes; you are also looking at the mod_wsgi daemon processes.

>> The only way you could use NR server monitoring to tell the two apart is if you used the 'user' option to WSGIDaemonProcess and ran the mod_wsgi daemon processes as a different user again. Then you could distinguish.
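(A minimal sketch of that 'user' option; the dedicated account name 'site1-wsgi' is hypothetical and would need to be created first:)

    # Run the mod_wsgi daemon processes as a separate (hypothetical) user so
    # NR server monitoring can report their memory apart from www-data.
    WSGIDaemonProcess site-1 user=site1-wsgi group=site1-wsgi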
>> Although you can't distinguish them using NR server monitoring, you can on Linux if you ensure you are naming the processes. That is, when using the WSGIDaemonProcess directive, supply the display-name=%{GROUP} option. This will let you see the mod_wsgi daemon processes as distinct when using 'ps' or 'htop'.

>> Let's instead now look at what the mod_wsgi plugin tells you.

>> Over the 24 hours matching the last charts we get:

>> [chart]

>> so nothing conclusive. For a shorter period of one hour we get:

>> [chart]

>> This shows an increase over time, but you have to be a little bit careful in interpreting that. If you hover over the memory chart and look at the popup you will see:

>> [chart]

>> This shows that the processes actually range in size. There is a range of sizes because the process churn chart shows that processes are being restarted at different times.

>> Important, though, is to look at the max value. This is the largest size of any process. If one moves along the chart for that period, one sees that the maximum is always about 12.9-13.0 MB.

>> This indicates to me that the Apache child worker process size has reached a plateau, and that it is the normal behaviour of Apache child processes to grow over time up to a plateau, as I believe I noted previously.

>> So for the Apache child worker processes this isn't ringing any alarm bells for me.

>> If you have already removed MaxConnectionsPerChild and that maximum continues along without increasing too much more over time, then I wouldn't see a need to set MaxConnectionsPerChild.

>> The next chart to look at is the memory chart from monitoring the actual Python web application.

>> Over the last 6 hours this shows:

>> [chart]

>> I am not 100% sure whether you can get to this chart on the Lite subscription level you are on, but I can still see it.

>> If it is available on the Lite subscription, go to the bottom of the overview dashboard and select the chart icon for the server breakout.

>> So as far as which processes are growing in memory size goes, it is very much the mod_wsgi daemon processes.

>> The problem therefore is going to be an issue with caching in your application, construction of object cycles that cannot be broken by the garbage collector, or, in the worst case, the garbage collector not even running due to it deadlocking.

>> As a short-term remedy to at least keep the memory growth under control for the mod_wsgi daemon processes, you have two choices.

>> The first is to use the 'inactivity-timeout' option to the WSGIDaemonProcess directive.

>> The behaviour of this is slightly different between mod_wsgi 3.X and 4.X.

>> Under 4.X, what happens when inactivity-timeout is set is that if no requests at all are received by a daemon process group, and there are no active ones, for the specified period, then the process will be restarted.

>> You run up to 150 requests per minute across about 7 process instances. In practice this means that any one process is not even handling one request per second.

>> Even so, what it means is that if you set this to 30 seconds, a process is never going to be idle long enough to be restarted. You therefore can't use this.

>> On mod_wsgi 3.X, the difference was that the reading of request content and the writing of response content were monitored, and if there was no activity on those after the timeout period, then the process would be restarted.
>> With the way that worked, it meant that if there was a long-running request, or stuck requests, they would be interrupted. In mod_wsgi 4.X you should be using request-timeout instead if the intent is to be able to interrupt long-running or stuck requests.

>> The second option you can use instead of inactivity-timeout is the maximum-requests option to WSGIDaemonProcess.

>> This has also changed a bit in mod_wsgi 4.X from 3.X.

>> Under 3.X, once that many requests had been processed by a process, the process would be restarted. In restarting, though, it followed the typical shutdown sequence for mod_wsgi daemon processes.

>> That is, no new requests would be accepted by that process. Currently running requests would be given a grace period to complete (default 2.5 seconds). If they did not complete in that time, the process would be more brutally shut down.

>> The result of this was that long-running requests could be interrupted, resulting in some of those errors you saw before.

>> In mod_wsgi 4.X this option has the same default behaviour, but there is a new way of making the restart more graceful.

>> What you use instead is both the maximum-requests and graceful-timeout options to WSGIDaemonProcess at the same time.

>> When you also have the graceful-timeout option specified, what happens when the maximum number of requests is reached is that an initial grace period is entered, based on what graceful-timeout is set to.

>> During this initial grace period, new requests will still be accepted, but if at any point during that time there are no active requests, the process will be restarted immediately.

>> If for some reason the initial grace period expires, then the normal restart sequence starts. That is, stop accepting requests, provide a further 2.5 seconds of grace, and then brutally restart.

>> So this new graceful-timeout option, in conjunction with maximum-requests, means that a process can be restarted with a much better chance of doing so without interrupting any requests.

>> In your case, because there is less than 1 request per second on average across all processes, there should be plenty of opportunity for a restart to occur in the gaps between those infrequent requests.

>> My recommendations at this point are:

>> 1. Set MaxConnectionsPerChild to 0 so that Apache child worker processes aren't being recycled unnecessarily.

>> Monitor the mod_wsgi plugin chart 'Average Memory Used (bytes/process)' to ensure that the Apache child worker processes aren't growing in size in an unbounded way above the typical max you are seeing now. They should simply plateau after a while. Keep an eye on the max in the popup rather than the average, as the latter can be deceiving: Apache child worker processes can still be restarted by Apache because of its dynamic scaling feature, thereby confusing the results you see.

>> 2. Add to WSGIDaemonProcess the maximum-requests and graceful-timeout options (a sketch follows below).

>> As you are running only about 20 requests per minute per process, set maximum-requests to about 250 initially, which would mean a restart about every 10 minutes or so at that rate. If the processes are getting recycled too often and the startup cost is too much, then increase it. At the same time, keep an eye on that memory chart in the server breakout on the main dashboard to ensure memory is not growing too much before the process gets recycled.
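(A minimal sketch of recommendation 2, assuming roughly the 7 process instances mentioned earlier; the thread count is illustrative only, and the graceful-timeout value is explained just below:)

    # Recycle each daemon process after about 250 requests, allowing up to 30
    # seconds to find a moment with no active requests before a brutal restart.
    # The processes/threads values are illustrative, not prescribed.
    WSGIDaemonProcess site-1 processes=7 threads=15 \
        maximum-requests=250 graceful-timeout=30 display-name=%{GROUP}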
>> At the same time, make sure you set graceful-timeout. A safe figure for this initially would be 30. That is the number of seconds to wait, to see whether the process can restart while there are no requests, before a more brutal restart is done.

>> 3. If you haven't done it already, and you are definitely only using mod_wsgi daemon mode for running your WSGI application, ensure you are setting WSGIRestrictEmbedded to On at global scope in the Apache configuration, outside of any VirtualHost.

>> So, as an initial remedial action only, to combat the memory growth, do that for now. This will give you some breathing space so you can work out why your web application's memory usage is growing continually.

>> When that is done and has been running for 6-12 hours, come back and we can review how the memory growth charts are looking. I can also get to some other charts that you can't while on the Lite subscription, which will show details about the mod_wsgi daemon process restarts, so we can see how far apart they are and whether we need to adjust maximum-requests.

>> As I say, though, this is just to get you in a better position while you work out your memory leakage problems.

>> A next step I would like to take is to have you install an extension to the NR Python agent which allows the activities of the garbage collector to be monitored, to make sure it is running and to see whether uncollectable objects are accumulating. You would not normally be able to use this with a Lite account, but I will be able to see the data. Before I can give you that extension to try I need to wait on some internal approvals to release it, which are already in train due to us needing to provide it to some other users who are seeing memory growth issues that we are trying to help them with.

>> For interest's sake, the charts it can produce are:

>> [image: custom-dashboard.jpg]

>> Anyway, work slowly through what I have said above and understand it. Get those few configuration changes made and then we will look at the next steps.

>> Graham

>> On 31/10/2014, at 1:12 PM, Kelvin Wong <[email protected]> wrote:

>> I'm sorry to throw so many questions at you. I just want to provide more information to make it easier to find the cause. Also, I rushed to you because this situation happened during sleep hours, so I'm a little stressed about it.

>> Back to your question. After following your instructions to set up the mod_wsgi plugin to monitor the mod_wsgi processes in New Relic, I took a look at the data. It seemed normal to me. Before the 503 situation, "Average Memory Used (bytes/process)" and "Process Count" did not show great changes. But looking at the data from "Server", the Apache processes are still accumulating memory. So, can we conclude that the Apache child worker processes are growing in memory?

>> Also, as I posted previously, the main Apache error log shows:

>> [Wed Oct 29 12:56:26.727197 2014] [mpm_worker:error] [pid 1322:tid 139958218430336] AH00287: server is within MinSpareThreads of MaxRequestWorkers, consider raising the MaxRequestWorkers setting
>> [Wed Oct 29 12:56:30.730902 2014] [mpm_worker:error] [pid 1322:tid 139958218430336] AH00286: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting

>> I changed the config accordingly, but the 503 situation still happened:
>> <IfModule mpm_worker_module>
>>     StartServers             2
>>     MinSpareThreads         25
>>     MaxSpareThreads         75
>>     ThreadLimit             64
>>     ThreadsPerChild         25
>>     MaxRequestWorkers      400
>>     MaxConnectionsPerChild   0
>> </IfModule>

>> Thanks.

>> On Friday, October 31, 2014 3:30:58 AM UTC+8, Graham Dumpleton wrote:

>> Before I can help you, you need to start answering my questions. If you pepper me with email after email on different things and don't answer the questions I ask you, then we will get nowhere. When I ask questions it is because that information is likely key to my being able to understand the problem and help you.

>> The main question I put to you was about how you are determining which processes are growing in memory.

>> It needs to be properly determined whether the processes which were growing in memory were Apache child worker processes or mod_wsgi daemon processes.

>> The reason for this is that if they are in fact mod_wsgi daemon processes, then you are possibly in part causing the problem yourself by setting MaxConnectionsPerChild. That setting has nothing to do with the mod_wsgi daemon processes, and by setting it you are creating the conditions which cause the sorts of errors you are seeing.

>> So, which processes are growing in memory?

>> Graham

>> On 31/10/2014, at 12:40 AM, Kelvin Wong <[email protected]> wrote:

>> I found that in the main Apache log, when the error is occurring, Apache is doing a graceful restart. How can I avoid the Apache graceful restart?

>> Thanks.

>> On Tuesday, October 28, 2014 1:30:06 PM UTC+8, Graham Dumpleton wrote:

>> I would suggest upgrading to mod_wsgi 4.3.0 if you can, as the error messages for communication problems between the Apache child worker processes and the mod_wsgi daemon processes have been improved.

>> More comments below.

>> On 28 October 2014 15:43, Kelvin Wong <[email protected]> wrote:

>> Hi Graham and everyone else,

>> I'm running multiple sites on Django 1.6.7, Apache/2.4.7 (Ubuntu 14.04), OpenSSL/1.0.1f, mod_wsgi/4.2.5, Python/2.7.6, Server MPM: worker. I found that the server starts returning 504 and then 503, and the following error shows up. I researched some issues related to it, and even added "WSGISocketPrefix /var/run/apache2/wsgi", but the issue still occurred. I have no

>> ...
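(For reference, a minimal daemon-mode configuration pulling together the directives discussed in this thread; the server name and port are placeholders, and the script path is the one quoted above:)

    # Global scope; assumes the WSGI application runs only in daemon mode.
    WSGIRestrictEmbedded On
    WSGISocketPrefix /var/run/apache2/wsgi
    LogLevel info

    <VirtualHost *:80>
        # ServerName below is a placeholder.
        ServerName example.com
        WSGIDaemonProcess site-1 display-name=%{GROUP} \
            maximum-requests=250 graceful-timeout=30
        WSGIProcessGroup site-1
        WSGIScriptAlias / /home/ubuntu/site-1/apache/wsgi.py
    </VirtualHost>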
