I obtained log messages on production when the error happened. The log level is set to 'info'.
[Mon Nov 03 11:11:14.947304 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Shutdown requested 'site-2'. [Mon Nov 03 11:11:14.947499 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Stopping process 'site-2'. [Mon Nov 03 11:11:14.947521 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Destroying interpreters. [Mon Nov 03 11:11:14.947536 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Cleanup interpreter ''. [Mon Nov 03 11:11:14.951446 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Shutdown requested 'site-1'. [Mon Nov 03 11:11:14.951672 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Stopping process 'site-1'. [Mon Nov 03 11:11:14.951693 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Destroying interpreters. [Mon Nov 03 11:11:14.951708 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Cleanup interpreter ''. [Mon Nov 03 11:11:14.955690 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Shutdown requested 'site-3'. [Mon Nov 03 11:11:14.955911 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Stopping process 'site-3'. [Mon Nov 03 11:11:14.955933 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Destroying interpreters. [Mon Nov 03 11:11:14.955947 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Cleanup interpreter ''. [Mon Nov 03 11:11:14.967577 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Terminating Python. [Mon Nov 03 11:11:14.970224 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Python has shutdown. [Mon Nov 03 11:11:14.970246 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Exiting process 'site-3'. [Mon Nov 03 11:11:14.975368 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Terminating Python. [Mon Nov 03 11:11:14.978051 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Python has shutdown. [Mon Nov 03 11:11:14.978072 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Exiting process 'site-1'. [Mon Nov 03 11:11:14.983240 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Terminating Python. [Mon Nov 03 11:11:14.986019 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Python has shutdown. [Mon Nov 03 11:11:14.986041 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Exiting process 'site-2'. [Mon Nov 03 11:11:18.985926 2014] [wsgi:error] [pid 1712:tid 140701685757696] [client xx.xx.xx.xx:13183] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py [Mon Nov 03 11:11:57.511176 2014] [wsgi:error] [pid 2417:tid 140701627008768] [client xx.xx.xx.xx:9079] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py [Mon Nov 03 11:11:59.504832 2014] [wsgi:error] [pid 2026:tid 140701568259840] [client xx.xx.xx.xx:13194] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py [Mon Nov 03 11:12:02.893114 2014] [wsgi:error] [pid 1712:tid 140701593437952] [client xx.xx.xx.xx:9123] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py [Mon Nov 03 11:12:15.471793 2014] [wsgi:error] [pid 2026:tid 140701643794176] [client xx.xx.xx.xx:9145] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py [Mon Nov 03 11:12:18.741610 2014] [wsgi:error] [pid 1712:tid 140701840639744] [client xx.xx.xx.xx:13230] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py [Mon Nov 03 11:19:32.470117 2014] [wsgi:error] [pid 1712:tid 140701840639744] [client xx.xx.xx.xx:9723] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py [Mon Nov 03 11:19:33.336583 2014] [wsgi:error] [pid 2417:tid 140701807068928] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10126] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:36.421071 2014] [wsgi:error] [pid 1712:tid 140701660579584] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10148] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:38.102948 2014] [wsgi:error] [pid 1712:tid 140701677364992] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10157] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:38.186446 2014] [wsgi:error] [pid 1712:tid 140701685757696] [client xx.xx.xx.xx:13868] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py [Mon Nov 03 11:19:42.206195 2014] [wsgi:error] [pid 1712:tid 140701823854336] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14302] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:42.366307 2014] [wsgi:error] [pid 3039:tid 140701568259840] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14306] mod_wsgi (pid=3039): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:46.209203 2014] [wsgi:error] [pid 2417:tid 140701593437952] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14308] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:47.026183 2014] [wsgi:error] [pid 2026:tid 140701677364992] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14312] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:49.092478 2014] [wsgi:error] [pid 1712:tid 140701652186880] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10163] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:49.505943 2014] [wsgi:error] [pid 2417:tid 140701840639744] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14314] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:49.692284 2014] [wsgi:error] [pid 2417:tid 140701719328512] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10173] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:49.725615 2014] [wsgi:error] [pid 2026:tid 140701568259840] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14316] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:52.273296 2014] [wsgi:error] [pid 2026:tid 140701685757696] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10175] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:53.356815 2014] [wsgi:error] [pid 2026:tid 140701823854336] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14318] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:57.364586 2014] [wsgi:error] [pid 2026:tid 140701585045248] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10178] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:19:57.620572 2014] [wsgi:error] [pid 2417:tid 140701652186880] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14324] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:01.508145 2014] [wsgi:error] [pid 2026:tid 140701719328512] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14331] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:03.685263 2014] [wsgi:error] [pid 3039:tid 140701576652544] [client xx.xx.xx.xx:9777] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py [Mon Nov 03 11:20:07.131191 2014] [wsgi:error] [pid 2026:tid 140701610223360] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14337] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:07.210243 2014] [wsgi:error] [pid 2026:tid 140701601830656] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10185] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:10.990187 2014] [wsgi:error] [pid 2417:tid 140701660579584] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10206] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:12.343626 2014] [wsgi:error] [pid 2026:tid 140701807068928] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14353] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:12.383353 2014] [wsgi:error] [pid 2417:tid 140701601830656] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10210] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:12.514056 2014] [wsgi:error] [pid 2026:tid 140701568259840] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10213] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:13.106436 2014] [wsgi:error] [pid 2026:tid 140701677364992] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14356] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:13.367651 2014] [wsgi:error] [pid 3039:tid 140701668972288] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10219] mod_wsgi (pid=3039): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:17.423692 2014] [wsgi:error] [pid 2417:tid 140701610223360] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14359] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:17.450428 2014] [wsgi:error] [pid 3039:tid 140701815461632] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10221] mod_wsgi (pid=3039): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. [Mon Nov 03 11:20:20.733221 2014] [wsgi:info] [pid 3794:tid 140056943240960] mod_wsgi (pid=3794): Maximum requests reached, triggering immediate shutdown 'site-1'. [Mon Nov 03 11:20:20.733323 2014] [wsgi:info] [pid 3794:tid 140057235269504] mod_wsgi (pid=3794): Shutdown requested 'site-1'. [Mon Nov 03 11:20:20.733511 2014] [wsgi:info] [pid 3794:tid 140057235269504] mod_wsgi (pid=3794): Stopping process 'site-1'. [Mon Nov 03 11:20:20.733528 2014] [wsgi:info] [pid 3794:tid 140057235269504] mod_wsgi (pid=3794): Destroying interpreters. [Mon Nov 03 11:20:20.733538 2014] [wsgi:info] [pid 3794:tid 140057235269504] mod_wsgi (pid=3794): Cleanup interpreter ''. [Mon Nov 03 11:20:20.734308 2014] [wsgi:error] [pid 3794:tid 140057235269504] 2014-11-03 11:20:20,734 (3794/Dummy-7) newrelic.core.agent INFO - New Relic Python Agent Shutdown [Mon Nov 03 11:20:20.680893 2014] [wsgi:error] [pid 1712:tid 140701710935808] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10226] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. Please advise. Thanks. On Monday, November 3, 2014 6:43:19 PM UTC+8, Graham Dumpleton wrote: > > On 03/11/2014, at 9:35 PM, Kelvin Wong <[email protected] <javascript:>> > wrote: > > I don't think the original error situation is caused by server overloading > as the traffic of the server on that time it's fine. > > > I didn't say it was. > > I just want to rule out some conditions as returning 503 can be caused by > overloading the service. > > Overloading the servers can produce 503 but the error log messages do not > match with the original error log message. > > > And I am only addressing the message: > > 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] > [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: > [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI > daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'. > > That area of the code was rewritten in 4.X versions at some point and > introduced non blocking connects. Looking more at that specific message and > knowing the possibility that you were load testing I came across that > difference in Linux behaviour compared to what POSIX specifies. > > So all I can do is address that specific point and that it is all I can do > until I can get those info level messages from the error logs so I can > understand whether the mod_wsgi daemon processes are actually restarting at > the same time. > > Graham > > > On Monday, November 3, 2014 5:58:47 PM UTC+8, Graham Dumpleton wrote: > > I still want any log messages either side of that point as it may confirm > what is going on, but what you might be encountering could be due to Linux > behaviour when performing a connect() on a UNIX socket in non blocking mode > not being POSIX compliant. Details: > > http://lists.mysql.com/internals/35082 > > The mod_wsgi code is possibly not dealing with Linux's non POSIX > compliance and instead of interpreting it the same as connection refused > (when the listener socket backlog is full), it is regarding it as an > outright failure. > > So the cause of this if you are trying to do load testing is that you are > overwhelming the capacity of the mod_wsgi daemon processes causing the > socket listener queue to fill up. When it fills up then it is triggering > the non POSIX behaviour of Linux. > > Can you confirm that you only saw this when you overwhelmed the server > with more traffic that you would realistically expect it to receive and way > beyond the capacity it could handle based on how it is configured? > > Graham > > On 03/11/2014, at 8:33 PM, Graham Dumpleton <[email protected]> wrote: > > As I said before, I need to see messages either side of the event and with > Apache log level set to info. > > Graham > > On 03/11/2014, at 8:28 PM, Kelvin Wong <[email protected]> wrote: > > I tried some load testing on the server to see if it is caused by > overloading. > > I found there will be some other log message when the server is > overloading. > 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:53.795864 2014] [wsgi:error] > [pid 3724:tid 140372768429824] [client xx.xx.xx.xx:26561] Truncated or > oversized response headers received from daemon process 'site-1': > /home/ubuntu/site-1/apache/wsgi.py > 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] > [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: > [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI > daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'. > > Then I concluded that the situation is not caused by overloading the > server. > > Do you have any directions for me to reproduce the situation? > > Thanks. > > > On Monday, November 3, 2014 11:09:33 AM UTC+8, Kelvin Wong wrote: > > I found no cron is running on root and user. > > According to the logrotate config, the log rotate weekly. > But the error situation is likely happened nightly. > > So I think it will not be my concern. > > > On Monday, November 3, 2014 11:00:10 AM UTC+8, Graham Dumpleton wrote: > > And when is 'cron' configured to run logrotate? Does it correspond to when > you are seeing the issue? > > On 03/11/2014, at 1:56 PM, Kelvin Wong <[email protected]> wrote: > > Yes, the memory usage is under control now and I increased > maximum-requests by 4 times. > > For the error, > Timeout when reading response headers from daemon process 'site-1': /home/ > ubuntu/site-1/apache/wsgi.py > came out as sudden, that's why I capture this message only. I also found > main Apache error log nothing special. > > Now, I turned the log level to 'info' in order to obtain more messages. > > About the log file rotation, I only found the ubuntu default rotation one > which is using logrotate. > /var/log/apache2/*.log { > weekly > missingok > rotate 52 > compress > delaycompress > notifempty > create 640 root adm > sharedscripts > postrotate > if /etc/init.d/apache2 status > /dev/null ; then \ > /etc/init.d/apache2 reload > /dev/null; \ > fi; > endscript > prerotate > if [ -d /etc/logrotate.d/httpd-prerotate ]; then \ > run-parts /etc/logrotate.d/httpd-prerotate; \ > fi; \ > endscript > } > It will do Apache reload on postrotate. I research a little bit, the > reload is needed for Apache handling file. > Some suggested "copytruncate" on the config to avoid reload. Will you also > suggest that? > > > Thanks. > > > On Monday, November 3, 2014 8:00:10 AM UTC+8, Graham Dumpleton wrote: > > > On 02/11/2014, at 2:18 AM, Kelvin Wong <[email protected]> wrote: > > > I have some updates on the issue as it happened again. This time I > captured some data from New Relic. > > From the Requests.png, I found that there is an spike aroung 22:23 and > check the Apache log. I found there are many as following. > Timeout when reading response headers from daemon process 'site-1': /home/ > ubuntu/site-1/apache/wsgi.py > > Spike also found on the Workers.png during that time period. The graph is > showing most of the worker is doing write? > After a few minute, the following log appear again, and start returning > 503. > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi > (pid=27816): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.17227.2.3.sock'. > > Can you tell what happened from those graphs? > > > What I need to see is more of the log file around those events. > > Ensure you have LogLevel set to 'info' in the Apache configuration as > mod_wsgi will then log more messages about why processes restarted. > > Also ensure you are checking for message around the time in the global > Apache error log and not just in any virtual host specific error log, as > some messages, such as evidence of a restart or process crashes will only > appear in the main Apache error log. > > With memory growth under control: > > > turning to look at daemon process restarts: > > > they are obviously more frequent. Zooming in, each process is restarting > about every 5 minutes, although varies over time. > > I would suggest you can increase maximum-requests, possibly as high as 4 > times what you are using. > > When enabling 'info' logging that will reduce the amount of noise in the > error logs which may hide the specific anomalous event you are trying to > track down. > > BTW. Did you look into whether any log file rotation is set up or not and > what did you find? > > Depending on the system you might be able to find whether something is > running by looking at one cron jobs are running for root. > > sudo crontab -l > > Graham > > > Thanks. > > > On Friday, October 31, 2014 5:08:27 PM UTC+8, Graham Dumpleton wrote: > > > On 31/10/2014, at 7:26 PM, Kelvin Wong <[email protected]> wrote: > > Oh. I see. Thanks. > > One more question. > As I set the graceful-timeout on mod_wsgi process, then when Apache do > graceful restart, will mod_wsgi obey that timeout setting? > > > No. It is impossible. There is simply no way for the mod_wsgi daemon > processes to act in coordination with the Apache child worker processes on > an Apache graceful restart. > > This is because the only way that mod_wsgi can create its distinct child > processes and have them managed by Apache at all is using a mechanism that > in the Apache runtime library is referred to as a 'other processes'. > > It doesn't matter whether Apache as a whole is signalled for a normal > shutdown (SIGHUP) or a graceful restart (SIGUSR1), Apache will in turn > always send a SIGINT on to 'other processes' such as those created by > mod_wsgi. If those 'other processes' don't shutdown immediately it will > signal them again at one second intervals until 3 seconds have elapsed, at > which point it will send them a SIGKILL. So Apache only supports a brutal > shutdown of that sort for 'other processes'. > > For its own child worker processes, to even support a graceful restart is > a non trivial thing. It isn't just a matter of sending the child worker > processes a SIGUSR1. Instead there is an elaborate scheme involving what is > called a pipe of death between the Apache parent and the child worker > processes, as well as shared state via shared memory between the Apache > parent process so the child worker processes now what sort of shutdown has > been triggered. That sort of mechanism to support a graceful shutdown for > 'other processes' simply does exist. > > So it is how it is and there is no way to change things. In practice it > hasn't caused any issues and web requests being interrupted is always a > risk that a web application has to deal with anyway. > > Graham > > On Friday, October 31, 2014 4:16:55 PM UTC+8, Graham Dumpleton wrote: > > > > On 31/10/2014, at 6:53 PM, Kelvin Wong <[email protected]> wrote: > > You means that the OS triggered Apache graceful restart? If so, why the OS > do that? > > > Yes and I just told you why, as part of a log rotation mechanism. > > Try googling 'log file rotation' and see what mechanism is used on your > specific Linux system. > > One link about one such solution which is often used is: > > http://www.cyberciti.biz/faq/how-do-i-rotate-log-files/ > > Apache does itself have a way of handling log rotation itself, but likely > ignored by Linux distros which use their own preferred mechanism. > > Graham > > > On Friday, October 31, 2014 3:45:35 PM UTC+8, Graham Dumpleton wrote: > > Your operating system could be signalling Apache during the middle of the > night so that the log files can be rotated. > > Graham > > On 31/10/2014, at 6:36 PM, Kelvin Wong <[email protected]> wrote: > > I checked the memory of overall Apache process, it goes up and down a > little. It Seemed the "maximum-requests" is doing its job. > > I found that when I do "sudo service apache reload", Apache will do > graceful restart as it shown in main Apache error log. > > > [Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27816:tid > 140052910765824] (11)Resource temporarily unavailable: [client xx.xx.xx.xx > :xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process > 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'. > When before above situation happened, Apache also doing graceful restart. > But this should not be triggered by anyone as I mentioned it happened > during sleep time. > Is there any reasons that Apache do graceful restart by itself? > > Thanks. > > > On Friday, October 31, 2014 2:55:32 PM UTC+8, Graham Dumpleton wrote: > > BTW, in your NR agent configuration file find the > error_collector.ignore_errors setting and change it to: > > error_collector.ignore_errors = django.http.request:UnreadablePostError > > These indicate the HTTP client connection dropping before all content had > been sent. > > Generally you want to ignore these and eliminate them from your error > count as they do not represent an error in your application but are the > result of a client problem. > > Adding that configuration setting will cause them to be ignored and not > recorded as an error. > > Graham > > On 31/10/2014, at 5:41 PM, Graham Dumpleton <[email protected]> wrote: > > > On 31/10/2014, at 5:04 PM, Kelvin Wong <[email protected]> wrote: > > Thanks for the details! > > I now have a better picture on how it works. > I employed all the recommendations you suggested. I will keep monitoring > the processes. > > But how about the graceful restart signaled by Apache? Why it happened? > > > As far as the message: > > [Tue Oct 28 02:20:40.722140 2014] [wsgi:error] [pid 24158:tid > 140182690981632] (2)No such > file or directory: [client 24.171.250.159:60769] mod_wsgi (pid=24158): Unable > to > connect to WSGI daemon > process 'snaptee-production-api-ssl' on'/var/run/apache2/wsgi.30188.7.3.sock'. > > goes, as I said I believe that could only be the result of an Apache > graceful restart. > > My recollection is that a 'reload' under a Linux system using its process > management tools triggers an Apache graceful restart where as a 'restart' > does a normal Apache restart. > > I believe you said you used 'reload' and so that would be the source of > that. > > The whole reason though I have been trying to sort out what processes > actually had memory issues though was to avoid MaxConnectionsPerChild as > well to ensure that that wasn't a source of the problem as well. > > Now that we have you on a better configuration lets see now whether it > happens again, rather than worrying too much about what happened in the > past. Just note when you happen to do a 'reload', and so understand that if > seen at that time that that was likely the cause. > > Once things settle down for a while then can look at your > processes/threads configuration and capacity utilisation and so work out > whether a poor configuration there is causing problems. Can also see > whether you are getting stuck requests as well which are eating into > capacity and which could result in some of the errors you were seeing when > any sort of restart was done. > > So slowly slowly. Resolve one issue at a time. Changing too many things at > once will only confuse and you will not get a sense of what changes are > actually making things better otherwise. > > Graham > > > Thanks again. > > > On Friday, October 31, 2014 12:29:12 PM UTC+8, Graham Dumpleton wrote: > > So lets look at just memory initially. > > When we talk about Apache processes, there are actually going to be three > different classes of processes. > > These are: > > 1. The Apache parent process. > > This runs as root. Because it runs as root and not the Apache user > (www-data) that the Apache child worker processes run as, then it can be > seen separately on the NR server monitoring. > > There is no memory increase here. It is actually very important that the > Apache parent process doesn't increase in size over time with restarts as > when it forks child process they would inherit that increasing memory size. > One of the reason that using mod_python is a bad idea is that it used to > initialise the Python interpreter in the Apache parent process and when > restarted, the Python interpreter would leak memory and so the Apache > parent process would grow in size on each restart. > > Anyway, you are good here and nothing to worry about. > > 2. The Apache child worker processes. > > These run as the Apache user which on your system is www-data. > > The Apache child worker processes are what accepts all inbound HTTP > requests. Static files will be served up by these processes. > > In your case where you are using mod_wsgi daemon processes, the Python web > application does not run in the Apache child worker processes. The Apache > child worker process still accept the requests initially though and would > then proxy the request through to the daemon process running the WSGI > application. > > 3. The mod_wsgi daemon processes. > > By default these also run as the Apache user which on your system is > www-data. > > This is where your WSGI application is running. > > Now both the Apache child worker processes and the mod_wsgi daemon > processes are forked children of the Apache parent process. They will > therefore all show as being the process 'apache2' on your system. > > Because both (1) and (2) are running as the Apache user, NR server > monitoring will not allow you to distinguish between them. > > Thus when you get: > > you aren't looking at just the Apache child worker processes and are also > looking at the mod_wsgi daemon processes. > > The only way you could use NR server monitoring to tell between the two is > if you used the 'user' option to WSGIDaemonProcess and ran the mod_wsgi > daemon processes as a different user again. Then you could distinguish. > > Although you can't distinguish using NR server monitoring, you can on > Linux if you ensure you are naming the processes. That is, when using > WSGIDaemonProcess directive, supply the display-name=%{GROUP} option. This > will allow you to see the mod_wsgi daemon processes as distinct if using > 'ps' or 'htop'. > > Lets instead now look at what the mod_wsgi plugin tells you: > > Over the 24 hours matching last charts we get: > > so nothing conclusive. For a shorter period of one hour we get: > > This shows an increase over time, but you have to be a little bit careful > in interpreting that. If you hover over the memory chart and look at the > popup you will see: > > This shows that the processes actually range in size. There is a range of > sizes because the process churn chart shows that processes are being > restarted at different times. > > Important though is to look at the max value. This is the largest size of > any process. If one moves along the chart for that period one sees that the > maximum is always about 12.9-13.0 MB. > > This indicates to me that the Apache child worker process size has reached > a plateau and that this is the normal behaviour of Apache child process to > grow over time up to a plateau which I believe I noted previously. > > So for the Apache child worker processes this isn't ringing any alarm > bells for me. > > If you have already removed MaxConnectionsPerChild and that maximum > continues along and not increasing too much more over time, then I wouldn't > be seeing a need to be setting MaxConnectionsPerChild. > > Next chart to look at is the memory chart where monitoring the actual > Python web application. > > Over the last 6 hours this shows: > > I am not 100% sure if you can get to this chart on the Lite subscription > level you are on but I can still see it. > > If it is available on Lite subscription, go to the bottom of the overview > dashboard and select on the chart icon for the server breakout. > > So as far as what process is growing in memory size, it is very much the > mod_wsgi daemon processes. > > The problem therefore is going to be an issue with caching in your > application, construction of object cycles that cannot be broken by the > garbage collector, or in worst case the garbage collector isn't even > running due to it dead locking. > > As a short term remedy to at least keep the memory growth under control > for the mod_wsgi daemon processes, you have two choices. > > The first is to use the 'inactivity-timeout' option to WSGIDaemonProcess > directive. > > The behaviour of this is slightly different between mod_wsgi 3.X and 4.X. > > Under 4.X what will happen when inactivity-timeout is set is that if no > requests at all are received by a daemon process group, and there are no > active ones, after the specified period, then the process will be restarted. > > You run up to 150 requests per minute across about 7 process instances. > > > In practice this means that any one process is not even handling one > request per second. > > Even so, what it means is that if you set this to be 30 seconds, then a > process is never going to be idle long enough to be restarted. Can't > therefore use this. > > On mod_wsgi 3.X the difference was that the reading of request content and > writing of response content was monitored and if there was no activity on > those after the timeout period, then the process would be restarted. With > the way it worked, it meant if there was a long running request, or stuck > requests, they would be interrupted. In mod_wsgi 4.X, you should be using > request-timeout instead if the intent is to be able to interrupt long > running or stuck requests. > > The second option you can use instead of inactivity-timeout is the > maximum-requests option to WSGIDaemonProcess. > > This also has changed a bit in mod_wsgi 4.X from 3.X. > > Under 3.X once that many requests were processed by a process, the process > would be restarted. In restarting though, it followed the typical shutdown > sequence for mod_wsgi daemon processes. > > That is, no new requests would be accepted on that process. Currently > running requests would be given a grace period to complete (default 2.5 > seconds). If they did not complete in that time, the process would be more > brutally shutdown. > > The result of this was that long running requests could be interrupted, > resulting in some of those errors you saw before. > > In mod_wsgi 4.X this option has the same default behaviour, but there is a > new way of having the restart be more graceful. > > What you instead use is both maximum-requests and graceful-timeout to > WSGIDaemonProcess at the same time. > > When you also have graceful-timeout option specified, what will happen > when the maximum requests is reached is that a new initial grace period is > entered based on what graceful-timeout is set to. > > During this initial grace period, new requests will still be accepted, but > if at any point during that time there are no active requests, the process > will be restarted immediately. > > If for some reason the initial grace period expires, then the normal > restart starts. That is, stop accepting requests, provide a further 2.5 > seconds grace and then brutally restart. > > So this new graceful-timeout option in conjunction with maximum-requests > means that a process can be restarted but with a better ability to do so > without interrupting any requests. > > In your case because there is less than 1 request per second on average > across all processes, there should be a lot of opportunity for a restart to > occur in the gaps in between those infrequent requests. > > My recommendations at this point are: > > 1. Set MaxConnectionsPerChild to 0 so that Apache child worker processes > aren't being recycled unnecessarily. > > Monitor the mod_wsgi plugin chart 'Average Memory Useds (bytes/process)' > to ensure that the Apache child worker processes aren't growing in size in > an unbounded way above the typical max you are seeing now. They should > simply plateau after a while. Keep an eye on max in the popup rather than > the average as the latter can be deceiving as Apache child worker processes > can still be restarted by Apache because of its dynamic scaling feature, > thereby confusing the results you see. > > 2. Add to WSGIDaemonProcess the maximum-requests and graceful-timeout > option. > > As you are running only about 20 requests per minute for process, set > maximum-requests at about 250 initially, which would be about every 10 > minutes or so at that rate. If the processes are getting recycled too often > and the startup cost is too much, then increase it. At the same time though > keep an eye on that memory chart in the server breakout on the main > dashboard to ensure memory not growing too much before the process gets > recycled. > > At the same time, make sure you set graceful-timeout. A safe figure for > this initially would be 30. That is the number of seconds to wait to see if > can restart when no requests, before a more brutal restart. > > 3. If you haven't done it already and you are definitely only using > mod_wsgi daemon mode for running your WSGI application, ensure you are > setting WSGIRestrictEmbedded to On at global scope in the Apache > configuration outside of any VirtualHost. > > So as an initial remedial action only, to combat memory growth, do that > for now. This will give you some breathing space so you can work out why > your web application is growing memory usage continually. > > When done and when it has been running for 6-12 hours, come back and we > can review how the memory growth charts are looking. I can also get to some > other charts you can't when on Lite subscription, which will show details > about the mod_wsgi daemon process restarts and we can see how far apart > they are coming and whether we need to adjust maximum-requests. > > As I say though, this is just to get you in a better position while you > work out your memory leakage problems. > > A next step I would like to do is have you install an extension to the NR > Python agent which allows the activities of the garbage collector to be > monitored to make sure it is running and whether uncollectable objects are > accumulating. You would not normally be able to use this with a Lite > account but I will be able to see the data. Before I can give you that > extension to try I need to wait on some internal approvals to release it, > which are already in train due to us needing to provide it to some other > users who are seeing memory growth issues which we are trying to help them > with. > > For interests sake, the charts it can produce are: > > [image: custom-dashboard.jpg] > Anyway, work slowy through what I have said above and understand it. Get > those few configuration changes made and then we will look at the next > steps. > > Graham > > On 31/10/2014, at 1:12 PM, Kelvin Wong <[email protected]> wrote: > > I'm sorry to throw too much questions to you. I just want to provide more > information to make it easier to find the cause. > Also, I rushed to you because this situation happened during sleep time. > So I'm little stressed about it. > > Back to your question. > After following your instructions to setup mod_wsgi plugin to monitor the > mod_wsgi processes in New Relic, I take a look on the data. > I found that it seemed normal to me. Before the 503 situation, "Average > Memory Us > > ... -- You received this message because you are subscribed to the Google Groups "modwsgi" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/modwsgi. For more options, visit https://groups.google.com/d/optout.
