I attached the main Apache error log for the same period of time.

For the system setup,
1. I have no backup process for that instance. it's a web app server only.

2. Except the web application, no process connecting to database.

3. I used AWS Elastic load balancer and New Relic pinger to monitor the 
liveness of the website, not specified for Apache. All restart will be 
triggered manually for now.


Thanks.


On Tuesday, November 4, 2014 5:43:28 PM UTC+8, Graham Dumpleton wrote:
>
> Can you supply the main Apache error log for the same period of time? Not 
> all information is in the virtual host specific error log, including key 
> information which captures when an Apache restart was triggered.
>
> Based on the incomplete information from just that log, it opens the 
> possibility that the restart was actually the outcome of problems which 
> arose and not the trigger. I really need that main Apache error log to be 
> sure.
>
> As well as supplying the main Apache error log, can you indicate whether 
> your system is set up with any of these things.
>
> 1. Any sort of backup process, be it file system backups or database 
> backups. If there are such backup services, what time of day do they run.
>
> 2. Any sort of intensive batch update processing which loads data into any 
> database at a specific time of day.
>
> 3. Do you have any monitoring system which checks for the liveness of 
> Apache and which would force a restart if it was unable to poll Apache and 
> thus though it was hung.
>
> Lack of a permalink from NR UI (bottom right of pages) makes it hard to 
> find the specific time of day for the event you are talking about because 
> of uncertainties of timezones, but I see evidence of various other strange 
> things going on at times. Lets see those main Apache error logs first 
> though and then can properly come up with a plausible explanation of what 
> may have triggered things.
>
> Graham
>
> On 04/11/2014, at 2:53 AM, Kelvin Wong <[email protected] <javascript:>> 
> wrote:
>
> I attached the error log. See it is enough.
>
>
> On Monday, November 3, 2014 8:10:00 PM UTC+8, Graham Dumpleton wrote:
>
> Can you provide further back in time and further forward in time.
>
> Basically, from when everything looked normal until everything was back to 
> normal.
>
> I get the feel that this has chopped off bits and not everything I might 
> want to see is there.
>
> Sleep time here. Will then look tomorrow.
>
> Graham
>
> On 03/11/2014, at 10:25 PM, Kelvin Wong <[email protected]> wrote:
>
> I obtained log messages on production when the error happened. The log 
> level is set to 'info'.
>
> [Mon Nov 03 11:11:14.947304 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Shutdown requested 'site-2'.
> [Mon Nov 03 11:11:14.947499 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Stopping process 'site-2'.
> [Mon Nov 03 11:11:14.947521 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Destroying interpreters.
> [Mon Nov 03 11:11:14.947536 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Cleanup interpreter ''.
> [Mon Nov 03 11:11:14.951446 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Shutdown requested 'site-1'.
> [Mon Nov 03 11:11:14.951672 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Stopping process 'site-1'.
> [Mon Nov 03 11:11:14.951693 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Destroying interpreters.
> [Mon Nov 03 11:11:14.951708 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Cleanup interpreter ''.
> [Mon Nov 03 11:11:14.955690 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Shutdown requested 'site-3'.
> [Mon Nov 03 11:11:14.955911 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Stopping process 'site-3'.
> [Mon Nov 03 11:11:14.955933 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Destroying interpreters.
> [Mon Nov 03 11:11:14.955947 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Cleanup interpreter ''.
> [Mon Nov 03 11:11:14.967577 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Terminating Python.
> [Mon Nov 03 11:11:14.970224 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Python has shutdown.
> [Mon Nov 03 11:11:14.970246 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Exiting process 'site-3'.
> [Mon Nov 03 11:11:14.975368 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Terminating Python.
> [Mon Nov 03 11:11:14.978051 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Python has shutdown.
> [Mon Nov 03 11:11:14.978072 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Exiting process 'site-1'.
> [Mon Nov 03 11:11:14.983240 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Terminating Python.
> [Mon Nov 03 11:11:14.986019 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Python has shutdown.
> [Mon Nov 03 11:11:14.986041 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Exiting process 'site-2'.
> [Mon Nov 03 11:11:18.985926 2014] [wsgi:error] [pid 1712:tid 
> 140701685757696] [client xx.xx.xx.xx:13183] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:11:57.511176 2014] [wsgi:error] [pid 2417:tid 
> 140701627008768] [client xx.xx.xx.xx:9079] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:11:59.504832 2014] [wsgi:error] [pid 2026:tid 
> 140701568259840] [client xx.xx.xx.xx:13194] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:12:02.893114 2014] [wsgi:error] [pid 1712:tid 
> 140701593437952] [client xx.xx.xx.xx:9123] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:12:15.471793 2014] [wsgi:error] [pid 2026:tid 
> 140701643794176] [client xx.xx.xx.xx:9145] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:12:18.741610 2014] [wsgi:error] [pid 1712:tid 
> 140701840639744] [client xx.xx.xx.xx:13230] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:19:32.470117 2014] [wsgi:error] [pid 1712:tid 
> 140701840639744] [client xx.xx.xx.xx:9723] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:19:33.336583 2014] [wsgi:error] [pid 2417:tid 
> 140701807068928] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10126] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:36.421071 2014] [wsgi:error] [pid 1712:tid 
> 140701660579584] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10148] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:38.102948 2014] [wsgi:error] [pid 1712:tid 
> 140701677364992] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10157] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:38.186446 2014] [wsgi:error] [pid 1712:tid 
> 140701685757696] [client xx.xx.xx.xx:13868] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:19:42.206195 2014] [wsgi:error] [pid 1712:tid 
> 140701823854336] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14302] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:42.366307 2014] [wsgi:error] [pid 3039:tid 
> 140701568259840] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14306] mod_wsgi (pid=3039): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:46.209203 2014] [wsgi:error] [pid 2417:tid 
> 140701593437952] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14308] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:47.026183 2014] [wsgi:error] [pid 2026:tid 
> 140701677364992] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14312] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:49.092478 2014] [wsgi:error] [pid 1712:tid 
> 140701652186880] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10163] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:49.505943 2014] [wsgi:error] [pid 2417:tid 
> 140701840639744] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14314] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:49.692284 2014] [wsgi:error] [pid 2417:tid 
> 140701719328512] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10173] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:49.725615 2014] [wsgi:error] [pid 2026:tid 
> 140701568259840] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14316] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:52.273296 2014] [wsgi:error] [pid 2026:tid 
> 140701685757696] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10175] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:53.356815 2014] [wsgi:error] [pid 2026:tid 
> 140701823854336] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14318] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:57.364586 2014] [wsgi:error] [pid 2026:tid 
> 140701585045248] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10178] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:57.620572 2014] [wsgi:error] [pid 2417:tid 
> 140701652186880] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14324] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:01.508145 2014] [wsgi:error] [pid 2026:tid 
> 140701719328512] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14331] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:03.685263 2014] [wsgi:error] [pid 3039:tid 
> 140701576652544] [client xx.xx.xx.xx:9777] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:20:07.131191 2014] [wsgi:error] [pid 2026:tid 
> 140701610223360] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14337] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:07.210243 2014] [wsgi:error] [pid 2026:tid 
> 140701601830656] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10185] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:10.990187 2014] [wsgi:error] [pid 2417:tid 
> 140701660579584] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10206] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:12.343626 2014] [wsgi:error] [pid 2026:tid 
> 140701807068928] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14353] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:12.383353 2014] [wsgi:error] [pid 2417:tid 
> 140701601830656] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10210] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:12.514056 2014] [wsgi:error] [pid 2026:tid 
> 140701568259840] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10213] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:13.106436 2014] [wsgi:error] [pid 2026:tid 
> 140701677364992] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14356] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:13.367651 2014] [wsgi:error] [pid 3039:tid 
> 140701668972288] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10219] mod_wsgi (pid=3039): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:17.423692 2014] [wsgi:error] [pid 2417:tid 
> 140701610223360] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14359] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:17.450428 2014] [wsgi:error] [pid 3039:tid 
> 140701815461632] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10221] mod_wsgi (pid=3039): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:20.733221 2014] [wsgi:info] [pid 3794:tid 
> 140056943240960] mod_wsgi (pid=3794): Maximum requests reached, triggering 
> immediate shutdown 'site-1'.
> [Mon Nov 03 11:20:20.733323 2014] [wsgi:info] [pid 3794:tid 
> 140057235269504] mod_wsgi (pid=3794): Shutdown requested 'site-1'.
> [Mon Nov 03 11:20:20.733511 2014] [wsgi:info] [pid 3794:tid 
> 140057235269504] mod_wsgi (pid=3794): Stopping process 'site-1'.
> [Mon Nov 03 11:20:20.733528 2014] [wsgi:info] [pid 3794:tid 
> 140057235269504] mod_wsgi (pid=3794): Destroying interpreters.
> [Mon Nov 03 11:20:20.733538 2014] [wsgi:info] [pid 3794:tid 
> 140057235269504] mod_wsgi (pid=3794): Cleanup interpreter ''.
> [Mon Nov 03 11:20:20.734308 2014] [wsgi:error] [pid 3794:tid 
> 140057235269504] 2014-11-03 11:20:20,734 (3794/Dummy-7) newrelic.core.agent 
> INFO - New Relic Python Agent Shutdown
> [Mon Nov 03 11:20:20.680893 2014] [wsgi:error] [pid 1712:tid 
> 140701710935808] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10226] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
>
> Please advise. Thanks.
>
>
> On Monday, November 3, 2014 6:43:19 PM UTC+8, Graham Dumpleton wrote:
>
> On 03/11/2014, at 9:35 PM, Kelvin Wong <[email protected]> wrote:
>
> I don't think the original error situation is caused by server overloading 
> as the traffic of the server on that time it's fine.
>
>
> I didn't say it was.
>
> I just want to rule out some conditions as returning 503 can be caused by 
> overloading the service.
>
> Overloading the servers can produce 503 but the error log messages do not 
> match with the original error log message.
>
>
> And I am only addressing the message:
>
> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] 
> [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: 
> [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI 
> daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.
>
> That area of the code was rewritten in 4.X versions at some point and 
> introduced non blocking connects. Looking more at that specific message and 
> knowing the possibility that you were load testing I came across that 
> difference in Linux behaviour compared to what POSIX specifies.
>
> So all I can do is address that specific point and that it is all I can do 
> until I can get those info level messages from the error logs so I can 
> understand whether the mod_wsgi daemon processes are actually restarting at 
> the same time.
>
> Graham
>
>
> On Monday, November 3, 2014 5:58:47 PM UTC+8, Graham Dumpleton wrote:
>
> I still want any log messages either side of that point as it may confirm 
> what is going on, but what you might be encountering could be due to Linux 
> behaviour when performing a connect() on a UNIX socket in non blocking mode 
> not being POSIX compliant. Details:
>
> http://lists.mysql.com/internals/35082
>
> The mod_wsgi code is possibly not dealing with Linux's non POSIX 
> compliance and instead of interpreting it the same as connection refused 
> (when the listener socket backlog is full), it is regarding it as an 
> outright failure.
>
> So the cause of this if you are trying to do load testing is that you are 
> overwhelming the capacity of the mod_wsgi daemon processes causing the 
> socket listener queue to fill up. When it fills up then it is triggering 
> the non POSIX behaviour of Linux.
>
> Can you confirm that you only saw this when you overwhelmed the server 
> with more traffic that you would realistically expect it to receive and way 
> beyond the capacity it could handle based on how it is configured?
>
> Graham
>
> On 03/11/2014, at 8:33 PM, Graham Dumpleton <[email protected]> wrote:
>
> As I said before, I need to see messages either side of the event and with 
> Apache log level set to info.
>
> Graham
>
> On 03/11/2014, at 8:28 PM, Kelvin Wong <[email protected]> wrote:
>
> I tried some load testing on the server to see if it is caused by 
> overloading.
>
> I found there will be some other log message when the server is 
> overloading.
> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:53.795864 2014] [wsgi:error] 
> [pid 3724:tid 140372768429824] [client xx.xx.xx.xx:26561] Truncated or 
> oversized response headers received from daemon process 'site-1': 
> /home/ubuntu/site-1/apache/wsgi.py
> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] 
> [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: 
> [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI 
> daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.
>
> Then I concluded that the situation is not caused by overloading the 
> server.
>
> Do you have any directions for me to reproduce the situation?
>
> Thanks.
>
>
> On Monday, November 3, 2014 11:09:33 AM UTC+8, Kelvin Wong wrote:
>
> I found no cron is running on root and user.
>
> According to the logrotate config, the log rotate weekly.
> But the error situation is likely happened nightly.
>
> So I think it will not be my concern.
>
>
> On Monday, November 3, 2014 11:00:10 AM UTC+8, Graham Dumpleton wrote:
>
> And when is 'cron' configured to run logrotate? Does it correspond to when 
> you are seeing the issue?
>
> On 03/11/2014, at 1:56 PM, Kelvin Wong <[email protected]> wrote:
>
> Yes, the memory usage is under control now and I increased 
> maximum-requests by 4 times.
>
> For the error, 
> Timeout when reading response headers from daemon process 'site-1': /home/
> ubuntu/site-1/apache/wsgi.py
> came out as sudden, that's why I capture this message only. I also found 
> main Apache error log nothing special.
>
> Now, I turned the log level to 'info' in order to obtain more messages.
>
> About the log file rotation, I only found the ubuntu default rotation one 
> which is using logrotate.
> /var/log/apache2/*.log {
>         weekly
>         missingok
>         rotate 52
>         compress
>         delaycompress
>         notifempty
>         create 640 root adm
>         sharedscripts
>         postrotate
>                 if /etc/init.d/apache2 status > /dev/null ; then \
>                     /etc/init.d/apache2 reload > /dev/null; \
>                 fi;
>         endscript
>         prerotate
>                 if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
>                         run-parts /etc/logrotate.d/httpd-prerotate; \
>                 fi; \
>         endscript
> }
> It will do Apache reload on postrotate. I research a little bit, the 
> reload is needed for Apache handling file.
> Some suggested "copytruncate" on the config to avoid reload. Will you also 
> suggest that?
>
>
> Thanks.
>
>
> On Monday, November 3, 2014 8:00:10 AM UTC+8, Graham Dumpleton wrote:
>
>
> On 02/11/2014, at 2:18 AM, Kelvin Wong <[email protected]> wrote:
>
>
> I have some updates on the issue as it happened again. This time I 
> captured some data from New Relic.
>
> From the Requests.png, I found that there is an spike aroung 22:23 and 
> check the Apache log. I found there are many as following.
> Timeout when reading response headers from daemon process 'site-1': /home/
> ubuntu/site-1/apache/wsgi.py
>
> Spike also found on the Workers.png during that time period. The graph is 
> showing most of the worker is doing write?
> After a few minute, the following log appear again, and start returning 
> 503.
> (11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi 
> (pid=27816): Unable to connect to WSGI daemon process 'site-1' on 
> '/var/run/apache2/wsgi.17227.2.3.sock'.
>
> Can you tell what happened from those graphs?
>
>
> What I need to see is more of the log file around those events.
>
> Ensure you have LogLevel set to 'info' in the Apache configuration as 
> mod_wsgi will then log more messages about why processes restarted.
>
> Also ensure you are checking for message around the time in the global 
> Apache error log and not just in any virtual host specific error log, as 
> some messages, such as evidence of a restart or process crashes will only 
> appear in the main Apache error log.
>
> With memory growth under control:
>
>
> turning to look at daemon process restarts:
>
>
> they are obviously more frequent. Zooming in, each process is restarting 
> about every 5 minutes, although varies over time.
>
> I would suggest you can increase maximum-requests, possibly as high as 4 
> times what you are using.
>
> When enabling 'info' logging that will reduce the amount of noise in the 
> error logs which may hide the specific anomalous event you are trying to 
> track down.
>
> BTW. Did you look into whether any log file rotation is set up or not and 
> what did you find?
>
> Depending on the system you might be able to find whether something is 
> running by looking at one cron jobs are running for root.
>
>     sudo crontab -l
>
> Graham
>
>
> Thanks.
>
>
> On Friday, October 31, 2014 5:08:27 PM UTC+8, Graham Dumpleton wrote:
>
>
> On 31/10/2014, at 7:26 PM, Kelvin Wong <[email protected]> wrote:
>
> Oh. I see. Thanks.
>
> One more question.
> As I set the graceful-timeout on mod_wsgi process, then when Apache do 
> graceful restart, will mod_wsgi obey that timeout setting?
>
>
> No. It is impossible. There is simply no way for the mod_wsgi daemon 
> processes to act in coordination with the Apache child worker processes on 
> an Apache graceful restart.
>
> This is because the only way that mod_wsgi can create its distinct child 
> processes and have them managed by Apache at all is using a mechanism that 
> in the Apache runtime library is referred to as a 'other processes'.
>
> It doesn't matter whether Apache as a whole is signalled for a normal 
> shutdown (SIGHUP) or a graceful restart (SIGUSR1), Apache will in turn 
> always send a SIGINT on to 'other processes' such as those created by 
> mod_wsgi. If those 'other processes' don't shutdown immediately it will 
> signal them again at one second intervals until 3 seconds have elapsed, at 
> which point it will send them a SIGKILL. So Apache only supports a brutal 
> shutdown of that sort for 'other processes'.
>
> For its own child worker processes, to even support a graceful restart is 
> a non trivial thing. It isn't just a matter of sending the child worker 
> processes a SIGUSR1. Instead there is an elaborate scheme involving what is 
> called a pipe of death between the Apache parent and the child worker 
> processes, as well as shared state via shared memory between the Apache 
> parent process so the child worker processes now what sort of shutdown has 
> been triggered. That sort of mechanism to support a graceful shutdown for 
> 'other processes' simply does exist.
>
> So it is how it is and there is no way to change things. In practice it 
> hasn't caused any issues and web requests being interrupted is always a 
> risk that a web application has to deal with anyway.
>
> Graham
>
> On Friday, October 31, 2014 4:16:55 PM UTC+8, Graham Dumpleton wrote:
>
>
>
> On 31/10/2014, at 6:53 PM, Kelvin Wong <[email protected]> wrote:
>
> You means that the OS triggered Apache graceful restart? If so, why the OS 
> do that?
>
>
> Yes and I just told you why, as part of a log rotation mechanism.
>
> Try googling 'log file rotation' and see what mechanism is used on your 
> specific Linux system.
>
> One link about one such solution which is often used is:
>
> http://www.cyberciti.biz/faq/how-do-i-rotate-log-files/
>
> Apache does itself have a way of handling log rotation itself, but likely 
> ignored by Linux distros which use their own preferred mechanism.
>
> Graham
>
>
> On Friday, October 31, 2014 3:45:35 PM UTC+8, Graham Dumpleton wrote:
>
> Your operating system could be signalling Apache during the middle of the 
> night so that the log files can be rotated.
>
> Graham
>
> On 31/10/2014, at 6:36 PM, Kelvin Wong <[email protected]> wrote:
>
> I checked the memory of overall Apache process, it goes up and down a 
> little. It Seemed the "maximum-requests" is doing its job.
>
> I found that when I do "sudo service apache reload", Apache will do 
> graceful restart as it shown in main Apache error log.
>
>
> [Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27816:tid 
> 140052910765824] (11)Resource temporarily unavailable: [client xx.xx.xx.xx
> :xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process 
> 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.
> When before above situation happened, Apache also doing graceful restart.
> But this should not be triggered by anyone as I mentioned it happened 
> during sleep time.
> Is there any reasons that Apache do graceful restart by itself?
>
> Thanks.
>
>
> On Friday, October 31, 2014 2:55:32 PM UTC+8, Graham Dumpleton wrote:
>
> BTW, in your NR agent configuration file find the 
> error_collector.ignore_errors setting and change it to:
>
> error_collector.ignore_errors = django.http.request:UnreadablePostError
>
> These indicate the HTTP client connection dropping before all content had 
> been sent.
>
> Generally you want to ignore these and eliminate them from your error 
> count as they do not represent an error in your application but are the 
> result of a client problem.
>
> Adding that configuration setting will cause them to be ignored and not 
> recorded as an error.
>
> Graham
>
> On 31/10/2014, at 5:41 PM, Graham Dumpleton <[email protected]> wrote:
>
>
> On 31/10/2014, at 5:04 PM, Kelvin Wong <[email protected]> wrote:
>
> Thanks for the details!
>
> I now have a better picture on how it works.
> I employed all the recommendations you suggested. I will keep monitoring 
> the processes.
>
> But how about the graceful restart signaled by Apache? Why it happened?
>
>
> As far as the message:
>
> [Tue Oct 28 02:20:40.722140 2014] [wsgi:error] [pid 24158:tid 
> 140182690981632] (2)No such 
> file or directory: [client 24.171.250.159:60769] mod_wsgi (<
>
> ...

-- 
You received this message because you are subscribed to the Google Groups 
"modwsgi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/modwsgi.
For more options, visit https://groups.google.com/d/optout.

Attachment: main_error.log
Description: Binary data

Reply via email to