Thanks for your suggestions. I will try the request-timeout and queue-timeout settings for process recovery.
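For reference, the daemon process group changes I plan to try would look something like this (illustrative only; the process group name and values here are placeholders, not my actual config):

```apache
# Illustrative sketch: restart a daemon process when the average active
# request time exceeds 60s, and discard queued requests older than 45s.
WSGIDaemonProcess site-1 processes=1 threads=25 \
    request-timeout=60 queue-timeout=45

# Lower Apache's own Timeout from the packaged 300 to the builtin default.
Timeout 60
```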
In the Apache configuration, the Timeout directive is currently set to the packaged default of 300; I will set it to 60. Can you suggest some guides or techniques for spotting the deadlocked requests? Can they be spotted in code, or by adding some logging to the code? Thanks again.

On Tuesday, November 4, 2014 8:26:38 PM UTC+8, Graham Dumpleton wrote:
>
> Here is what I have.
>
> You appear to be running daemon process groups with only a single process.
> For one of those daemon process groups, the daemon process hung.
>
> The lifetime of the process is shown by the following logs.
>
> Daemon logging:
>
> [Mon Nov 03 10:39:25.824186 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Attach interpreter ''.
> [Mon Nov 03 10:39:25.824339 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Adding '/home/ubuntu/site-1/django' to path.
> [Mon Nov 03 10:39:25.824488 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Adding '/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages' to path.
> [Mon Nov 03 10:39:25.826052 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Imported 'mod_wsgi'.
> [Mon Nov 03 10:39:25.828107 2014] [wsgi:info] [pid 2517:tid 140701830735616] [remote xx.xx.xx.xx:64697] mod_wsgi (pid=2517, process='site-1', application=''): Loading WSGI script '/home/ubuntu/site-1/apache/wsgi.py'.
> [Mon Nov 03 10:39:27.205378 2014] [wsgi:error] [pid 2517:tid 140701830735616] 2014-11-03 10:39:27,204 (2517/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.32.0.28)
> [Mon Nov 03 10:39:27.205566 2014] [wsgi:error] [pid 2517:tid 140701637703424] 2014-11-03 10:39:27,205 (2517/Dummy-1) newrelic.core.agent INFO - New Relic Python Agent (2.32.0.28)
> [Mon Nov 03 10:39:30.871519 2014] [wsgi:error] [pid 2517:tid 140701294765824] 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) newrelic.core.data_collector INFO - Successfully registered New Relic Python agent where app_name='Snaptee', pid=2517, redirect_host=u'collector-129.newrelic.com' and agent_run_id=1824811776, in 0.50 seconds.
> [Mon Nov 03 10:39:30.871653 2014] [wsgi:error] [pid 2517:tid 140701294765824] 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) newrelic.core.data_collector INFO - Reporting to: https://rpm.newrelic.com/accounts/375291/applications/2915135
> …
> [Mon Nov 03 11:21:04.860039 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Shutdown requested 'site-1'.
> …
>
> The above logging is missing the full set of messages one would expect to see on process shutdown. This means it was hung, and the Apache parent process would have brutally killed it when it didn't shut down by itself.
>
> Apache logging:
>
> [Mon Nov 03 10:39:25.812190 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Starting process 'site-1-api-ssl' with uid=33, gid=33 and threads=25.
> [Mon Nov 03 10:39:25.812744 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Initializing Python.
> …
> [Mon Nov 03 11:21:04.857993 2014] [mpm_worker:notice] [pid 1237:tid 140701954910080] AH00297: SIGUSR1 received. Doing graceful restart
> [Mon Nov 03 11:21:04.858106 2014] [wsgi:info] [pid 1237] mod_wsgi (pid=2517): Process 'site-1-api-ssl' has been deregistered and will no longer be monitored.
>
> From our own tracking of what the New Relic agent is up to, we had a lifetime for that process of:
>
> 5288315 / 1824811776 ip-10-0-21-120 Snaptee 2517 2014-11-03 02:39:31 2014-11-03 03:01:34 2.32.0.28 Apache/mod_wsgi (daemon)
>
> Our time zone is different, but you will note that it had the process life ending at 1 minute after the hour, whereas the logs show it was killed at 21 minutes past the hour.
>
> That again confirms to me that the process was hung. More significantly, whatever occurred also hung our agent background thread which does the reporting of data. If it was still running, then the life of the process wouldn't have been recorded as ending prematurely. In reality, data reporting would have stopped even before that time, as we don't mark an agent dead straight away when we stop getting data and it didn't send a proper disconnect message.
>
> Unfortunately I can't get anything more useful from the NR metrics, as the Lite plan will start summarising data after a day and lose the ability to meaningfully zoom into historical data.
>
> So the issue is why the process hung.
>
> The main culprit for that in a multithreaded system is thread deadlock.
>
> This could definitely be an issue in your code, but I would expect that if that were the case it would only affect the request handler threads and not the NR agent background thread, as that doesn't call into your code.
>
> The one scenario I do know of that can explain it, which we have experienced ourselves and even reported as a bug with tests to replicate, is an issue in the PostgreSQL client libraries which affects Python web applications that also make external HTTP requests over SSL connections. For whatever reason, use of the NR agent changes the dynamics of a web application enough that it seems to trigger the bug in PostgreSQL more easily.
>
> The details of that PostgreSQL bug can be found at:
>
> http://www.postgresql.org/message-id/[email protected]
>
> We are still waiting to see a fix from the PostgreSQL folks for that.
>
> Now, given that the process is hanging, there are a couple of things that can be done.
>
> The first is to set some options on the mod_wsgi daemon process group which will help with detecting a stuck process and automatically restarting it, so you don't have to get out of bed in the middle of the night and manually restart it.
>
> The second is to get some Python tracebacks for where it is stuck. Luckily (provided it works), the stuck process detection feature will try to log Python stack trace data for all threads on shutdown when it detects such a problem.
>
> The option you want to use here with WSGIDaemonProcess is request-timeout. The trick is going to be working out a good value that is appropriate for your web application and the fact you are using 25 threads.
>
> When one has 1 thread per process, request-timeout works as a hard limit.
>
> That is, when handling a web request takes more than the request-timeout, the process will be restarted to eject the potentially stuck request.
>
> When you have more than 1 thread per process handling requests, a hard limit isn't necessarily a good idea due to the potential to kill off other active requests which are running fine. The graceful-timeout doesn't really help in this case because the stuck thread isn't going to complete anyway, so you end up just going into the more brutal shutdown sequence anyway.
>
> Also, if only one request gets stuck, for example on a backend service call, in your case you would still have 24 other threads, so it seems a bit excessive to restart the whole process when it could chug along quite happily and the request may still complete after a while.
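On the question of spotting deadlocked requests from within the code itself: besides mod_wsgi's own stack-trace logging described above, a WSGI application can log a snapshot of every thread's Python stack on demand using only the standard library. A minimal sketch (the `dump_stacks` helper name is my own; in practice you would call it from a monitoring thread or a signal handler and write the result to the error log):

```python
import sys
import threading
import traceback

def dump_stacks():
    """Return a report of the current Python stack of every thread.

    A deadlocked request thread will show exactly where it is blocked,
    e.g. inside a lock acquire or a database client call.
    """
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):" % (names.get(ident, "?"), ident))
        lines.extend(l.rstrip() for l in traceback.format_stack(frame))
    return "\n".join(lines)

if __name__ == "__main__":
    print(dump_stacks())
```

`sys._current_frames()` is documented but underscore-prefixed, so treat it as a debugging aid rather than a stable API; the `faulthandler` module offers a similar dump in newer Python versions.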
>
> How request-timeout is therefore interpreted in a multithreaded situation is that if you have set it to, say, 45 seconds, then only when the average accumulated response time across all request thread slots for active requests has reached 45 seconds will the process be restarted.
>
> Imagine a worst case scenario where 25 requests came in at the same time and all managed to deadlock on this PostgreSQL issue. As they started at the same time, then 45 seconds later they will all have been running for 45 seconds, and so the process will be restarted.
>
> Now imagine though that only one thread got stuck. After 45 seconds the process wouldn't be restarted, as the average across all request thread slots would be 45/25 if there were no other requests running.
>
> So the algorithm, for now at least, tries to deal with the fact that there may still be threads which can be used, and allows the process to run as long as possible before forcing a restart.
>
> If further threads start getting stuck, then as their running time grows the average will keep growing, and eventually you will hit the 45 second average. This will obviously occur quite quickly if all threads start getting stuck at the same time, as seemed to happen in your case.
>
> My suggestion therefore is to set request-timeout on WSGIDaemonProcess. Select a timeout value (in seconds) which is above your typical worst case response time. The 99th percentile for your requests seems to be up around 25 seconds, so perhaps start with 60 seconds initially.
>
> So, as with maximum-requests, we are putting in place something that will allow your system to keep going and recover, or avoid problems, while you work out the cause of why the process is hanging.
>
> As for the request-timeout:
> Although the process will now recover, you will have a bunch of backlogged requests which will immediately get sent into the daemon process and temporarily overload it. Those requests could have hung around for quite a while, and the users are likely no longer waiting. For these I would suggest setting the queue-timeout option on WSGIDaemonProcess. This will cause the backlog to be discarded with a service unavailable response, allowing the daemon process to recover more cleanly.
>
> Initially, set the queue-timeout to 45 seconds and see how it goes.
>
> If the problem occurs again, you may only know from messages in the Apache error logs about request and queue timeouts. If it occurs, keep an eye out for stack trace information in the logs as well.
>
> There are also various other timeouts one can set to override some defaults. Some defaults are based off the value of the Timeout directive in Apache.
>
> Can you tell me what your Apache configuration currently has the Timeout directive set to, or whether it is using the default? The builtin default for Timeout should be 60. Some sample configuration files override that to 300, which is a ridiculously high value. Based on some of the log messages, I suspect in your system it is set to 300.
>
> Depending on what you say, we may set a few other timeout values as well, so as to best set up recovery for failure cases.
>
> Graham
>
> On 04/11/2014, at 8:58 PM, Kelvin Wong <[email protected] <javascript:>> wrote:
>
> I attached the main Apache error log for the same period of time.
>
> For the system setup:
> 1. I have no backup process for that instance; it's a web app server only.
> 2. Apart from the web application, no process connects to the database.
> 3. I use the AWS Elastic Load Balancer and the New Relic pinger to monitor the liveness of the website, not Apache specifically. All restarts are triggered manually for now.
>
> Thanks.
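The multithreaded request-timeout rule described above (restart only when the average busy time across all request thread slots reaches the limit) can be sketched as follows. This is an illustration of the rule as explained in the thread, not the actual mod_wsgi implementation:

```python
def should_restart(active_request_times, total_threads, request_timeout):
    """Apply the averaged request-timeout rule described above.

    active_request_times: seconds each currently active request has been
    running. Idle thread slots contribute 0 to the average.
    """
    average = sum(active_request_times) / float(total_threads)
    return average >= request_timeout

# One thread stuck for 45s out of 25 threads: average is 45/25 = 1.8s.
print(should_restart([45.0], 25, 45))       # False: keep running
# All 25 threads stuck for 45s: average is 45s.
print(should_restart([45.0] * 25, 25, 45))  # True: restart the process
```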
>
> On Tuesday, November 4, 2014 5:43:28 PM UTC+8, Graham Dumpleton wrote:
>
> Can you supply the main Apache error log for the same period of time? Not all information is in the virtual host specific error log, including key information which captures when an Apache restart was triggered.
>
> Based on the incomplete information from just that log, it opens the possibility that the restart was actually the outcome of the problems which arose and not the trigger. I really need that main Apache error log to be sure.
>
> As well as supplying the main Apache error log, can you indicate whether your system is set up with any of these things:
>
> 1. Any sort of backup process, be it file system backups or database backups. If there are such backup services, what time of day do they run?
>
> 2. Any sort of intensive batch update processing which loads data into any database at a specific time of day.
>
> 3. Any monitoring system which checks for the liveness of Apache and which would force a restart if it was unable to poll Apache and thus thought it was hung.
>
> The lack of a permalink from the NR UI (bottom right of pages) makes it hard to find the specific time of day for the event you are talking about because of uncertainties over timezones, but I see evidence of various other strange things going on at times. Let's see those main Apache error logs first though, and then we can properly come up with a plausible explanation of what may have triggered things.
>
> Graham
>
> On 04/11/2014, at 2:53 AM, Kelvin Wong <[email protected]> wrote:
>
> I attached the error log. See if it is enough.
>
> On Monday, November 3, 2014 8:10:00 PM UTC+8, Graham Dumpleton wrote:
>
> Can you provide further back in time and further forward in time?
>
> Basically, from when everything looked normal until everything was back to normal.
>
> I get the feeling that this has chopped off bits and that not everything I might want to see is there.
>
> Sleep time here.
Will then look tomorrow. > > Graham > > On 03/11/2014, at 10:25 PM, Kelvin Wong <[email protected]> wrote: > > I obtained log messages on production when the error happened. The log > level is set to 'info'. > > [Mon Nov 03 11:11:14.947304 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Shutdown requested 'site-2'. > [Mon Nov 03 11:11:14.947499 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Stopping process 'site-2'. > [Mon Nov 03 11:11:14.947521 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Destroying interpreters. > [Mon Nov 03 11:11:14.947536 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Cleanup interpreter ''. > [Mon Nov 03 11:11:14.951446 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Shutdown requested 'site-1'. > [Mon Nov 03 11:11:14.951672 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Stopping process 'site-1'. > [Mon Nov 03 11:11:14.951693 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Destroying interpreters. > [Mon Nov 03 11:11:14.951708 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Cleanup interpreter ''. > [Mon Nov 03 11:11:14.955690 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Shutdown requested 'site-3'. > [Mon Nov 03 11:11:14.955911 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Stopping process 'site-3'. > [Mon Nov 03 11:11:14.955933 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Destroying interpreters. > [Mon Nov 03 11:11:14.955947 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Cleanup interpreter ''. > [Mon Nov 03 11:11:14.967577 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Terminating Python. > [Mon Nov 03 11:11:14.970224 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Python has shutdown. 
> [Mon Nov 03 11:11:14.970246 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Exiting process 'site-3'. > [Mon Nov 03 11:11:14.975368 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Terminating Python. > [Mon Nov 03 11:11:14.978051 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Python has shutdown. > [Mon Nov 03 11:11:14.978072 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Exiting process 'site-1'. > [Mon Nov 03 11:11:14.983240 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Terminating Python. > [Mon Nov 03 11:11:14.986019 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Python has shutdown. > [Mon Nov 03 11:11:14.986041 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Exiting process 'site-2'. > [Mon Nov 03 11:11:18.985926 2014] [wsgi:error] [pid 1712:tid > 140701685757696] [client xx.xx.xx.xx:13183] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:11:57.511176 2014] [wsgi:error] [pid 2417:tid > 140701627008768] [client xx.xx.xx.xx:9079] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:11:59.504832 2014] [wsgi:error] [pid 2026:tid > 140701568259840] [client xx.xx.xx.xx:13194] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:12:02.893114 2014] [wsgi:error] [pid 1712:tid > 140701593437952] [client xx.xx.xx.xx:9123] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:12:15.471793 2014] [wsgi:error] [pid 2026:tid > 140701643794176] [client xx.xx.xx.xx:9145] Timeout when reading response > headers from daemon process 'site-1': > 
/home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:12:18.741610 2014] [wsgi:error] [pid 1712:tid > 140701840639744] [client xx.xx.xx.xx:13230] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:19:32.470117 2014] [wsgi:error] [pid 1712:tid > 140701840639744] [client xx.xx.xx.xx:9723] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:19:33.336583 2014] [wsgi:error] [pid 2417:tid > 140701807068928] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10126] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:36.421071 2014] [wsgi:error] [pid 1712:tid > 140701660579584] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10148] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:38.102948 2014] [wsgi:error] [pid 1712:tid > 140701677364992] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10157] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:38.186446 2014] [wsgi:error] [pid 1712:tid > 140701685757696] [client xx.xx.xx.xx:13868] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:19:42.206195 2014] [wsgi:error] [pid 1712:tid > 140701823854336] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14302] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. 
> [Mon Nov 03 11:19:42.366307 2014] [wsgi:error] [pid 3039:tid > 140701568259840] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14306] mod_wsgi (pid=3039): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:46.209203 2014] [wsgi:error] [pid 2417:tid > 140701593437952] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14308] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:47.026183 2014] [wsgi:error] [pid 2026:tid > 140701677364992] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14312] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.092478 2014] [wsgi:error] [pid 1712:tid > 140701652186880] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10163] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.505943 2014] [wsgi:error] [pid 2417:tid > 140701840639744] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14314] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.692284 2014] [wsgi:error] [pid 2417:tid > 140701719328512] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10173] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.725615 2014] [wsgi:error] [pid 2026:tid > 140701568259840] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14316] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. 
> [Mon Nov 03 11:19:52.273296 2014] [wsgi:error] [pid 2026:tid > 140701685757696] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10175] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:53.356815 2014] [wsgi:error] [pid 2026:tid > 140701823854336] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14318] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:57.364586 2014] [wsgi:error] [pid 2026:tid > 140701585045248] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10178] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:57.620572 2014] [wsgi:error] [pid 2417:tid > 140701652186880] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14324] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:01.508145 2014] [wsgi:error] [pid 2026:tid > 140701719328512] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14331] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:03.685263 2014] [wsgi:error] [pid 3039:tid > 140701576652544] [client xx.xx.xx.xx:9777] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:20:07.131191 2014] [wsgi:error] [pid 2026:tid > 140701610223360] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14337] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. 
> [Mon Nov 03 11:20:07.210243 2014] [wsgi:error] [pid 2026:tid > 140701601830656] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10185] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:10.990187 2014] [wsgi:error] [pid 2417:tid > 140701660579584] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10206] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:12.343626 2014] [wsgi:error] [pid 2026:tid > 140701807068928] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14353] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:12.383353 2014] [wsgi:error] [pid 2417:tid > 140701601830656] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10210] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:12.514056 2014] [wsgi:error] [pid 2026:tid > 140701568259840] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10213] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:13.106436 2014] [wsgi:error] [pid 2026:tid > 140701677364992] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14356] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:13.367651 2014] [wsgi:error] [pid 3039:tid > 140701668972288] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10219] mod_wsgi (pid=3039): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. 
> [Mon Nov 03 11:20:17.423692 2014] [wsgi:error] [pid 2417:tid > 140701610223360] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14359] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:17.450428 2014] [wsgi:error] [pid 3039:tid > 140701815461632] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10221] mod_wsgi (pid=3039): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:20.733221 2014] [wsgi:info] [pid 3794:tid > 140056943240960] mod_wsgi (pid=3794): Maximum requests reached, triggering > immediate shutdown 'site-1'. > [Mon Nov 03 11:20:20.733323 2014] [wsgi:info] [pid 3794:tid > 140057235269504] mod_wsgi (pid=3794): Shutdown requested 'site-1'. > [Mon Nov 03 11:20:20.733511 2014] [wsgi:info] [pid 3794:tid > 140057235269504] mod_wsgi (pid=3794): Stopping process 'site-1'. > [Mon Nov 03 11:20:20.733528 2014] [wsgi:info] [pid 3794:tid > 140057235269504] mod_wsgi (pid=3794): Destroying interpreters. > [Mon Nov 03 11:20:20.733538 2014] [wsgi:info] [pid 3794:tid > 140057235269504] mod_wsgi (pid=3794): Cleanup interpreter ''. > [Mon Nov 03 11:20:20.734308 2014] [wsgi:error] [pid 3794:tid > 140057235269504] 2014-11-03 11:20:20,734 (3794/Dummy-7) newrelic.core.agent > INFO - New Relic Python Agent Shutdown > [Mon Nov 03 11:20:20.680893 2014] [wsgi:error] [pid 1712:tid > 140701710935808] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10226] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > > Please advise. Thanks. > > > On Monday, November 3, 2014 6:43:19 PM UTC+8, Graham Dumpleton wrote: > > On 03/11/2014, at 9:35 PM, Kelvin Wong <[email protected]> wrote: > > I don't think the original error situation is caused by server overloading > as the traffic of the server on that time it's fine. > > > I didn't say it was. 
>
> I just want to rule out some conditions, as returning 503 can be caused by overloading the service.
>
> Overloading the servers can produce 503s, but those error log messages do not match the original error log message.
>
> And I am only addressing the message:
>
> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.
>
> That area of the code was rewritten in the 4.X versions at some point and introduced non-blocking connects. Looking more at that specific message, and knowing the possibility that you were load testing, I came across that difference in Linux behaviour compared to what POSIX specifies.
>
> So all I can do is address that specific point, and that is all I can do until I can get those info level messages from the error logs, so I can understand whether the mod_wsgi daemon processes are actually restarting at the same time.
>
> Graham
>
> On Monday, November 3, 2014 5:58:47 PM UTC+8, Graham Dumpleton wrote:
>
> I still want any log messages either side of that point as it may confirm what is going on, but what you might be encountering could be due to Linux behaviour when performing a connect() on a UNIX socket in non-blocking mode not being POSIX compliant. Details:
>
> http://lists.mysql.com/internals/35082
>
> The mod_wsgi code is possibly not dealing with Linux's non POSIX compliance, and instead of interpreting it the same as connection refused (when the listener socket backlog is full), it is regarding it as an outright failure.
>
> So the cause of this, if you are trying to do load testing, is that you are overwhelming the capacity of the mod_wsgi daemon processes, causing the socket listener queue to fill up. When it fills up, it is triggering the non POSIX behaviour of Linux.
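The Linux behaviour described above can be observed directly: when a UNIX stream socket's accept backlog is full, a non-blocking connect() on Linux fails with EAGAIN ("Resource temporarily unavailable", errno 11, matching the log messages) rather than behaving as POSIX would suggest. A small sketch of this, Linux-specific and using a throwaway socket path:

```python
import errno
import os
import socket
import tempfile

def connect_until_backlog_full(max_attempts=64):
    """Fill a UNIX socket's accept backlog with non-blocking connects.

    Returns the errno of the first failed connect, or None if every
    connect succeeded. The listener never calls accept(), so the
    backlog fills quickly.
    """
    path = os.path.join(tempfile.mkdtemp(), "demo.sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(0)  # minimal backlog
    clients = []
    try:
        for _ in range(max_attempts):
            c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            c.setblocking(False)
            try:
                c.connect(path)
            except OSError as exc:
                c.close()
                return exc.errno
            clients.append(c)
        return None
    finally:
        for c in clients:
            c.close()
        server.close()
        os.unlink(path)

if __name__ == "__main__":
    err = connect_until_backlog_full()
    print(errno.errorcode.get(err, err))  # EAGAIN on Linux
```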
>
> Can you confirm that you only saw this when you overwhelmed the server with more traffic than you would realistically expect it to receive, and way beyond the capacity it could handle based on how it is configured?
>
> Graham
>
> On 03/11/2014, at 8:33 PM, Graham Dumpleton <[email protected]> wrote:
>
> As I said before, I need to see messages either side of the event and with the Apache log level set to info.
>
> Graham
>
> On 03/11/2014, at 8:28 PM, Kelvin Wong <[email protected]> wrote:
>
> I tried some load testing on the server to see if the problem is caused by overloading.
>
> I found there are some other log messages when the server is overloaded:
>
> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:53.795864 2014] [wsgi:error] [pid 3724:tid 140372768429824] [client xx.xx.xx.xx:26561] Truncated or oversized response headers received from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.
>
> So I concluded that the situation is not caused by overloading the server.
>
> Do you have any directions for me to reproduce the situation?
>
> Thanks.
>
> On Monday, November 3, 2014 11:09:33 AM UTC+8, Kelvin Wong wrote:
>
> I found no cron jobs running for root or the user.
>
> According to the logrotate config, the logs rotate weekly, but the error situation happens roughly nightly.
>
> So I don't think that is the cause.
>
> On Monday, November 3, 2014 11:00:10 AM UTC+8, Graham Dumpleton wrote:
>
> And when is 'cron' configured to run logrotate? Does it correspond to when you are seeing the issue?
>
> On 03/11/2014, at 1:56 PM, Kelvin Wong <[email protected]> wrote:
>
> Yes, the memory usage is under control now, and I increased maximum-requests by 4 times.
>
> For the error:
>
> Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
>
> It came out suddenly, which is why I captured only this message. I also found nothing special in the main Apache error log.
>
> Now, I have turned the log level to 'info' in order to obtain more messages.
>
> About log file rotation, I only found the Ubuntu default rotation, which uses logrotate:
>
> /var/log/apache2/*.log {
>     weekly
>     missingok
>     rotate 52
>     compress
>     delaycompress
>     notifempty
>     create 640 root adm
>     sharedscripts
>     postrotate
>         if /etc/init.d/apache2 status > /dev/null ; then \
>             /etc/init.d/apache2 reload > /dev/null; \
>         fi;
>     endscript
>     prerotate
>         if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
>             run-parts /etc/logrotate.d/httpd-prerotate; \
>         fi; \
>     endscript
> }
>
> It does an Apache reload on postrotate. I researched a little; the reload is needed because of how Apache handles the log files.
> Some suggested adding "copytruncate" to the config to avoid the reload. Would you also suggest that?
>
> Thanks.
>
> On Monday, November 3, 2014 8:00:10 AM UTC+8, Graham Dumpleton wrote:
>
> On 02/11/2014, at 2:18 AM, Kelvin Wong <[email protected]> wrote:
>
> I have some updates on the issue, as it happened again. This time I captured some data from New Relic.
>
> From Requests.png, I found that there is a spike around 22:23, and I checked the Apache log. I found many entries like the following:
>
> Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
>
> A spike is also found on Workers.png during that time period. The graph seems to show most of the workers in the write state? After a few minutes, the following log appears again, and the server starts returning 503.
>
> ...

--
You received this message because you are subscribed to the Google Groups "modwsgi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/modwsgi. For more options, visit https://groups.google.com/d/optout.
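As a footnote to the logrotate question earlier in the thread: the "copytruncate" variant being asked about would replace the reload-based postrotate block with something like the following sketch. Note that copytruncate can lose the few log lines written between the copy and the truncate, which is the trade-off for avoiding the Apache reload:

```
/var/log/apache2/*.log {
    weekly
    missingok
    rotate 52
    compress
    delaycompress
    notifempty
    copytruncate
}
```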
