I attached the stack trace from when the request-timeout triggered. Can you give me some hints on what is happening? It seems related to the issue you mentioned.
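For reference, the manual technique on the DebuggingTechniques page linked below amounts to walking the stack frames of every running thread. A minimal sketch of that idea follows; the helper name and how you trigger it (for example from a signal handler or a maintenance URL) are assumptions for illustration, not part of mod_wsgi itself:

    import sys
    import threading
    import traceback

    def dump_stacks():
        # Map thread idents to names so the output is easier to read.
        id2name = dict((t.ident, t.name) for t in threading.enumerate())
        lines = []
        for thread_id, frame in sys._current_frames().items():
            lines.append("Thread: %s (%d)\n" % (id2name.get(thread_id, "unknown"), thread_id))
            # format_stack() renders the frame as a normal traceback listing.
            lines.extend(traceback.format_stack(frame))
            lines.append("\n")
        # Writing to stderr means the dump ends up in the Apache error log.
        sys.stderr.write("".join(lines))

The dump that mod_wsgi itself writes into the Apache error log when request-timeout fires should contain the same kind of per-thread tracebacks, so the attached stacktrace.log is the right place to start looking for the stuck thread.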
Thanks. On Wednesday, November 5, 2014 11:10:54 AM UTC+8, Graham Dumpleton wrote: > > If the request-timeout mechanism of WSGIDaemonProcess triggers, it will > attempt to log stack frames for each request thread into the Apache error > log. That will be the first thing to look at. > > The alternative is: > > > http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Extracting_Python_Stack_Traces > > but you need to be present to trigger it to dump the stack traces, which > is why going to see whether the new feature of mod_wsgi to dump them > automatically will yield results first. > > Graham > > On 05/11/2014, at 2:02 PM, Kelvin Wong <[email protected] <javascript:>> > wrote: > > Thanks for your suggestions. > I will try request-timeout and queue-timeout settings for the process > recovery. > > Apache configuration currently the Timeout directive is set to default > 300, I will set it to 60. > > But can you suggest some guides and ways to spot out the deadlocked > requests? > Can it be spotted out on code or adding some logging in the code? > > Thanks again. > > On Tuesday, November 4, 2014 8:26:38 PM UTC+8, Graham Dumpleton wrote: > > Here is what I have. > > You appear to be running daemon process groups with only a single process. > For one of those daemon process groups, the daemon process hung. > > The life time of the process is shown by the following logs. > > Daemon logging: > > [Mon Nov 03 10:39:25.824186 2014] [wsgi:info] [pid 2517:tid > 140701954910080] mod_wsgi (pid=2517): Attach interpreter ''. > [Mon Nov 03 10:39:25.824339 2014] [wsgi:info] [pid 2517:tid > 140701954910080] mod_wsgi (pid=2517): Adding '/home/ubuntu/site-1/django' > to path. > [Mon Nov 03 10:39:25.824488 2014] [wsgi:info] [pid 2517:tid > 140701954910080] mod_wsgi (pid=2517): Adding > '/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages' to path. > [Mon Nov 03 10:39:25.826052 2014] [wsgi:info] [pid 2517:tid > 140701954910080] mod_wsgi (pid=2517): Imported 'mod_wsgi'. > [Mon Nov 03 10:39:25.828107 2014] [wsgi:info] [pid 2517:tid > 140701830735616] [remote xx.xx.xx.xx:64697] mod_wsgi (pid=2517, > process='site-1', application=''): Loading WSGI script > '/home/ubuntu/site-1/apache/wsgi.py'. > [Mon Nov 03 10:39:27.205378 2014] [wsgi:error] [pid 2517:tid > 140701830735616] 2014-11-03 10:39:27,204 (2517/MainThread) > newrelic.core.agent INFO - New Relic Python Agent (2.32.0.28) > [Mon Nov 03 10:39:27.205566 2014] [wsgi:error] [pid 2517:tid > 140701637703424] 2014-11-03 10:39:27,205 (2517/Dummy-1) newrelic.core.agent > INFO - New Relic Python Agent (2.32.0.28) > [Mon Nov 03 10:39:30.871519 2014] [wsgi:error] [pid 2517:tid > 140701294765824] 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) > newrelic.core.data_collector INFO - Successfully registered New Relic > Python agent where app_name='Snaptee', pid=2517, redirect_host=u'collector- > 129.newrelic.com' and agent_run_id=1824811776, in 0.50 seconds. > [Mon Nov 03 10:39:30.871653 2014] [wsgi:error] [pid 2517:tid > 140701294765824] 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) > newrelic.core.data_collector INFO - Reporting to: > https://rpm.newrelic.com/accounts/375291/applications/2915135 > … > [Mon Nov 03 11:21:04.860039 2014] [wsgi:info] [pid 2517:tid > 140701954910080] mod_wsgi (pid=2517): Shutdown requested 'site-1'. > … > > The above logging is missing the full set of messages one would expect to > see on process shutdown. 
This means it was hung and the Apache parent process would have brutally killed it when it didn't shut down by itself.

> Apache logging:

> [Mon Nov 03 10:39:25.812190 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Starting process 'site-1-api-ssl' with uid=33, gid=33 and threads=25.
> [Mon Nov 03 10:39:25.812744 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Initializing Python.
> …
> [Mon Nov 03 11:21:04.857993 2014] [mpm_worker:notice] [pid 1237:tid 140701954910080] AH00297: SIGUSR1 received. Doing graceful restart
> [Mon Nov 03 11:21:04.858106 2014] [wsgi:info] [pid 1237] mod_wsgi (pid=2517): Process 'site-1-api-ssl' has been deregistered and will no longer be monitored.

> From our own tracking of what the New Relic agent is up to, we had a lifetime for that process of:

> 5288315 / 1824811776 ip-10-0-21-120 Snaptee 2517 2014-11-03 02:39:31 2014-11-03 03:01:34 2.32.0.28 Apache/mod_wsgi (daemon)

> Our time zone is different, but you will note that it had the process life ending at 1 minute after the hour, where the logs show it was killed at 21 minutes past the hour.

> That again confirms to me the process was hung. More significantly, whatever occurred also hung our agent background thread which does the reporting of data. If it was still running, the life of the process wouldn't have been recorded as ending prematurely. In reality data reporting would have stopped even before that time, as we don't mark it dead straight away when we stop getting data and it didn't send a proper disconnect message.

> Unfortunately I can't get anything more useful from the NR metrics, as the Lite plan starts summarising data after a day and we lose the ability to meaningfully zoom into historical data.

> So the issue is why the process hung.

> The main culprit for that in a multithreaded system is thread deadlock.

> This definitely could be an issue in your code, but I would expect that if that were the case it would only affect the request handler threads and not the NR agent background thread, as it doesn't call into your code.

> The one scenario I do know of that can explain it, and which we have experienced ourselves and even reported as a bug with tests to replicate, is an issue in the PostgreSQL client libraries which affects Python web applications that also make external HTTP requests using SSL connections. For whatever reason the use of the NR agent changes the dynamics of a web application enough that it seems to trigger the bug in PostgreSQL more easily.

> The details of that PostgreSQL bug can be found at:

> http://www.postgresql.org/message-id/[email protected]

> We are still waiting to see a fix from the PostgreSQL folks about that.

> Now, given that the process is hanging, there are a couple of things that can be done.

> The first is to set some options on the mod_wsgi daemon process group which will help with detecting a stuck process and automatically restarting it, so you don't have to get out of bed in the middle of the night and manually restart it.

> The second is to get some Python tracebacks for where it is stuck. Luckily (provided it works), the stuck process detection feature will try and log Python stack trace data for all threads on shutdown when it detects such a problem.

> The option you want to use here with WSGIDaemonProcess is request-timeout.
> The trick here is going to be to work out a good value that is appropriate for your web application and for the fact that you are using 25 threads.

> When one has 1 thread per process, request-timeout works as a hard limit.

> That is, if handling a web request takes more than the request-timeout, then the process will be restarted to eject the potentially stuck request.

> When you have more than 1 thread per process handling requests, a hard limit isn't necessarily a good idea due to the potential to kill off other active requests which are running fine. The graceful-timeout doesn't really help in this case because the stuck thread isn't going to complete anyway, so you end up just going into the more brutal shutdown sequence anyway.

> Also, if only one request gets stuck, for example on a backend service call, in your case you would still have 24 other threads, so it seems a bit excessive to restart the whole process when it could chug along quite happily anyway and the request may still complete after a while.

> How request-timeout is interpreted in a multithreaded situation is therefore that if you have set it to 45 seconds, for example, only when the average accumulated response time across all request thread slots for active requests has reached 45 seconds will the process be restarted.

> Imagine a worst-case scenario where 25 requests came in at the same time and they all managed to deadlock on this PostgreSQL issue. As they started at the same time, then 45 seconds later they will all have been running for 45 seconds and so the process will be restarted.

> Now imagine that only one thread had got stuck. After 45 seconds the process wouldn't be restarted, as the average across all request thread slots would only be 45/25 (1.8 seconds) if there were no other requests running.

> So the algorithm, for now at least, tries to deal with the fact that there may still be threads which can be used, and allows the process to run as long as possible before forcing a restart.

> If further threads start getting stuck, then as their running time grows the average will keep growing and eventually you will hit the 45 second average. This will obviously occur quite quickly if all threads start getting stuck at the same time, as seemed to happen in your case.

> My suggestion therefore is to set request-timeout on WSGIDaemonProcess. Select a timeout value (in seconds) which is above your typical worst-case response time. The 99th percentile for your requests seems to be up around 25 seconds, so perhaps start with 60 seconds for the timeout initially.

> So like with maximum-requests, we are putting in place something that will allow your system to keep going and recover, or avoid problems, while you work out the cause of why the process is hanging.

> One more thing on request-timeout: although the process will now recover, you will have a bunch of backlogged requests which will immediately get sent into the daemon process and temporarily overload it. Those requests could have hung around for quite a while and the users are likely no longer waiting. For these I would suggest setting the queue-timeout option on WSGIDaemonProcess. This will cause the backlog to be discarded with a service unavailable response, allowing the daemon process to recover more easily.

> Initially, set the queue-timeout to 45 seconds and see how it goes.
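To make that concrete, here is a minimal sketch of how those two options could look on the directive. The daemon process group name, thread count and script path are taken from the logs earlier in this thread, and the timeout values are the starting points suggested above; everything here is illustrative rather than a drop-in configuration:

    # One daemon process, 25 request threads, with the stuck-process and
    # backlog timeouts suggested in this thread.
    WSGIDaemonProcess site-1 processes=1 threads=25 request-timeout=60 queue-timeout=45
    WSGIProcessGroup site-1
    WSGIScriptAlias / /home/ubuntu/site-1/apache/wsgi.py

On the averaging behaviour described above, with 25 threads and only a single stuck request the accumulated active-request time would need to reach roughly 25 x 60 = 1500 seconds (25 minutes) before that one thread alone forced a restart, which is also why handling the queued backlog with queue-timeout matters.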
> If the problem occurs again, you may only know about it from messages in the Apache error logs about request and queue timeouts. If it occurs, keep an eye out for stack trace information in the logs as well.

> There are also various other timeouts one can set to override some defaults. Some defaults are based off the value of the Timeout directive in Apache.

> Can you tell me what your Apache configuration currently has the Timeout directive set to, or whether it is using the default? The builtin default for Timeout should be 60. Some sample configuration files override that to 300, which is a ridiculously high value. Based on some of the log messages I suspect that in your system it is set to 300.

> Depending on what you say, we may set a few other timeout values as well so as to best set up recovery for failure cases.

> Graham

> On 04/11/2014, at 8:58 PM, Kelvin Wong <[email protected]> wrote:

> I attached the main Apache error log for the same period of time.

> For the system setup:

> 1. I have no backup process for that instance. It's a web app server only.

> 2. Apart from the web application, no process connects to the database.

> 3. I use an AWS Elastic Load Balancer and the New Relic pinger to monitor the liveness of the website, not Apache specifically. All restarts are triggered manually for now.

> Thanks.

> On Tuesday, November 4, 2014 5:43:28 PM UTC+8, Graham Dumpleton wrote:

> Can you supply the main Apache error log for the same period of time? Not all information is in the virtual-host-specific error log, including key information which captures when an Apache restart was triggered.

> Based on the incomplete information from just that log, it opens the possibility that the restart was actually the outcome of problems which arose and not the trigger. I really need that main Apache error log to be sure.

> As well as supplying the main Apache error log, can you indicate whether your system is set up with any of these things:

> 1. Any sort of backup process, be it file system backups or database backups. If there are such backup services, what time of day do they run?

> 2. Any sort of intensive batch update processing which loads data into any database at a specific time of day.

> 3. Do you have any monitoring system which checks for the liveness of Apache and which would force a restart if it was unable to poll Apache and thus thought it was hung?

> The lack of a permalink from the NR UI (bottom right of pages) makes it hard to find the specific time of day for the event you are talking about because of uncertainties over timezones, but I see evidence of various other strange things going on at times. Let's see those main Apache error logs first though, and then I can properly come up with a plausible explanation of what may have triggered things.

> Graham

> On 04/11/2014, at 2:53 AM, Kelvin Wong <[email protected]> wrote:

> I attached the error log. See if it is enough.

> On Monday, November 3, 2014 8:10:00 PM UTC+8, Graham Dumpleton wrote:

> Can you provide logs from further back in time and further forward in time?

> Basically, from when everything looked normal until everything was back to normal.

> I get the feeling that this has chopped off bits and not everything I might want to see is there.

> Sleep time here. Will look tomorrow.

> Graham

> On 03/11/2014, at 10:25 PM, Kelvin Wong <[email protected]> wrote:

> I obtained log messages on production when the error happened. The log level is set to 'info'.
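As a quick aside on that last point: capturing the extra mod_wsgi info messages like the ones below only requires raising the Apache log level with a single directive (on Apache 2.4 the level can also be scoped per module if the extra global noise is unwanted). A minimal example:

    # Emit info-level messages, including the extra mod_wsgi ones, in the error log.
    LogLevel info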
> > [Mon Nov 03 11:11:14.947304 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Shutdown requested 'site-2'. > [Mon Nov 03 11:11:14.947499 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Stopping process 'site-2'. > [Mon Nov 03 11:11:14.947521 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Destroying interpreters. > [Mon Nov 03 11:11:14.947536 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Cleanup interpreter ''. > [Mon Nov 03 11:11:14.951446 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Shutdown requested 'site-1'. > [Mon Nov 03 11:11:14.951672 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Stopping process 'site-1'. > [Mon Nov 03 11:11:14.951693 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Destroying interpreters. > [Mon Nov 03 11:11:14.951708 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Cleanup interpreter ''. > [Mon Nov 03 11:11:14.955690 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Shutdown requested 'site-3'. > [Mon Nov 03 11:11:14.955911 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Stopping process 'site-3'. > [Mon Nov 03 11:11:14.955933 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Destroying interpreters. > [Mon Nov 03 11:11:14.955947 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Cleanup interpreter ''. > [Mon Nov 03 11:11:14.967577 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Terminating Python. > [Mon Nov 03 11:11:14.970224 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Python has shutdown. > [Mon Nov 03 11:11:14.970246 2014] [wsgi:info] [pid 4476:tid > 139644458678144] mod_wsgi (pid=4476): Exiting process 'site-3'. > [Mon Nov 03 11:11:14.975368 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Terminating Python. > [Mon Nov 03 11:11:14.978051 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Python has shutdown. > [Mon Nov 03 11:11:14.978072 2014] [wsgi:info] [pid 4477:tid > 139644458678144] mod_wsgi (pid=4477): Exiting process 'site-1'. > [Mon Nov 03 11:11:14.983240 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Terminating Python. > [Mon Nov 03 11:11:14.986019 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Python has shutdown. > [Mon Nov 03 11:11:14.986041 2014] [wsgi:info] [pid 4478:tid > 139644458678144] mod_wsgi (pid=4478): Exiting process 'site-2'. 
> [Mon Nov 03 11:11:18.985926 2014] [wsgi:error] [pid 1712:tid > 140701685757696] [client xx.xx.xx.xx:13183] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:11:57.511176 2014] [wsgi:error] [pid 2417:tid > 140701627008768] [client xx.xx.xx.xx:9079] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:11:59.504832 2014] [wsgi:error] [pid 2026:tid > 140701568259840] [client xx.xx.xx.xx:13194] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:12:02.893114 2014] [wsgi:error] [pid 1712:tid > 140701593437952] [client xx.xx.xx.xx:9123] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:12:15.471793 2014] [wsgi:error] [pid 2026:tid > 140701643794176] [client xx.xx.xx.xx:9145] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:12:18.741610 2014] [wsgi:error] [pid 1712:tid > 140701840639744] [client xx.xx.xx.xx:13230] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:19:32.470117 2014] [wsgi:error] [pid 1712:tid > 140701840639744] [client xx.xx.xx.xx:9723] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:19:33.336583 2014] [wsgi:error] [pid 2417:tid > 140701807068928] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10126] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:36.421071 2014] [wsgi:error] [pid 1712:tid > 140701660579584] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10148] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:38.102948 2014] [wsgi:error] [pid 1712:tid > 140701677364992] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10157] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:38.186446 2014] [wsgi:error] [pid 1712:tid > 140701685757696] [client xx.xx.xx.xx:13868] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:19:42.206195 2014] [wsgi:error] [pid 1712:tid > 140701823854336] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14302] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:42.366307 2014] [wsgi:error] [pid 3039:tid > 140701568259840] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14306] mod_wsgi (pid=3039): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:46.209203 2014] [wsgi:error] [pid 2417:tid > 140701593437952] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14308] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. 
> [Mon Nov 03 11:19:47.026183 2014] [wsgi:error] [pid 2026:tid > 140701677364992] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14312] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.092478 2014] [wsgi:error] [pid 1712:tid > 140701652186880] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10163] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.505943 2014] [wsgi:error] [pid 2417:tid > 140701840639744] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14314] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.692284 2014] [wsgi:error] [pid 2417:tid > 140701719328512] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10173] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.725615 2014] [wsgi:error] [pid 2026:tid > 140701568259840] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14316] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:52.273296 2014] [wsgi:error] [pid 2026:tid > 140701685757696] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10175] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:53.356815 2014] [wsgi:error] [pid 2026:tid > 140701823854336] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14318] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:57.364586 2014] [wsgi:error] [pid 2026:tid > 140701585045248] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10178] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:57.620572 2014] [wsgi:error] [pid 2417:tid > 140701652186880] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14324] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:01.508145 2014] [wsgi:error] [pid 2026:tid > 140701719328512] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14331] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:03.685263 2014] [wsgi:error] [pid 3039:tid > 140701576652544] [client xx.xx.xx.xx:9777] Timeout when reading response > headers from daemon process 'site-1': > /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:20:07.131191 2014] [wsgi:error] [pid 2026:tid > 140701610223360] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14337] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:07.210243 2014] [wsgi:error] [pid 2026:tid > 140701601830656] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10185] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. 
> [Mon Nov 03 11:20:10.990187 2014] [wsgi:error] [pid 2417:tid > 140701660579584] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10206] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:12.343626 2014] [wsgi:error] [pid 2026:tid > 140701807068928] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14353] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:12.383353 2014] [wsgi:error] [pid 2417:tid > 140701601830656] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10210] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:12.514056 2014] [wsgi:error] [pid 2026:tid > 140701568259840] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10213] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:13.106436 2014] [wsgi:error] [pid 2026:tid > 140701677364992] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14356] mod_wsgi (pid=2026): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:13.367651 2014] [wsgi:error] [pid 3039:tid > 140701668972288] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10219] mod_wsgi (pid=3039): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:17.423692 2014] [wsgi:error] [pid 2417:tid > 140701610223360] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:14359] mod_wsgi (pid=2417): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:17.450428 2014] [wsgi:error] [pid 3039:tid > 140701815461632] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10221] mod_wsgi (pid=3039): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:20.733221 2014] [wsgi:info] [pid 3794:tid > 140056943240960] mod_wsgi (pid=3794): Maximum requests reached, triggering > immediate shutdown 'site-1'. > [Mon Nov 03 11:20:20.733323 2014] [wsgi:info] [pid 3794:tid > 140057235269504] mod_wsgi (pid=3794): Shutdown requested 'site-1'. > [Mon Nov 03 11:20:20.733511 2014] [wsgi:info] [pid 3794:tid > 140057235269504] mod_wsgi (pid=3794): Stopping process 'site-1'. > [Mon Nov 03 11:20:20.733528 2014] [wsgi:info] [pid 3794:tid > 140057235269504] mod_wsgi (pid=3794): Destroying interpreters. > [Mon Nov 03 11:20:20.733538 2014] [wsgi:info] [pid 3794:tid > 140057235269504] mod_wsgi (pid=3794): Cleanup interpreter ''. > [Mon Nov 03 11:20:20.734308 2014] [wsgi:error] [pid 3794:tid > 140057235269504] 2014-11-03 11:20:20,734 (3794/Dummy-7) newrelic.core.agent > INFO - New Relic Python Agent Shutdown > [Mon Nov 03 11:20:20.680893 2014] [wsgi:error] [pid 1712:tid > 140701710935808] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:10226] mod_wsgi (pid=1712): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'. > > Please advise. Thanks. > > > On Monday, November 3, 2014 6:43:19 PM UTC+8, Graham Dumpleton wrote: > > On 03/11/2014, at 9:35 PM, Kelvin Wong <[email protected]> wrote: > > I don't think the original error situation is caused by server overloading > as the traffic of the server on that time it's fine. 
> I didn't say it was.

> I just want to rule out some conditions, as returning 503 can be caused by overloading the service.

> Overloading the servers can produce 503, but the error log messages do not match the original error log message.

> And I am only addressing the message:

> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.

> That area of the code was rewritten in 4.X versions at some point and introduced non-blocking connects. Looking more at that specific message, and knowing the possibility that you were load testing, I came across that difference in Linux behaviour compared to what POSIX specifies.

> So all I can do is address that specific point until I can get those info-level messages from the error logs, so I can understand whether the mod_wsgi daemon processes are actually restarting at the same time.

> Graham

> On Monday, November 3, 2014 5:58:47 PM UTC+8, Graham Dumpleton wrote:

> I still want any log messages either side of that point as it may confirm what is going on, but what you might be encountering could be due to Linux behaviour, when performing a connect() on a UNIX socket in non-blocking mode, not being POSIX compliant. Details:

> http://lists.mysql.com/internals/35082

> The mod_wsgi code is possibly not dealing with Linux's non-POSIX compliance and, instead of interpreting it the same as connection refused (when the listener socket backlog is full), it is regarding it as an outright failure.

> So the cause of this, if you are doing load testing, is that you are overwhelming the capacity of the mod_wsgi daemon processes, causing the socket listener queue to fill up. When it fills up, it triggers the non-POSIX behaviour of Linux.

> Can you confirm that you only saw this when you overwhelmed the server with more traffic than you would realistically expect it to receive, and way beyond the capacity it could handle based on how it is configured?

> Graham

> On 03/11/2014, at 8:33 PM, Graham Dumpleton <[email protected]> wrote:

> As I said before, I need to see messages either side of the event and with the Apache log level set to info.

> Graham

> On 03/11/2014, at 8:28 PM, Kelvin Wong <[email protected]> wrote:

> I tried some load testing on the server to see if the problem is caused by overloading.

> I found that some other log messages appear when the server is overloaded:

> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:53.795864 2014] [wsgi:error] [pid 3724:tid 140372768429824] [client xx.xx.xx.xx:26561] Truncated or oversized response headers received from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.

> From that I concluded that the original situation was not caused by overloading the server.

> Do you have any directions for how I can reproduce the situation?

> Thanks.

> On Monday, November 3, 2014 11:09:33 AM UTC+8, Kelvin Wong wrote:

> I found no cron jobs running under root or my user.
> According to the logrotate config, the logs rotate weekly.
> But the error situation seems to happen nightly.

> So I don't think that is the cause.

> On Monday, November 3, 2014 11:00:10 AM UTC+8, Graham Dumpleton wrote:

> And when is 'cron' configured to run logrotate? Does it correspond to when you are seeing the issue?

> On 03/11/2014, at 1:56 PM, Kelvin Wong <[email protected]> wrote:

> Yes, the memory usage is under control now and I increased maximum-requests by 4 times.

> The error

> Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py

> came out suddenly, which is why I captured only this message. I also found nothing special in the main Apache error log.

> Now I have turned the log level to 'info' in order to obtain more messages.

> About log file rotation, I only found the Ubuntu default one, which uses logrotate:

> /var/log/apache2/*.log {
>         weekly
>         missingok
>         rotate 52
>         compress
>         delaycompress
>         notifempty
>         create 640 root adm
>         sharedscripts
>         postrotate
>                 if /etc/init.d/apache2 status > /dev/null ; then \
>                     /etc/init.d/apache2 reload > /dev/null; \
>                 fi;
>         endscript
>         prerotate
>                 if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
>                     run-parts /etc/logrotate.d/httpd-prerotate; \
>                 fi; \

> ...
stacktrace.log
Description: Binary data
