Thanks a lot.

I used Django's connection pooling feature to avoid making database 
connections frequently.


On Wednesday, November 5, 2014 6:33:29 PM UTC+8, Graham Dumpleton wrote:
>
> If all the request threads are reported as being stuck in the PostgreSQL 
> connect() call, then it would appear to be the same issue.
>
> The problem is caused by a bug in the libpq client library for PostgreSQL. 
> That there is a problem has been acknowledged by the PostgreSQL developers 
> in:
>
> http://www.postgresql.org/message-id/[email protected]
>
> Unfortunately the PostgreSQL developers have no public issue tracker that 
> I know of, so I have no idea of the status of the issue or whether a fix 
> has been released yet. Given that it was only acknowledged recently, likely 
> not. Even if they did come out with a fix, since Linux distributions do not 
> update packages for past releases except for security fixes, if you are 
> using a packaged PostgreSQL you will likely never see the fix while you 
> remain on the older Linux version.
>
> As to the scenario which causes the problem, it arises when you are 
> talking to your PostgreSQL database over an SSL connection while, at the 
> same time, other code within your application in parallel threads is making 
> separate SSL connections out to backend web services or remote web sites.
>
> So if for example any of your HTTP call outs to Instagram, Facebook or 
> Amazon S3 are over SSL connections, you can hit this problem. That the NR 
> agent also makes HTTPS call outs once a minute will add to the possibility 
> that you could hit the bug in PostgreSQL, but the issue can arise 
> independent of whether you are using the NR agent.
>
> Getting down to the actual bug in PostgreSQL: when the client library 
> doesn't already have an SSL connection to the database open, the first time 
> it makes one it does some initialisation in the SSL libraries, registering 
> callbacks which allow it to manage locks on shared resources. It keeps 
> these callbacks in place so long as it has at least one active SSL 
> connection. When all the SSL connections to the database are closed, it 
> deregisters its callbacks; they are registered again the next time it 
> creates a database connection.
>
> The problem arises where a separate SSL connection is created for an HTTP 
> request or something else, and its lifetime starts while there is an 
> active database SSL connection but finishes after the last database SSL 
> connection has closed.
>
> In this situation, the HTTP request over SSL still goes through the 
> callbacks set up by PostgreSQL to do the locking it wants, but by the time 
> the HTTP request completes, PostgreSQL has removed those callbacks, so they 
> aren't called and the locks that were acquired at the start of the HTTP 
> request are never released. The next time an SSL connection is created by 
> PostgreSQL, it sets up its callbacks again and tries to acquire the locks, 
> but since they were never released at the end of the HTTP request, it 
> deadlocks instead. Over time, as each request handler runs, this happens to 
> every thread and so the whole process gets stuck.
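>
> To make the interaction concrete, here is a tiny illustrative Python 
> analogue of that sequence. This is not the actual libpq/OpenSSL code, just 
> the same pattern expressed with a plain threading.Lock and a callback table:
>
>     import threading
>
>     ssl_lock = threading.Lock()   # a lock inside the SSL library
>     callbacks = {}                # lock/unlock callbacks registered by the DB client
>
>     def lib_lock():
>         cb = callbacks.get('lock')
>         if cb:
>             cb()
>
>     def lib_unlock():
>         cb = callbacks.get('unlock')
>         if cb:
>             cb()
>
>     # The DB client opens an SSL database connection and registers callbacks.
>     callbacks['lock'] = ssl_lock.acquire
>     callbacks['unlock'] = ssl_lock.release
>
>     lib_lock()         # a parallel HTTPS request takes the lock via the callback
>     callbacks.clear()  # the last DB connection closes, callbacks deregistered
>     lib_unlock()       # the HTTPS request finishes, but nothing releases the lock
>
>     # The DB client reconnects, re-registers callbacks and tries to lock again.
>     callbacks['lock'] = ssl_lock.acquire
>     print(ssl_lock.acquire(False))   # False: still held, so the real code blocks here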
>
> Overall it is a pretty evil bug, so I am not sure why it hasn't been 
> noticed before. It does explain, though, cases which have popped up over the 
> years where mod_wsgi users' web applications were deadlocking, but where we 
> never got sufficient information out of people to track it down.
>
> What can you do?
>
> If you can't avoid using SSL connections for outbound HTTP 
> connections from your application, it is going to be hard to avoid the 
> problem completely.
>
> There are perhaps a couple of things you could do to lessen the 
> possibility of it occurring though.
>
> The first is not to use SSL connections for HTTP calls where you don't 
> absolutely require them. Obviously this usually isn't going to be possible 
> though.
>
> You might for example configure the NR agent not to use SSL connections by 
> setting:
>
> ssl = false
>
> in the agent configuration file.
>
> This does mean that data goes in the clear. For metric data this isn't 
> going to be a problem as it is generally innocuous. The only area of 
> concern may be error details, which can contain Python traceback 
> information.
>
> Making that change for the NR agent isn't going to help you with 
> Instagram, Facebook and Amazon S3 if those must be over SSL.
>
> The other thing you can do, if using Django, is to use a new enough 
> Django version that you can use Django's database connection pooling 
> feature. This helps because normally Django creates a new database 
> connection per request, so there are many windows during which there is no 
> active database connection, and an HTTP request over an SSL connection can 
> overlap the point at which the database connection was closed.
>
> So by using the database connection pooling mechanism in Django, the 
> database connections will be kept active for the life of the whole process, 
> thereby pretty well eliminating the chance of it occurring.
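>
> For example, what Django 1.6+ provides here is persistent connections via 
> the CONN_MAX_AGE database option, which gives the effect described above. 
> A minimal sketch, where the database name and credentials are purely 
> illustrative placeholders:
>
>     # settings.py (sketch): keep database connections open for the life of
>     # the process rather than opening and closing one per request.
>     DATABASES = {
>         'default': {
>             'ENGINE': 'django.db.backends.postgresql_psycopg2',
>             'NAME': 'mydb',
>             'USER': 'myuser',
>             'PASSWORD': 'secret',
>             'HOST': 'db.example.com',
>             'CONN_MAX_AGE': None,   # None = unlimited persistent connections
>         }
>     }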
>
> Graham
>
> On 05/11/2014, at 6:45 PM, Kelvin Wong <[email protected]> wrote:
>
> I attached the stack trace from when the request-timeout triggered.
> Can you give me some hints on what is happening?
> It seems related to the issue you mentioned.
>
> Thanks.
>
> On Wednesday, November 5, 2014 11:10:54 AM UTC+8, Graham Dumpleton wrote:
>
> If the request-timeout mechanism of WSGIDaemonProcess triggers, it will 
> attempt to log stack frames for each request thread into the Apache error 
> log. That will be the first thing to look at.
>
> The alternative is:
>
>
> http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Extracting_Python_Stack_Traces
>
> but you need to be present to trigger it to dump the stack traces, which 
> is why it is worth seeing first whether the new feature of mod_wsgi to dump 
> them automatically will yield results.
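>
> The idea on that page is to dump the current Python stack of every thread 
> in the process. A minimal sketch of that technique, independent of how you 
> trigger it (signal handler, special URL, etc.):
>
>     import sys
>     import traceback
>
>     def dump_all_thread_stacks(out=sys.stderr):
>         # Write the current Python stack of every thread to the given file.
>         for thread_id, frame in sys._current_frames().items():
>             out.write('Thread %s:\n' % thread_id)
>             traceback.print_stack(frame, file=out)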
>
> Graham
>
> On 05/11/2014, at 2:02 PM, Kelvin Wong <[email protected]> wrote:
>
> Thanks for your suggestions.
> I will try request-timeout and queue-timeout settings for the process 
> recovery.
>
> In the Apache configuration the Timeout directive is currently set to the 
> default of 300; I will set it to 60.
>
> But can you suggest some guides or ways to spot the deadlocked 
> requests?
> Can they be spotted in the code, or by adding some logging to the code?
>
> Thanks again.
>
> On Tuesday, November 4, 2014 8:26:38 PM UTC+8, Graham Dumpleton wrote:
>
> Here is what I have.
>
> You appear to be running daemon process groups with only a single process. 
> For one of those daemon process groups, the daemon process hung.
>
> The lifetime of the process is shown by the following logs.
>
> Daemon logging:
>
> [Mon Nov 03 10:39:25.824186 2014] [wsgi:info] [pid 2517:tid 
> 140701954910080] mod_wsgi (pid=2517): Attach interpreter ''.
> [Mon Nov 03 10:39:25.824339 2014] [wsgi:info] [pid 2517:tid 
> 140701954910080] mod_wsgi (pid=2517): Adding '/home/ubuntu/site-1/django' 
> to path.
> [Mon Nov 03 10:39:25.824488 2014] [wsgi:info] [pid 2517:tid 
> 140701954910080] mod_wsgi (pid=2517): Adding 
> '/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages' to path.
> [Mon Nov 03 10:39:25.826052 2014] [wsgi:info] [pid 2517:tid 
> 140701954910080] mod_wsgi (pid=2517): Imported 'mod_wsgi'.
> [Mon Nov 03 10:39:25.828107 2014] [wsgi:info] [pid 2517:tid 
> 140701830735616] [remote xx.xx.xx.xx:64697] mod_wsgi (pid=2517, 
> process='site-1', application=''): Loading WSGI script 
> '/home/ubuntu/site-1/apache/wsgi.py'.
> [Mon Nov 03 10:39:27.205378 2014] [wsgi:error] [pid 2517:tid 
> 140701830735616] 2014-11-03 10:39:27,204 (2517/MainThread) 
> newrelic.core.agent INFO - New Relic Python Agent (2.32.0.28)
> [Mon Nov 03 10:39:27.205566 2014] [wsgi:error] [pid 2517:tid 
> 140701637703424] 2014-11-03 10:39:27,205 (2517/Dummy-1) newrelic.core.agent 
> INFO - New Relic Python Agent (2.32.0.28)
> [Mon Nov 03 10:39:30.871519 2014] [wsgi:error] [pid 2517:tid 
> 140701294765824] 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) 
> newrelic.core.data_collector INFO - Successfully registered New Relic 
> Python agent where app_name='Snaptee', pid=2517, redirect_host=u'collector-
> 129.newrelic.com' and agent_run_id=1824811776, in 0.50 seconds.
> [Mon Nov 03 10:39:30.871653 2014] [wsgi:error] [pid 2517:tid 
> 140701294765824] 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) 
> newrelic.core.data_collector INFO - Reporting to: 
> https://rpm.newrelic.com/accounts/375291/applications/2915135
> …
> [Mon Nov 03 11:21:04.860039 2014] [wsgi:info] [pid 2517:tid 
> 140701954910080] mod_wsgi (pid=2517): Shutdown requested 'site-1'.
> …
>
> The above logging is missing the full set of messages one would expect to 
> see on process shutdown. This means it was hung and the Apache parent 
> process would have brutally killed it when it didn't shut down by itself.
>
> Apache logging:
>
> [Mon Nov 03 10:39:25.812190 2014] [wsgi:info] [pid 2517:tid 
> 140701954910080] mod_wsgi (pid=2517): Starting process 'site-1-api-ssl' 
> with uid=33, gid=33 and threads=25.
> [Mon Nov 03 10:39:25.812744 2014] [wsgi:info] [pid 2517:tid 
> 140701954910080] mod_wsgi (pid=2517): Initializing Python.
> […
> [Mon Nov 03 11:21:04.857993 2014] [mpm_worker:notice] [pid 1237:tid 
> 140701954910080] AH00297: SIGUSR1 received.  Doing graceful restart
> [Mon Nov 03 11:21:04.858106 2014] [wsgi:info] [pid 1237] mod_wsgi 
> (pid=2517): Process 'site-1-api-ssl' has been deregistered and will no 
> longer be monitored.
>
> From our own tracking of what the New Relic agent is up to, we had a 
> lifetime for that process of:
>
> 5288315 / 1824811776 ip-10-0-21-120 Snaptee 2517 2014-11-03 02:39:31 
> 2014-11-03 
> 03:01:34 2.32.0.28 Apache/mod_wsgi (daemon)
>
> Our time zone is different, but you will note that it has the process's 
> life ending at 1 minute after the hour, whereas the logs show it was killed 
> at 21 minutes past the hour.
>
> This again confirms to me that the process was hung. More significantly, 
> whatever occurred also hung our agent background thread, which does the 
> reporting of data. If it had still been running, the life of the process 
> wouldn't have been recorded as ending prematurely. In reality data 
> reporting would have stopped even before that time, as we don't mark a 
> process dead straight away when we stop getting data and it didn't send a 
> proper disconnect message.
>
> Unfortunately I can't get anything more useful from the NR metrics, as the 
> Lite plan starts summarising data after a day and you lose the ability to 
> meaningfully zoom into historical data.
>
> So the issue is why the process hung.
>
> The main culprit for that in a multithreaded system is thread deadlock.
>
> This definitely could be an issue in your code, but if that were the case 
> I would expect it to affect only the request handler threads and not the 
> NR agent background thread, as the latter doesn't call into your code.
>
> The one scenario I do know of that can explain it, which we have 
> experienced ourselves and even reported as a bug with tests to replicate, 
> is an issue in the PostgreSQL client libraries affecting Python web 
> applications which also make external HTTP requests using SSL connections. 
> For whatever reason the use of the NR agent changes the dynamics of a web 
> application enough that it seems to trigger the bug in PostgreSQL more 
> easily.
>
> The details of that PostgreSQL bug can be found at:
>
> http://www.postgresql.org/message-id/[email protected]
>
> We are still waiting to see a fix from the PostgreSQL folks about that.
>
> Now, given that the process is hanging, there are a couple of things that 
> can be done.
>
> The first is to set some options on the mod_wsgi daemon process group 
> which will help with detecting a stuck process and automatically restarting 
> it so you don't have to get out of bed in the middle of the night and 
> manually restart it.
>
> The second is to get some Python tracebacks for where it is stuck. Luckily 
> (provided it works), the stuck process detection feature will try and log 
> Python stack trace data for all threads on shutdown when it detects such a 
> problem.
>
> The option you want to use here with WSGIDaemonProcess is request-timeout. 
> The trick here is going to be to work out a good value that is appropriate 
> for your web application and the fact you are using 25 threads.
>
> When one has 1 thread per process, the request-timeout works as a hard 
> limit.
>
> That is, when handling a web request takes more than the request-timeout, 
> the process will be restarted to eject the potentially stuck request.
>
> When you have more than one thread per process handling requests, a hard 
> limit isn't necessarily a good idea due to the potential to kill off other 
> active requests which are running fine. The graceful-timeout doesn't really 
> help in this case because the stuck thread isn't going to complete anyway, 
> so you end up just going into the more brutal shutdown sequence regardless.
>
> Also, if only one request gets stuck, for example on a backend service 
> call, in your case you would still have 24 other threads, so it seems a bit 
> excessive to restart the whole process when it could chug along quite 
> happily and the request may still complete after a while.
>
> How request-timeout is therefore interpreted in a multithreaded situation 
> is that if you have set it to 45 seconds, for example, the process will 
> only be restarted once the accumulated response time of active requests, 
> averaged across all request thread slots, has reached 45 seconds.
>
> Imagine a worst case scenario where 25 requests came in at the same time 
> and they all managed to deadlock on this PostgreSQL issue. As they started 
> at the same time, then 45 seconds later, they will all have been running 
> for 45 seconds and so the process will be restarted.
>
> Now imagine though that only one thread had got stuck; after 45 seconds 
> the process wouldn't be restarted, as the average across all request thread 
> slots would be 45/25 if there were no other requests running.
>
> So the algorithm, for now at least, tries to deal with the fact that there 
> may still be threads which can be used, and allows the process to run as 
> long as possible before forcing a restart.
>
> If further threads start getting stuck and as their running time grows, 
> then the average will keep growing and eventually you will hit the 45 
> second average. This will obviously occur quite quickly if all threads 
> start getting stuck at the same time as seemed to happen in your case.
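>
> Expressed as a sketch of the rule just described (an illustration of the 
> behaviour, not mod_wsgi's actual source):
>
>     def should_restart(active_request_times, threads, request_timeout):
>         # Restart once the running times of active requests, averaged over
>         # ALL request thread slots, reach the configured request-timeout.
>         return sum(active_request_times) / float(threads) >= request_timeout
>
>     print(should_restart([45.0] * 25, 25, 45))   # all 25 threads stuck: True
>     print(should_restart([45.0], 25, 45))        # one thread stuck: 45/25 = 1.8, False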
>
> My suggestion therefore is to set request-timeout on WSGIDaemonProcess. 
> Select a timeout value (in seconds), which is above your typical worst case 
> scenario for response time. The 99th percentile for your requests seems to 
> be up around 25 seconds, so perhaps start with 60 seconds for the timeout 
> initially.
>
> So like with maximum-requests, we are putting in place something that will 
> allow your system to keep going and recover, or avoid problems, while you 
> work out the cause of why the process is hanging.
>
> As for request-timeout: although the process will now recover, you will 
> have a bunch of backlogged requests which will immediately get sent into 
> the daemon process and temporarily overload it. Those requests could have 
> hung around for quite a while and the users are likely no longer waiting. 
> For these I would suggest setting the queue-timeout option on 
> WSGIDaemonProcess. This will cause the backlog to be discarded with a 
> service unavailable response, allowing the daemon process to recover more 
> readily.
>
> Initially, set the queue-timeout to 45 seconds and see how it goes.
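>
> Putting the two suggestions together, the relevant options on the directive 
> would look something like the following. The group name here stands in for 
> your existing daemon process group, the thread count matches your logs, the 
> timeout values are the starting points suggested above, and any other 
> options already on your existing directive should be left as they are:
>
>     WSGIDaemonProcess site-1 threads=25 request-timeout=60 queue-timeout=45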
>
> If the problem occurs again, you may only know by messages in Apache error 
> logs about request and queue timeouts. If it occurs, keep an eye out for 
> stack trace information as well in the logs.
>
> There are also various other timeouts one can set to override some 
> defaults. Some defaults are based on the value of the Timeout directive in 
> Apache.
>
> Can you tell me what your Apache configuration currently has the Timeout 
> directive set to, or whether it is using the default? The built-in default 
> for Timeout should be 60. Some sample configuration files override that to 
> be 300, which is a ridiculously high value. Based on some of the log 
> messages I suspect that in your system it is set to 300.
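>
> For reference, the directive sits in the main Apache configuration and, at 
> the built-in default mentioned above, would simply read:
>
>     Timeout 60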
>
> Depending on what you say, we may set a few other timeout values as well 
> so as to best set up recovery for failure cases.
>
> Graham
>
> On 04/11/2014, at 8:58 PM, Kelvin Wong <[email protected]> wrote:
>
> I attached the main Apache error log for the same period of time.
>
>
> For the system setup:
> 1. I have no backup process for that instance. It's a web app server only.
>
> 2. Apart from the web application, no process connects to the database.
>
> 3. I use an AWS Elastic Load Balancer and the New Relic pinger to monitor 
> the liveness of the website, not Apache specifically. All restarts will be 
> triggered manually for now.
>
>
> Thanks.
>
>
> On Tuesday, November 4, 2014 5:43:28 PM UTC+8, Graham Dumpleton wrote:
>
> Can you supply the main Apache error log for the same period of time? Not 
> all information is in the virtual host specific error log, including key 
> information which captures when an Apache restart was triggered.
>
> Based on the incomplete information from just that log, it opens the 
> possibility that the restart was actually the outcome of problems which 
> arose and not the trigger. I really need that main Apache error log to be 
> sure.
>
> As well as supplying the main Apache error log, can you indicate whether 
> your system is set up with any of these things.
>
> 1. Any sort of backup process, be it file system backups or database 
> backups. If there are such backup services, what time of day do they run.
>
> 2. Any sort of intensive batch update processing which loads data into any 
> database at a specific time of day.
>
> 3. Do you have any monitoring system which checks for the liveness of 
> Apache and which would force a restart if it was unable to poll Apache and 
> thus thought it was hung?
>
> Lack of a permalink from the NR UI (bottom right of pages) makes it hard 
> to find the specific time of day for the event you are talking about 
> because of uncertainties over timezones, but I see evidence of various 
> other strange things going on at times. Let's see those main Apache error 
> logs first though, and then we can properly come up with a plausible 
> explanation of what may have triggered things.
>
> Graham
>
> On 04/11/2014, at 2:53 AM, Kelvin Wong <[email protected]> wrote:
>
> I attached the error log. See if it is enough.
>
>
> On Monday, November 3, 2014 8:10:00 PM UTC+8, Graham Dumpleton wrote:
>
> Can you provide logs from further back in time and further forward in time?
>
> Basically, from when everything looked normal until everything was back to 
> normal.
>
> I get the feeling that this has chopped off bits and that not everything I 
> might want to see is there.
>
> Sleep time here. Will then look tomorrow.
>
> Graham
>
> On 03/11/2014, at 10:25 PM, Kelvin Wong <[email protected]> wrote:
>
> I obtained log messages on production when the error happened. The log 
> level is set to 'info'.
>
> [Mon Nov 03 11:11:14.947304 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Shutdown requested 'site-2'.
> [Mon Nov 03 11:11:14.947499 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Stopping process 'site-2'.
> [Mon Nov 03 11:11:14.947521 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Destroying interpreters.
> [Mon Nov 03 11:11:14.947536 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Cleanup interpreter ''.
> [Mon Nov 03 11:11:14.951446 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Shutdown requested 'site-1'.
> [Mon Nov 03 11:11:14.951672 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Stopping process 'site-1'.
> [Mon Nov 03 11:11:14.951693 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Destroying interpreters.
> [Mon Nov 03 11:11:14.951708 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Cleanup interpreter ''.
> [Mon Nov 03 11:11:14.955690 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Shutdown requested 'site-3'.
> [Mon Nov 03 11:11:14.955911 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Stopping process 'site-3'.
> [Mon Nov 03 11:11:14.955933 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Destroying interpreters.
> [Mon Nov 03 11:11:14.955947 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Cleanup interpreter ''.
> [Mon Nov 03 11:11:14.967577 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Terminating Python.
> [Mon Nov 03 11:11:14.970224 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Python has shutdown.
> [Mon Nov 03 11:11:14.970246 2014] [wsgi:info] [pid 4476:tid 
> 139644458678144] mod_wsgi (pid=4476): Exiting process 'site-3'.
> [Mon Nov 03 11:11:14.975368 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Terminating Python.
> [Mon Nov 03 11:11:14.978051 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Python has shutdown.
> [Mon Nov 03 11:11:14.978072 2014] [wsgi:info] [pid 4477:tid 
> 139644458678144] mod_wsgi (pid=4477): Exiting process 'site-1'.
> [Mon Nov 03 11:11:14.983240 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Terminating Python.
> [Mon Nov 03 11:11:14.986019 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Python has shutdown.
> [Mon Nov 03 11:11:14.986041 2014] [wsgi:info] [pid 4478:tid 
> 139644458678144] mod_wsgi (pid=4478): Exiting process 'site-2'.
> [Mon Nov 03 11:11:18.985926 2014] [wsgi:error] [pid 1712:tid 
> 140701685757696] [client xx.xx.xx.xx:13183] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:11:57.511176 2014] [wsgi:error] [pid 2417:tid 
> 140701627008768] [client xx.xx.xx.xx:9079] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:11:59.504832 2014] [wsgi:error] [pid 2026:tid 
> 140701568259840] [client xx.xx.xx.xx:13194] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:12:02.893114 2014] [wsgi:error] [pid 1712:tid 
> 140701593437952] [client xx.xx.xx.xx:9123] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:12:15.471793 2014] [wsgi:error] [pid 2026:tid 
> 140701643794176] [client xx.xx.xx.xx:9145] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:12:18.741610 2014] [wsgi:error] [pid 1712:tid 
> 140701840639744] [client xx.xx.xx.xx:13230] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:19:32.470117 2014] [wsgi:error] [pid 1712:tid 
> 140701840639744] [client xx.xx.xx.xx:9723] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:19:33.336583 2014] [wsgi:error] [pid 2417:tid 
> 140701807068928] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10126] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:36.421071 2014] [wsgi:error] [pid 1712:tid 
> 140701660579584] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10148] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:38.102948 2014] [wsgi:error] [pid 1712:tid 
> 140701677364992] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10157] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:38.186446 2014] [wsgi:error] [pid 1712:tid 
> 140701685757696] [client xx.xx.xx.xx:13868] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:19:42.206195 2014] [wsgi:error] [pid 1712:tid 
> 140701823854336] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14302] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:42.366307 2014] [wsgi:error] [pid 3039:tid 
> 140701568259840] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14306] mod_wsgi (pid=3039): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:46.209203 2014] [wsgi:error] [pid 2417:tid 
> 140701593437952] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14308] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:47.026183 2014] [wsgi:error] [pid 2026:tid 
> 140701677364992] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14312] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:49.092478 2014] [wsgi:error] [pid 1712:tid 
> 140701652186880] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10163] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:49.505943 2014] [wsgi:error] [pid 2417:tid 
> 140701840639744] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14314] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:49.692284 2014] [wsgi:error] [pid 2417:tid 
> 140701719328512] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10173] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:49.725615 2014] [wsgi:error] [pid 2026:tid 
> 140701568259840] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14316] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:52.273296 2014] [wsgi:error] [pid 2026:tid 
> 140701685757696] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10175] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:53.356815 2014] [wsgi:error] [pid 2026:tid 
> 140701823854336] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14318] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:57.364586 2014] [wsgi:error] [pid 2026:tid 
> 140701585045248] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10178] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:19:57.620572 2014] [wsgi:error] [pid 2417:tid 
> 140701652186880] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14324] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:01.508145 2014] [wsgi:error] [pid 2026:tid 
> 140701719328512] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14331] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:03.685263 2014] [wsgi:error] [pid 3039:tid 
> 140701576652544] [client xx.xx.xx.xx:9777] Timeout when reading response 
> headers from daemon process 'site-1': 
> /home/ubuntu/snaptee-production/apache/wsgi.py
> [Mon Nov 03 11:20:07.131191 2014] [wsgi:error] [pid 2026:tid 
> 140701610223360] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14337] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:07.210243 2014] [wsgi:error] [pid 2026:tid 
> 140701601830656] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10185] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:10.990187 2014] [wsgi:error] [pid 2417:tid 
> 140701660579584] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10206] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:12.343626 2014] [wsgi:error] [pid 2026:tid 
> 140701807068928] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14353] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:12.383353 2014] [wsgi:error] [pid 2417:tid 
> 140701601830656] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10210] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:12.514056 2014] [wsgi:error] [pid 2026:tid 
> 140701568259840] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10213] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:13.106436 2014] [wsgi:error] [pid 2026:tid 
> 140701677364992] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14356] mod_wsgi (pid=2026): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:13.367651 2014] [wsgi:error] [pid 3039:tid 
> 140701668972288] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10219] mod_wsgi (pid=3039): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:17.423692 2014] [wsgi:error] [pid 2417:tid 
> 140701610223360] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:14359] mod_wsgi (pid=2417): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:17.450428 2014] [wsgi:error] [pid 3039:tid 
> 140701815461632] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10221] mod_wsgi (pid=3039): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
> [Mon Nov 03 11:20:20.733221 2014] [wsgi:info] [pid 3794:tid 
> 140056943240960] mod_wsgi (pid=3794): Maximum requests reached, triggering 
> immediate shutdown 'site-1'.
> [Mon Nov 03 11:20:20.733323 2014] [wsgi:info] [pid 3794:tid 
> 140057235269504] mod_wsgi (pid=3794): Shutdown requested 'site-1'.
> [Mon Nov 03 11:20:20.733511 2014] [wsgi:info] [pid 3794:tid 
> 140057235269504] mod_wsgi (pid=3794): Stopping process 'site-1'.
> [Mon Nov 03 11:20:20.733528 2014] [wsgi:info] [pid 3794:tid 
> 140057235269504] mod_wsgi (pid=3794): Destroying interpreters.
> [Mon Nov 03 11:20:20.733538 2014] [wsgi:info] [pid 3794:tid 
> 140057235269504] mod_wsgi (pid=3794): Cleanup interpreter ''.
> [Mon Nov 03 11:20:20.734308 2014] [wsgi:error] [pid 3794:tid 
> 140057235269504] 2014-11-03 11:20:20,734 (3794/Dummy-7) newrelic.core.agent 
> INFO - New Relic Python Agent Shutdown
> [Mon Nov 03 11:20:20.680893 2014] [wsgi:error] [pid 1712:tid 
> 140701710935808] (11)Resource temporarily unavailable: [client 
> xx.xx.xx.xx:10226] mod_wsgi (pid=1712): Unable to connect to WSGI daemon 
> process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
>
> Please advise. Thanks.
>
>
> On Monday, November 3, 2014 6:43:19 PM UTC+8, Graham Dumpleton wrote:
>
> On 03/11/2014, at 9:35 PM, Kelvin Wong <[email protected]> wrote:
>
> I don't think the original error situation was caused by server 
> overloading, as the traffic on the server at that time was fine.
>
>
> I didn't say it was.
>
> I just want to rule out some conditions, as returning 503 can be caused by 
> overloading the service.
>
> Overloading the servers can produce 503s, but those error log messages do 
> not match the original error log message.
>
>
> And I am only addressing the message:
>
> 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] 
> [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: 
> [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI 
> daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.
>
> That area of the code was rewritten in the 4.X versions at some point and 
> introduced non-blocking connects. Looking more at that specific message, 
> and knowing the possibility that you were load testing, I came across that 
> difference in Linux behaviour compared to what POSIX specifies.
>
> So all I can do is address that specific point, and that is all I can do 
> until I can get those info-level messages from the error logs, so I can 
> understand whether the mod_wsgi daemon processes are actually restarting at 
> the same time.
>
> Graham
>
>
> On Monday, November 3, 2014 5:58:47 PM UTC+8, Graham Dumpleton wrote:
>
> I still want any log messages either side of that point as they may 
> confirm what is going on, but what you might be encountering could be due 
> to Linux behaviour when performing a connect() on a UNIX socket in 
> non-blocking mode not being POSIX compliant. Details:
>
> http://lists.mysql.com/internals/35082
> ...

