If all the request threads reported as being stuck in the PostgreSQL connect() call, then it would appear to be the same issue.
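As a side note, if you ever want to grab those per-thread stacks yourself without waiting for the request-timeout mechanism to fire, the usual trick (the same idea as the wiki page mentioned below) is to dump the stack of every thread from inside the process. A rough sketch of the idea only; the function name is made up and this is not the mod_wsgi implementation:

    import sys
    import threading
    import traceback

    def dump_all_thread_stacks(log=sys.stderr):
        # Map thread idents back to thread names so the output is readable.
        names = dict((t.ident, t.name) for t in threading.enumerate())
        for thread_id, frame in sys._current_frames().items():
            log.write('Thread %s (%s):\n' % (thread_id, names.get(thread_id, 'unknown')))
            traceback.print_stack(frame, file=log)
            log.write('\n')

If every request thread shows a frame down inside the database client connect call, that matches the scenario described below.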
The problem is caused by a bug in the libpq client library for PostgreSQL.

That there is a problem has been acknowledged by the PostgreSQL developers in:

http://www.postgresql.org/message-id/[email protected]

Unfortunately the PostgreSQL developers have no public issue tracker that I know of, so I have no idea of the status of the issue, whether they have a fix, or whether one has been released yet. Given that it was only acknowledged recently, likely not. Even if they had come out with a fix, Linux distributions generally do not update packages in past releases except for security fixes, so if you are using a packaged PostgreSQL client library you will likely never see the fix while you are using the older Linux version.

As to the scenario which causes the problem, it arises when you are talking to your PostgreSQL database over an SSL connection and, at the same time, other code within your application in parallel threads is making separate SSL connections out to backend web services or remote web sites. So if, for example, any of your HTTP call outs to Instagram, Facebook or Amazon S3 are over SSL connections, you can hit this problem. That the NR agent also makes HTTPS call outs once a minute will add to the possibility that you could hit the bug in PostgreSQL, but the issue can arise independent of whether you are using the NR agent.

Getting down to the actual bug in PostgreSQL: when libpq is about to make an SSL connection and doesn't already have an SSL connection to the database open, it first does some initialisation in the SSL library and registers callbacks which the SSL library uses to manage locks around shared resources. It keeps those callbacks in place so long as it has at least one active SSL connection. When all the SSL connections to the database are closed, it deregisters its callbacks. They would be registered again the next time it creates a database connection.

The problem arises where a separate SSL connection, for a HTTP request or something else, is started while there is an active database SSL connection, but finishes after the last database SSL connection has closed. In this situation the HTTP request over SSL will still go into the callbacks set up by PostgreSQL to do the locking it wants, but by the time the HTTP request completes, PostgreSQL has removed those callbacks, so they aren't called and the locks that were acquired at the start of the HTTP request are never released. The next time an SSL connection is created by PostgreSQL, it will set up its callbacks again and try to acquire the locks again, but since they were never released at the end of the HTTP request, it deadlocks instead. Over time, as each request handler thread runs, this happens to all the threads and so the whole process gets stuck.

Overall it is a pretty evil bug, so I am not sure why it hasn't been noticed before. It does explain, though, cases which have popped up over the years where mod_wsgi users' web applications were deadlocking, but where I never got sufficient information out of people to track it down.

What can you do? If you can't avoid using SSL connections for outbound HTTP connections from your application, it is going to be hard to avoid the problem completely. There are a couple of things you could do to lessen the possibility of it occurring though.

The first is to not use SSL connections for HTTP calls where you don't absolutely require it. Obviously this usually isn't going to be possible though.
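To be concrete about the sequence that gets you into trouble, here is a rough sketch of the overlapping pattern described above. It is purely illustrative, not something to run as is: psycopg2 and requests just stand in for any libpq based driver and any other SSL client in the same process, and the connection details are placeholders.

    import threading

    import psycopg2   # any libpq based driver; libpq registers the SSL locking callbacks
    import requests   # any other code in the process making its own SSL connections

    def call_backend_over_https():
        # This SSL connection starts while a database SSL connection is open...
        requests.get('https://api.example.com/resource')

    def handle_request():
        # First database SSL connection in the process: libpq installs its callbacks.
        conn = psycopg2.connect('dbname=example sslmode=require')

        worker = threading.Thread(target=call_backend_over_https)
        worker.start()

        # Last database SSL connection closes: libpq removes its callbacks.
        conn.close()

        # ...but the HTTP request finishes after they are gone, so a lock taken
        # via those callbacks is never released.
        worker.join()

        # The next connect re-registers the callbacks and then deadlocks trying
        # to acquire the lock that was never released.
        psycopg2.connect('dbname=example sslmode=require')

Anything which avoids that overlap, either by not using SSL for the outbound HTTP calls, or by making sure there is always at least one open database SSL connection, reduces the chance of hitting it, which is what the suggestions below are about.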
You might for example configure the NR agent not to use SSL connections by setting:

ssl = false

in the agent configuration file. This does mean that data goes in the clear. For metric data this isn't going to be a problem as it is generally innocuous. The only area of concern may be error details, which can contain Python traceback information. Making that change for the NR agent isn't going to help you with Instagram, Facebook and Amazon S3 if they must be SSL.

The other thing you can do, if using Django, is to use a new enough Django version (1.6 or later) that you can use its persistent database connections (the CONN_MAX_AGE database setting). This will help because normally Django creates a new database connection per request, so there are plenty of windows during which there is no active database connection, and an HTTP request over an SSL connection can overlap the point at which the database connection was closed. By keeping database connections active for the life of the whole process, you pretty well eliminate the probability that it can occur.

Graham

On 05/11/2014, at 6:45 PM, Kelvin Wong <[email protected]> wrote:

> I attached the stack trace when the request-timeout triggered. > Can you give me some hints on what it is happening? > It seemed related to the issue you mentioned. > > Thanks. > > On Wednesday, November 5, 2014 11:10:54 AM UTC+8, Graham Dumpleton wrote: > If the request-timeout mechanism of WSGIDaemonProcess triggers, it will > attempt to log stack frames for each request thread into the Apache error > log. That will be the first thing to look at. > > The alternative is: > > http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Extracting_Python_Stack_Traces > > but you need to be present to trigger it to dump the stack traces, which is > why going to see whether the new feature of mod_wsgi to dump them > automatically will yield results first. > > Graham > > On 05/11/2014, at 2:02 PM, Kelvin Wong <[email protected]> wrote: > > Thanks for your suggestions. > I will try request-timeout and queue-timeout settings for the process > recovery. > > Apache configuration currently the Timeout directive is set to default 300, I > will set it to 60. > > But can you suggest some guides and ways to spot out the deadlocked requests? > Can it be spotted out on code or adding some logging in the code? > > Thanks again. > > On Tuesday, November 4, 2014 8:26:38 PM UTC+8, Graham Dumpleton wrote: > Here is what I have. > > You appear to be running daemon process groups with only a single process. > For one of those daemon process groups, the daemon process hung. > > The life time of the process is shown by the following logs. > > Daemon logging: > > [Mon Nov 03 10:39:25.824186 2014] [wsgi:info] [pid 2517:tid 140701954910080] > mod_wsgi (pid=2517): Attach interpreter ''. > [Mon Nov 03 10:39:25.824339 2014] [wsgi:info] [pid 2517:tid 140701954910080] > mod_wsgi (pid=2517): Adding '/home/ubuntu/site-1/django' to path. > [Mon Nov 03 10:39:25.824488 2014] [wsgi:info] [pid 2517:tid 140701954910080] > mod_wsgi (pid=2517): Adding > '/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages' to path. > [Mon Nov 03 10:39:25.826052 2014] [wsgi:info] [pid 2517:tid 140701954910080] > mod_wsgi (pid=2517): Imported 'mod_wsgi'.
> [Mon Nov 03 10:39:25.828107 2014] [wsgi:info] [pid 2517:tid 140701830735616] > [remote xx.xx.xx.xx:64697] mod_wsgi (pid=2517, process='site-1', > application=''): Loading WSGI script '/home/ubuntu/site-1/apache/wsgi.py'. > [Mon Nov 03 10:39:27.205378 2014] [wsgi:error] [pid 2517:tid 140701830735616] > 2014-11-03 10:39:27,204 (2517/MainThread) newrelic.core.agent INFO - New > Relic Python Agent (2.32.0.28) > [Mon Nov 03 10:39:27.205566 2014] [wsgi:error] [pid 2517:tid 140701637703424] > 2014-11-03 10:39:27,205 (2517/Dummy-1) newrelic.core.agent INFO - New Relic > Python Agent (2.32.0.28) > [Mon Nov 03 10:39:30.871519 2014] [wsgi:error] [pid 2517:tid 140701294765824] > 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) > newrelic.core.data_collector INFO - Successfully registered New Relic Python > agent where app_name='Snaptee', pid=2517, > redirect_host=u'collector-129.newrelic.com' and agent_run_id=1824811776, in > 0.50 seconds. > [Mon Nov 03 10:39:30.871653 2014] [wsgi:error] [pid 2517:tid 140701294765824] > 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) > newrelic.core.data_collector INFO - Reporting to: > https://rpm.newrelic.com/accounts/375291/applications/2915135 > … > [Mon Nov 03 11:21:04.860039 2014] [wsgi:info] [pid 2517:tid 140701954910080] > mod_wsgi (pid=2517): Shutdown requested 'site-1'. > … > > The above logging is missing the full set of messages one would expect to see > on process shutdown. This means it was hung and the Apache parent process > would have brutally killed it when it didn't shutdown itself. > > Apache logging: > > [Mon Nov 03 10:39:25.812190 2014] [wsgi:info] [pid 2517:tid 140701954910080] > mod_wsgi (pid=2517): Starting process 'site-1-api-ssl' with uid=33, gid=33 > and threads=25. > [Mon Nov 03 10:39:25.812744 2014] [wsgi:info] [pid 2517:tid 140701954910080] > mod_wsgi (pid=2517): Initializing Python. > [… > [Mon Nov 03 11:21:04.857993 2014] [mpm_worker:notice] [pid 1237:tid > 140701954910080] AH00297: SIGUSR1 received. Doing graceful restart > [Mon Nov 03 11:21:04.858106 2014] [wsgi:info] [pid 1237] mod_wsgi (pid=2517): > Process 'site-1-api-ssl' has been deregistered and will no longer be > monitored. > > From our own tracking of what the New Relic agent is up to, we had a lifetime > for that process of: > > 5288315 / 1824811776 ip-10-0-21-120 Snaptee 2517 2014-11-03 02:39:31 > 2014-11-03 03:01:34 2.32.0.28 Apache/mod_wsgi (daemon) > > Our time zone is different, but you will note that it had the process life > ending at 1 minute after the hour where logs show it was killed at 21 minutes > past the hour. > > The again confirms to me the process was hung. More significantly, whatever > occurred also hung our agent background thread which does reporting of data. > If it was still running then the life of the process wouldn't have been > recorded as ending prematurely. In reality data reporting would have stopped > even before that time as we don't mark it dead straight away when we stop > getting data and it didn't send a proper disconnect message. > > Unfortunately can't get anything more useful from NR metrics as the Lite plan > will start summarising data after a day and loose ability to meaningfully > zoom into historical data. > > So the issue is why the process hung. > > The main culprit for that in a multithreaded system is thread deadlock. 
> > This definitely could be an issue in your code but would expect if that was > the case that would only affect the request handler threads and not the NR > agent background thread as it doesn't call into your code. > > The one scenario I do know of that can explain it and we have experienced > ourselves and even reported it as a bug with tests to replicate, is an issue > in PostgreSQL client libraries which affect Python web applications which > also do external HTTP requests using SSL connections. For whatever reason the > use of the NR agent changes the dynamics of a web application enough that it > seems to trigger the bug in PostgreSQL more easily. > > The details of that PostgreSQL bug can be found at: > > http://www.postgresql.org/message-id/[email protected] > > We are still waiting to see a fix from the PostgreSQL folks about that. > > Now, given that the process is hanging, there are a couple of things that can > be done. > > The first is to set some options on the mod_wsgi daemon process group which > will help with detecting a stuck process and automatically restarting it so > you don't have to get out of bed in the middle of the night and manually > restart it. > > The second is to get some Python tracebacks for where it is stuck. Luckily > (provided it works), the stuck process detection feature will try and log > Python stack trace data for all threads on shutdown when it detects such a > problem. > > The option you want to use here with WSGIDaemonProcess is request-timeout. > The trick here is going to be to work out a good value that is appropriate > for your web application and the fact you are using 25 threads. > > When one has 1 thread per process, the request-timeout works as a hard limit. > > That is, when handling a web request, it takes more than the request-timeout > then the process will be restarted to eject the potentially stuck request. > > When you have more than one 1 thread per process handling request, a hard > limit isn't necessarily a good idea due to the potential to kill off other > active requests which are running fine. The graceful-timeout doesn't really > help in this case because the stuck thread isn't going to complete anyway, so > you end up just going into the more brutal shutdown sequence anyway. > > Also, if only one request gets stuck, for example on a backend service call, > in your case you would still have 24 other threads, so seems a bit excessive > to restart the whole process when it could chug along quite happily anyway > and the request may still complete after a while. > > How request-timeout is interpreted therefore in a multi threaded situation is > that if you have set it to 45 seconds for example, when the average > accumulated response time across all request thread slots for active requests > has reached 45 seconds, only then will it restart the process. > > Imagine a worst case scenario where 25 requests came in at the same time and > they all managed to deadlock on this PostgreSQL issue. As they started at the > same time, then 45 seconds later, they will all have been running for 45 > seconds and so the process will be restarted. > > Now imagine though that only one thread had got stuck, after 45 seconds the > process wouldn't be restarted as the average across all request thread slots > would be 45/25 if there were no other requests running. 
> > So the algorithm for now at least, tries to deal with the fact that there may > still be threads which can be used and allow the process to run as long as > possible before forcing a restart. > > If further threads start getting stuck and as their running time grows, then > the average will keep growing and eventually you will hit the 45 second > average. This will obviously occur quite quickly if all threads start getting > stuck at the same time as seemed to happen in your case. > > My suggestion therefore is to set request-timeout on WSGIDaemonProcess. > Select a timeout value (in seconds), which is above your typical worst case > scenario for response time. The 99th percentile for your requests seems to be > up around 25 seconds, so perhaps start with 60 seconds for the timeout > initially. > > So like with maximum-requests, we are putting in place something that will > allow your system to keep going and recover, or avoid problems, while you > work out the cause of why the process is hanging. > > For the request-timeout one. Although the process will now recover, you will > have a bunch of backlogged requests which will immediately get sent into the > daemon process and temporarily overload it. Those requests could have hung > around for quite a while and likely users no longer waiting. For these I > would suggest setting queue-timeout option to WSGIDaemonProcess. This will > cause the backlog to be discarded with a service unavailable allowing the > daemon process to better recover. > > Initially, set the queue-timeout to be 45 seconds and can see how it goes. > > If the problem occurs again, you may only know by messages in Apache error > logs about request and queue timeouts. If it occurs, keep an eye out for > stack trace information as well in the logs. > > There are also various other time outs one can set to override some defaults. > Some defaults are based off the value of the Timeout directive in Apache. > > Can you tell me what your Apache configuration currently has the Timeout > directive set to, or whether it is using the default. The builtin default for > Timeout should be 60. Some same configuration files override that to be 300 > which is a ridiculously high value. Based on some of the log messages I > suspect in your system it is set to 300. > > Depending on what you say, may set a few other timeout values as well so as > to best setup recovery for failure cases. > > Graham > > On 04/11/2014, at 8:58 PM, Kelvin Wong <[email protected]> wrote: > > I attached the main Apache error log for the same period of time. > > > For the system setup, > 1. I have no backup process for that instance. it's a web app server only. > > 2. Except the web application, no process connecting to database. > > 3. I used AWS Elastic load balancer and New Relic pinger to monitor the > liveness of the website, not specified for Apache. All restart will be > triggered manually for now. > > > Thanks. > > > On Tuesday, November 4, 2014 5:43:28 PM UTC+8, Graham Dumpleton wrote: > Can you supply the main Apache error log for the same period of time? Not all > information is in the virtual host specific error log, including key > information which captures when an Apache restart was triggered. > > Based on the incomplete information from just that log, it opens the > possibility that the restart was actually the outcome of problems which arose > and not the trigger. I really need that main Apache error log to be sure. 
> > As well as supplying the main Apache error log, can you indicate whether your > system is set up with any of these things. > > 1. Any sort of backup process, be it file system backups or database backups. > If there are such backup services, what time of day do they run. > > 2. Any sort of intensive batch update processing which loads data into any > database at a specific time of day. > > 3. Do you have any monitoring system which checks for the liveness of Apache > and which would force a restart if it was unable to poll Apache and thus > though it was hung. > > Lack of a permalink from NR UI (bottom right of pages) makes it hard to find > the specific time of day for the event you are talking about because of > uncertainties of timezones, but I see evidence of various other strange > things going on at times. Lets see those main Apache error logs first though > and then can properly come up with a plausible explanation of what may have > triggered things. > > Graham > > On 04/11/2014, at 2:53 AM, Kelvin Wong <[email protected]> wrote: > > I attached the error log. See it is enough. > > > On Monday, November 3, 2014 8:10:00 PM UTC+8, Graham Dumpleton wrote: > Can you provide further back in time and further forward in time. > > Basically, from when everything looked normal until everything was back to > normal. > > I get the feel that this has chopped off bits and not everything I might want > to see is there. > > Sleep time here. Will then look tomorrow. > > Graham > > On 03/11/2014, at 10:25 PM, Kelvin Wong <[email protected]> wrote: > > I obtained log messages on production when the error happened. The log level > is set to 'info'. > > [Mon Nov 03 11:11:14.947304 2014] [wsgi:info] [pid 4478:tid 139644458678144] > mod_wsgi (pid=4478): Shutdown requested 'site-2'. > [Mon Nov 03 11:11:14.947499 2014] [wsgi:info] [pid 4478:tid 139644458678144] > mod_wsgi (pid=4478): Stopping process 'site-2'. > [Mon Nov 03 11:11:14.947521 2014] [wsgi:info] [pid 4478:tid 139644458678144] > mod_wsgi (pid=4478): Destroying interpreters. > [Mon Nov 03 11:11:14.947536 2014] [wsgi:info] [pid 4478:tid 139644458678144] > mod_wsgi (pid=4478): Cleanup interpreter ''. > [Mon Nov 03 11:11:14.951446 2014] [wsgi:info] [pid 4477:tid 139644458678144] > mod_wsgi (pid=4477): Shutdown requested 'site-1'. > [Mon Nov 03 11:11:14.951672 2014] [wsgi:info] [pid 4477:tid 139644458678144] > mod_wsgi (pid=4477): Stopping process 'site-1'. > [Mon Nov 03 11:11:14.951693 2014] [wsgi:info] [pid 4477:tid 139644458678144] > mod_wsgi (pid=4477): Destroying interpreters. > [Mon Nov 03 11:11:14.951708 2014] [wsgi:info] [pid 4477:tid 139644458678144] > mod_wsgi (pid=4477): Cleanup interpreter ''. > [Mon Nov 03 11:11:14.955690 2014] [wsgi:info] [pid 4476:tid 139644458678144] > mod_wsgi (pid=4476): Shutdown requested 'site-3'. > [Mon Nov 03 11:11:14.955911 2014] [wsgi:info] [pid 4476:tid 139644458678144] > mod_wsgi (pid=4476): Stopping process 'site-3'. > [Mon Nov 03 11:11:14.955933 2014] [wsgi:info] [pid 4476:tid 139644458678144] > mod_wsgi (pid=4476): Destroying interpreters. > [Mon Nov 03 11:11:14.955947 2014] [wsgi:info] [pid 4476:tid 139644458678144] > mod_wsgi (pid=4476): Cleanup interpreter ''. > [Mon Nov 03 11:11:14.967577 2014] [wsgi:info] [pid 4476:tid 139644458678144] > mod_wsgi (pid=4476): Terminating Python. > [Mon Nov 03 11:11:14.970224 2014] [wsgi:info] [pid 4476:tid 139644458678144] > mod_wsgi (pid=4476): Python has shutdown. 
> [Mon Nov 03 11:11:14.970246 2014] [wsgi:info] [pid 4476:tid 139644458678144] > mod_wsgi (pid=4476): Exiting process 'site-3'. > [Mon Nov 03 11:11:14.975368 2014] [wsgi:info] [pid 4477:tid 139644458678144] > mod_wsgi (pid=4477): Terminating Python. > [Mon Nov 03 11:11:14.978051 2014] [wsgi:info] [pid 4477:tid 139644458678144] > mod_wsgi (pid=4477): Python has shutdown. > [Mon Nov 03 11:11:14.978072 2014] [wsgi:info] [pid 4477:tid 139644458678144] > mod_wsgi (pid=4477): Exiting process 'site-1'. > [Mon Nov 03 11:11:14.983240 2014] [wsgi:info] [pid 4478:tid 139644458678144] > mod_wsgi (pid=4478): Terminating Python. > [Mon Nov 03 11:11:14.986019 2014] [wsgi:info] [pid 4478:tid 139644458678144] > mod_wsgi (pid=4478): Python has shutdown. > [Mon Nov 03 11:11:14.986041 2014] [wsgi:info] [pid 4478:tid 139644458678144] > mod_wsgi (pid=4478): Exiting process 'site-2'. > [Mon Nov 03 11:11:18.985926 2014] [wsgi:error] [pid 1712:tid 140701685757696] > [client xx.xx.xx.xx:13183] Timeout when reading response headers from daemon > process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:11:57.511176 2014] [wsgi:error] [pid 2417:tid 140701627008768] > [client xx.xx.xx.xx:9079] Timeout when reading response headers from daemon > process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:11:59.504832 2014] [wsgi:error] [pid 2026:tid 140701568259840] > [client xx.xx.xx.xx:13194] Timeout when reading response headers from daemon > process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:12:02.893114 2014] [wsgi:error] [pid 1712:tid 140701593437952] > [client xx.xx.xx.xx:9123] Timeout when reading response headers from daemon > process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:12:15.471793 2014] [wsgi:error] [pid 2026:tid 140701643794176] > [client xx.xx.xx.xx:9145] Timeout when reading response headers from daemon > process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:12:18.741610 2014] [wsgi:error] [pid 1712:tid 140701840639744] > [client xx.xx.xx.xx:13230] Timeout when reading response headers from daemon > process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:19:32.470117 2014] [wsgi:error] [pid 1712:tid 140701840639744] > [client xx.xx.xx.xx:9723] Timeout when reading response headers from daemon > process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:19:33.336583 2014] [wsgi:error] [pid 2417:tid 140701807068928] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10126] mod_wsgi > (pid=2417): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:36.421071 2014] [wsgi:error] [pid 1712:tid 140701660579584] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10148] mod_wsgi > (pid=1712): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:38.102948 2014] [wsgi:error] [pid 1712:tid 140701677364992] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10157] mod_wsgi > (pid=1712): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. 
> [Mon Nov 03 11:19:38.186446 2014] [wsgi:error] [pid 1712:tid 140701685757696] > [client xx.xx.xx.xx:13868] Timeout when reading response headers from daemon > process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:19:42.206195 2014] [wsgi:error] [pid 1712:tid 140701823854336] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14302] mod_wsgi > (pid=1712): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:42.366307 2014] [wsgi:error] [pid 3039:tid 140701568259840] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14306] mod_wsgi > (pid=3039): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:46.209203 2014] [wsgi:error] [pid 2417:tid 140701593437952] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14308] mod_wsgi > (pid=2417): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:47.026183 2014] [wsgi:error] [pid 2026:tid 140701677364992] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14312] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.092478 2014] [wsgi:error] [pid 1712:tid 140701652186880] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10163] mod_wsgi > (pid=1712): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.505943 2014] [wsgi:error] [pid 2417:tid 140701840639744] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14314] mod_wsgi > (pid=2417): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.692284 2014] [wsgi:error] [pid 2417:tid 140701719328512] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10173] mod_wsgi > (pid=2417): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:49.725615 2014] [wsgi:error] [pid 2026:tid 140701568259840] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14316] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:52.273296 2014] [wsgi:error] [pid 2026:tid 140701685757696] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10175] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:53.356815 2014] [wsgi:error] [pid 2026:tid 140701823854336] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14318] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:57.364586 2014] [wsgi:error] [pid 2026:tid 140701585045248] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10178] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:19:57.620572 2014] [wsgi:error] [pid 2417:tid 140701652186880] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14324] mod_wsgi > (pid=2417): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. 
> [Mon Nov 03 11:20:01.508145 2014] [wsgi:error] [pid 2026:tid 140701719328512] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14331] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:03.685263 2014] [wsgi:error] [pid 3039:tid 140701576652544] > [client xx.xx.xx.xx:9777] Timeout when reading response headers from daemon > process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py > [Mon Nov 03 11:20:07.131191 2014] [wsgi:error] [pid 2026:tid 140701610223360] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14337] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:07.210243 2014] [wsgi:error] [pid 2026:tid 140701601830656] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10185] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:10.990187 2014] [wsgi:error] [pid 2417:tid 140701660579584] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10206] mod_wsgi > (pid=2417): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:12.343626 2014] [wsgi:error] [pid 2026:tid 140701807068928] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14353] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:12.383353 2014] [wsgi:error] [pid 2417:tid 140701601830656] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10210] mod_wsgi > (pid=2417): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:12.514056 2014] [wsgi:error] [pid 2026:tid 140701568259840] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10213] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:13.106436 2014] [wsgi:error] [pid 2026:tid 140701677364992] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14356] mod_wsgi > (pid=2026): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:13.367651 2014] [wsgi:error] [pid 3039:tid 140701668972288] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10219] mod_wsgi > (pid=3039): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:17.423692 2014] [wsgi:error] [pid 2417:tid 140701610223360] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14359] mod_wsgi > (pid=2417): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:17.450428 2014] [wsgi:error] [pid 3039:tid 140701815461632] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10221] mod_wsgi > (pid=3039): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > [Mon Nov 03 11:20:20.733221 2014] [wsgi:info] [pid 3794:tid 140056943240960] > mod_wsgi (pid=3794): Maximum requests reached, triggering immediate shutdown > 'site-1'. > [Mon Nov 03 11:20:20.733323 2014] [wsgi:info] [pid 3794:tid 140057235269504] > mod_wsgi (pid=3794): Shutdown requested 'site-1'. 
> [Mon Nov 03 11:20:20.733511 2014] [wsgi:info] [pid 3794:tid 140057235269504] > mod_wsgi (pid=3794): Stopping process 'site-1'. > [Mon Nov 03 11:20:20.733528 2014] [wsgi:info] [pid 3794:tid 140057235269504] > mod_wsgi (pid=3794): Destroying interpreters. > [Mon Nov 03 11:20:20.733538 2014] [wsgi:info] [pid 3794:tid 140057235269504] > mod_wsgi (pid=3794): Cleanup interpreter ''. > [Mon Nov 03 11:20:20.734308 2014] [wsgi:error] [pid 3794:tid 140057235269504] > 2014-11-03 11:20:20,734 (3794/Dummy-7) newrelic.core.agent INFO - New Relic > Python Agent Shutdown > [Mon Nov 03 11:20:20.680893 2014] [wsgi:error] [pid 1712:tid 140701710935808] > (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10226] mod_wsgi > (pid=1712): Unable to connect to WSGI daemon process 'site-1' on > '/var/run/apache2/wsgi.1237.0.4.sock'. > > Please advise. Thanks. > > > On Monday, November 3, 2014 6:43:19 PM UTC+8, Graham Dumpleton wrote: > On 03/11/2014, at 9:35 PM, Kelvin Wong <[email protected]> wrote: > > I don't think the original error situation is caused by server overloading as > the traffic of the server on that time it's fine. > > I didn't say it was. > > I just want to rule out some conditions as returning 503 can be caused by > overloading the service. > Overloading the servers can produce 503 but the error log messages do not > match with the original error log message. > > And I am only addressing the message: > > 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] [pid > 4615:tid 140372718073600] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'. > > That area of the code was rewritten in 4.X versions at some point and > introduced non blocking connects. Looking more at that specific message and > knowing the possibility that you were load testing I came across that > difference in Linux behaviour compared to what POSIX specifies. > > So all I can do is address that specific point and that it is all I can do > until I can get those info level messages from the error logs so I can > understand whether the mod_wsgi daemon processes are actually restarting at > the same time. > > Graham > > > On Monday, November 3, 2014 5:58:47 PM UTC+8, Graham Dumpleton wrote: > I still want any log messages either side of that point as it may confirm > what is going on, but what you might be encountering could be due to Linux > behaviour when performing a connect() on a UNIX socket in non blocking mode > not being POSIX compliant. Details: > > http://lists.mysql.com/internals/35082 > > The mod_wsgi code is possibly not dealing with Linux's non POSIX compliance > and instead of interpreting it the same as connection refused (when the > listener socket backlog is full), it is regarding it as an outright failure. > > So the cause of this if you are trying to do load testing is that you are > overwhelming the capacity of the mod_wsgi daemon processes causing the socket > listener queue to fill up. When it fills up then it is triggering the non > POSIX behaviour of Linux. > > Can you confirm that you only saw this when you overwhelmed the server with > more traffic that you would realistically expect it to receive and way beyond > the capacity it could handle based on how it is configured? 
> > Graham > > On 03/11/2014, at 8:33 PM, Graham Dumpleton <[email protected]> wrote: > > As I said before, I need to see messages either side of the event and with > Apache log level set to info. > > Graham > > On 03/11/2014, at 8:28 PM, Kelvin Wong <[email protected]> wrote: > > I tried some load testing on the server to see if it is caused by overloading. > > I found there will be some other log message when the server is overloading. > 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:53.795864 2014] [wsgi:error] [pid > 3724:tid 140372768429824] [client xx.xx.xx.xx:26561] Truncated or oversized > response headers received from daemon process 'site-1': > /home/ubuntu/site-1/apache/wsgi.py > 2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] [pid > 4615:tid 140372718073600] (11)Resource temporarily unavailable: [client > xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI daemon > process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'. > > Then I concluded that the situation is not caused by overloading the server. > > Do you have any directions for me to reproduce the situation? > > Thanks. > > > On Monday, November 3, 2014 11:09:33 AM UTC+8, Kelvin Wong wrote: > I found no cron is running on root and user. > > According to the logrotate config, the log rotate weekly. > But the error situation is likely happened nightly. > > So I think it will not be my concern. > > > On Monday, November 3, 2014 11:00:10 AM UTC+8, Graham Dumpleton wrote: > And when is 'cron' configured to run logrotate? Does it correspond to when > you are seeing the issue? > > On 03/11/2014, at 1:56 PM, Kelvin Wong <[email protected]> wrote: > > Yes, the memory usage is under control now and I increased maximum-requests > by 4 times. > > For the error, > Timeout when reading response headers from daemon process 'site-1': > /home/ubuntu/site-1/apache/wsgi.py > came out as sudden, that's why I capture this message only. I also found main > Apache error log nothing special. > > Now, I turned the log level to 'info' in order to obtain more messages. > > About the log file rotation, I only found the ubuntu default rotation one > which is using logrotate. > /var/log/apache2/*.log { > weekly > missingok > rotate 52 > compress > delaycompress > notifempty > create 640 root adm > sharedscripts > postrotate > if /etc/init.d/apache2 status > /dev/null ; then \ > /etc/init.d/apache2 reload > /dev/null; \ > fi; > endscript > prerotate > if [ -d /etc/logrotate.d/httpd-prerotate ]; then \ > run-parts /etc/logrotate.d/httpd-prerotate; \ > fi; \ > > ... > > -- > You received this message because you are subscribed to the Google Groups > "modwsgi" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at http://groups.google.com/group/modwsgi. > For more options, visit https://groups.google.com/d/optout. > <stacktrace.log> -- You received this message because you are subscribed to the Google Groups "modwsgi" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/modwsgi. For more options, visit https://groups.google.com/d/optout.
