Hi all - Let me start out with the assumptions I'm working from for what I want to talk about.
1. I'm looking at Nova right now, but I think similar things are going on in other OpenStack apps.

2. Settings that we see in nova.conf, including:

     #wsgi_default_pool_size = 1000
     #max_pool_size = <None>
     #max_overflow = <None>
     #osapi_compute_workers = <None>
     #metadata_workers = <None>

   are often not understood by deployers, and/or are left unchanged in a wide variety of scenarios. If you are in fact working for deployers that *do* change these values to something totally different, then you might not be impacted here, and if it turns out that everyone changes all these settings in real-world scenarios and zzzeek you are just being silly thinking nobody sets these appropriately, then fooey for me, I guess.

3. There's talk about more OpenStack services, at least Nova from what I heard the other day, moving to be based on a real webserver deployment in any case, the same way Keystone is. To the degree this is true, it would also mitigate what I'm seeing, but still, there are good changes that can be made here.

Basically, the syndrome I want to talk about can be mostly mitigated just by changing the numbers around in #2, but I don't really know that people know any of this, and also I think some of the defaults here should just be changed completely, as their current values are useless in pretty much all cases.

Suppose we run on a 24-core machine, and therefore have 24 API worker processes. Each worker represents a WSGI server, which will use an eventlet greenlet pool with 1000 greenlets. Then, assuming neither max_pool_size nor max_overflow is changed, this means that for a single SQLAlchemy Engine, the most database connections allowed by that Engine at one time is *15*: pool_size defaults to 5 and max_overflow defaults to 10. We get our engine from oslo.db; however, oslo.db does not change these defaults, which ultimately come from SQLAlchemy itself.

The problem, then, is basically that 1000 greenlets is way, way more than 15, meaning hundreds of requests can all pile up on a single process and all be blocked, waiting on a connection pool that's been configured to allow only 15 database connections at most.

But wait! you say. We have twenty-four worker processes. So if we had 100 concurrent requests, these requests would not all pile up on just one process, they'd be distributed among the workers. Any additional requests beyond the 15 * 24 == 360 that we can handle (assuming a simplistic one-to-one relationship between API requests and database connections, which is not actually the case) would just queue up as they do anyway, so it makes no difference. Right? *Right???*

It does make a difference! Because show me where in the Nova source code the algorithm is that knows how to distribute requests evenly among the workers... There is no such logic! Some months ago, I began thinking and fretting: how the heck does this work? There are 24 workers, one socket.accept(), requests come in and sockets are somehow divvied up among the child forks, but *how*? I asked some of the deep Unix gurus locally here, and the best answer we could come up with is: it's random!

Cue the Mythbusters music. "Nova receives WSGI requests and sends them to workers with a random distribution, meaning that under load, some workers will have too many requests and be waiting on DB access, which can in fact cause pool timeout issues under high-latency circumstances, while others will be more idle than they should be."
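To make the mismatch concrete, here's a tiny illustration (not Nova code; sqlite just stands in for the real database) of the two defaults side by side, the connection pool that oslo.db hands us versus the greenlet pool the eventlet WSGI server runs:

    # Illustration only -- not Nova code; sqlite stands in for the real database.
    import sqlite3

    import eventlet
    from sqlalchemy.pool import QueuePool

    # The defaults oslo.db passes through from SQLAlchemy:
    # pool_size=5, max_overflow=10.
    db_pool = QueuePool(lambda: sqlite3.connect(":memory:"),
                        pool_size=5, max_overflow=10)

    # Hard ceiling on simultaneous connection checkouts per worker process.
    max_db_connections = 5 + 10            # == 15

    # The pool of greenlets the WSGI server uses to handle requests,
    # per wsgi_default_pool_size.
    green_pool = eventlet.GreenPool(1000)

    print("DB connections available per worker: %d" % max_db_connections)
    print("greenlets that may want one at once:  %d" % green_pool.size)

Run that anywhere and the ratio is the whole story: up to 1000 in-flight requests per worker, competing for at most 15 connections.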
As we begin the show, we cut to a background segment where we show that in fact, Mike and some other folks doing load testing actually *see* connection pool timeout errors in the logs already, on a 24-core machine, even though we see hundreds of idle connections at the same time (just to note, the error we are talking about is "QueuePool limit of size 5 overflow 5 reached, connection timed out, timeout 5"). The fact that we actually see this happening in a real test situation is what led me to finally just write a test suite for the whole thing.

Here's the test suite! https://gist.github.com/zzzeek/c69138fd0d0b3e553a1f

I've tried to make this as super-simple as possible to read, use, and understand. It uses Nova's nova.wsgi.Server directly with a simple "hello world"-style app, as well as oslo_service.service.Service and service.launch(), the same way I see it done in nova/service.py (please confirm I'm using all the right code and things here just like in Nova, thanks!).

The "hello world" app gets a connection from the pool, does nothing with it, waits a few seconds, then returns it, all the while counting everything going on and reporting on its metrics every 10 requests. The app uses a SQLAlchemy connection pool with a somewhat lower number of connections, and a timeout of only ten seconds instead of the default thirty (but feel free to change it on the command line), plus a "work" operation that takes a random amount of time between zero and five seconds, just to make the problem easier to reproduce on any hardware.

When we leave the default of 1000 greenlets and hit the server with Apache ab at a concurrency of at least 75, there are connection pool timeouts galore, and the metrics also show workers waiting anywhere from a full second to five seconds (before timing out) for a database connection:

    INFO:loadtest:Status for pid 32625: avg wait time for connection 1.4358 sec; worst wait time 3.9267 sec; connection failures 5; num requests over the limit: 29; max concurrency seen: 25

    ERROR:loadtest:error in pid 32630: QueuePool limit of size 5 overflow 5 reached, connection timed out, timeout 5

Bring the number of greenlets down to *ten* (yes, only ten) and the errors go to zero, and the ab test completes the given number of requests *faster* than it does with the 1000-greenlet version. The average time a worker spends waiting for a database connection drops by an order of magnitude:

    INFO:loadtest:Status for pid 460: avg wait time for connection 0.0140 sec; worst wait time 0.0540 sec; connection failures 0; num requests over the limit: 0; max concurrency seen: 11

That's even though our worker's "fake" work requests are still taking as long as five seconds per request to complete.

But if we only have a super low number of greenlets and only a few dozen workers, what happens if more than 240 requests come in at once? Aren't those connections going to get rejected? No way! eventlet's networking system is better than that; those connection requests just get queued up in any case, waiting for a greenlet to become available. Play with the script and its settings to see.

But if we're blocking connection attempts based on what's available at the database level, aren't we under-utilizing capacity for API calls that need to do a lot of other things besides DB access? The answer is that this may very well be true! Which makes the guidance more complicated depending on which service we are talking about.
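If you don't want to read the whole gist, its "hello world" app boils down to something like the following simplified sketch (this is my own condensation, not the actual script; the real thing additionally wires up nova.wsgi.Server, oslo_service, and the per-process metrics):

    # Simplified sketch of the load test's "hello world" app; the complete
    # script is in the gist linked above.
    import random
    import time

    import eventlet
    eventlet.monkey_patch()

    import sqlite3
    from eventlet import wsgi
    from sqlalchemy.pool import QueuePool

    # Deliberately small pool with a short timeout so the failure is easy to hit.
    pool = QueuePool(lambda: sqlite3.connect(":memory:"),
                     pool_size=5, max_overflow=5, timeout=10)

    def app(environ, start_response):
        conn = pool.connect()                 # raises TimeoutError after 10 sec
        try:
            time.sleep(random.random() * 5)   # pretend to do 0-5 seconds of "work"
        finally:
            conn.close()                      # return the connection to the pool
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello world\n"]

    if __name__ == "__main__":
        # custom_pool is the per-process greenlet pool; this is the knob that
        # wsgi_default_pool_size = 1000 corresponds to in nova.conf.
        wsgi.server(eventlet.listen(("127.0.0.1", 8080)),
                    app, custom_pool=eventlet.GreenPool(1000))

Point ab at it, then shrink the GreenPool number, and you can watch the timeouts appear and disappear.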
So here, my guidance is oriented towards those OpenStack services whose primary work is database access. Given the above caveat, I'm hoping people can look at this and verify my assumptions and the results. Assuming I am not just drunk on eggnog, what would my recommendations be? Basically:

1. At least for DB-oriented services, the number of 1000 greenlets should be *way*, *way* lower, and we most likely should allow a lot more connections to be used temporarily within a particular worker, which means I'd take the max_overflow setting and default it to something like 50, or 100. The greenlet number should then be very similar to the max_overflow number, and maybe even a little less, as Nova API calls right now will often use more than one connection concurrently (see the sketch in the P.S. below).

2. Longer term, let's please drop the eventlet pool thing and just use a real web server (but still tune the connection pool appropriately)! A real web server will at least know how to efficiently direct requests to worker processes. If all OpenStack workers were configurable under a single web server config, that would also be a nice way to centralize tuning and profiling overall.

Thanks for reading!
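P.S. Here's a minimal sketch of what recommendation #1 amounts to per worker; the specific numbers (an overflow of 50, a greenlet pool of 40) are hypothetical illustrations of the shape, not tested or proposed defaults:

    # Hypothetical numbers for illustration only -- not proposed hard defaults.
    import sqlite3

    import eventlet
    from sqlalchemy.pool import QueuePool

    # Let a busy worker burst well past the steady-state pool size
    # (this corresponds to raising max_overflow in nova.conf).
    db_pool = QueuePool(lambda: sqlite3.connect(":memory:"),
                        pool_size=5, max_overflow=50)

    # Size the per-worker greenlet pool near, or a bit under, that connection
    # ceiling (this corresponds to lowering wsgi_default_pool_size from 1000),
    # since a single API call may hold more than one connection at a time.
    green_pool = eventlet.GreenPool(40)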
