On 03/09/15 02:56, Angus Salkeld wrote:
On Thu, Sep 3, 2015 at 3:53 AM Zane Bitter <[email protected]> wrote:

    On 02/09/15 04:55, Steven Hardy wrote:
     > On Wed, Sep 02, 2015 at 04:33:36PM +1200, Robert Collins wrote:
     >> On 2 September 2015 at 11:53, Angus Salkeld <[email protected]> wrote:
     >>
     >>> 1. limit the number of resource actions in parallel (maybe based
     >>> on the number of cores)
     >>
     >> I'm having trouble mapping that back to 'and heat-engine is
     >> running on 3 separate servers'.
     >
     > I think Angus was responding to my test feedback, which was a
     > different setup, one 4-core laptop running heat-engine with 4 worker
     > processes.
     >
     > In that environment, the level of additional concurrency becomes
     > a problem
     > because all heat workers become so busy that creating a large stack
     > DoSes the Heat services, and in my case also the DB.
     >
     > If we had a configurable option, similar to num_engine_workers, which
     > enabled control of the number of resource actions in parallel, I
     > probably could have controlled that explosion in activity to a more
     > manageable series of tasks, e.g. I'd set num_resource_actions to
     > (num_engine_workers*2) or something.

    I think that's actually the opposite of what we need.

    The resource actions are just sent to the worker queue to get processed
    whenever. One day we will get to the point where we are overflowing the
    queue, but I guarantee that we are nowhere near that day. If we are
    DoSing ourselves, it can only be because we're pulling *everything* off
    the queue and starting it in separate greenthreads.


The worker does not use a greenthread per job like service.py does.
The issue is that if you have actions that are fast, you can hit the DB hard.

QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

It seems like it's not very hard to hit this limit. It comes from simply
loading the resource in the worker:
"/home/angus/work/heat/heat/engine/worker.py", line 276, in check_resource
"/home/angus/work/heat/heat/engine/worker.py", line 145, in _load_resource
"/home/angus/work/heat/heat/engine/resource.py", line 290, in load
resource_objects.Resource.get_obj(context, resource_id)

This is probably me being naive, but that sounds strange. I would have thought that the only way to exhaust a connection pool is to have lots of connections open simultaneously, not to do lots of actions in rapid succession. That suggests to me that either we are failing to expeditiously close connections and return them to the pool, or that we are - explicitly or implicitly - processing a bunch of messages in parallel.
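To illustrate what I mean (a minimal sketch outside of Heat, assuming SQLAlchemy and a throwaway SQLite file rather than our real engine setup): the error above only fires when pool_size + max_overflow connections are checked out at the same moment; opening and closing connections in rapid succession never hits it.

from sqlalchemy import create_engine, exc, text
from sqlalchemy.pool import QueuePool

# Same limits as in the error message: 5 + 10 = 15 connections max,
# but with a short timeout so the demo fails fast.
engine = create_engine('sqlite:///pool_demo.db',
                       poolclass=QueuePool,
                       pool_size=5, max_overflow=10,
                       pool_timeout=2)

# Rapid succession: never exhausts the pool, because each connection is
# returned before the next one is checked out.
for _ in range(1000):
    with engine.connect() as conn:
        conn.execute(text('SELECT 1'))

# Simultaneous checkouts: the 16th connect() raises TimeoutError with
# "QueuePool limit of size 5 overflow 10 reached".
held = [engine.connect() for _ in range(15)]
try:
    engine.connect()
except exc.TimeoutError as e:
    print(e)
finally:
    for conn in held:
        conn.close()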

    In an ideal world, we might only ever pull one task off that queue at a
    time. Any time the task is sleeping, we would use for processing stuff
    off the engine queue (which needs a quick response, since it is serving
    the REST API). The trouble is that you need a *huge* number of
    heat-engines to handle stuff in parallel. In the reductio-ad-absurdum
    case of a single engine only processing a single task at a time, we're
    back to creating resources serially. So we probably want a higher number
    than 1. (Phase 2 of convergence will make tasks much smaller, and may
    even get us down to the point where we can pull only a single task at a
    time.)

    However, the fewer engines you have, the more greenthreads we'll have to
    allow to get some semblance of parallelism. To the extent that more
    cores means more engines (which assumes all running on one box, but
    still), the number of cores is negatively correlated with the number of
    tasks that we want to allow.

    Note that all of the greenthreads run in a single CPU thread, so having
    more cores doesn't help us at all with processing more stuff in
    parallel.


Except, as I said above, we are not creating greenthreads in the worker.

Well, maybe we'll need to in order to make things still work sanely with a low number of engines :) (Should be pretty easy to do with a semaphore.)
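
Roughly what I have in mind (hypothetical names, and it assumes we do spawn a greenthread per incoming message, which as you point out the worker doesn't do today):

import eventlet
from eventlet import semaphore

# Hypothetical cap; could come from a config option.
MAX_CONCURRENT_CHECKS = 4

_check_semaphore = semaphore.Semaphore(MAX_CONCURRENT_CHECKS)

def _throttled(do_check, *args, **kwargs):
    # Blocks (yielding to other greenthreads) once MAX_CONCURRENT_CHECKS
    # checks are in flight, so a burst of messages can't all hit the DB
    # at once.
    with _check_semaphore:
        return do_check(*args, **kwargs)

def handle_check_resource(do_check, *args, **kwargs):
    # Spawn per message, but the semaphore bounds the real concurrency.
    eventlet.spawn_n(_throttled, do_check, *args, **kwargs)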

I think what y'all are suggesting is limiting the number of jobs that go into the queue... and that's quite wrong IMO. Apart from the fact that it's impossible (resources put jobs into the queue entirely independently, and have no knowledge of the global state required to throttle inputs), we shouldn't implement an in-memory queue holding long-running tasks whose state can be lost if the process dies - the whole point of convergence is that we have... a message queue for that. We need to limit the rate that stuff comes *out* of the queue. And, again, since we have no knowledge of the global state, we can only control the rate at which an individual worker processes tasks. The way to avoid killing the DB is to put a constant ceiling on the workers * concurrent_tasks_per_worker product.
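
Concretely, something like this on the consumer side (a sketch only, with made-up option names, not the actual worker code):

import eventlet
from oslo_config import cfg

# Hypothetical option: workers * max_concurrent_resource_actions is the
# constant ceiling we actually care about.
opts = [
    cfg.IntOpt('max_concurrent_resource_actions',
               default=8,
               help='Maximum number of resource check/update tasks '
                    'processed concurrently by one heat-engine worker.'),
]
cfg.CONF.register_opts(opts, group='convergence')

class ThrottledDispatcher(object):
    """Pulls tasks off the RPC queue no faster than we can process them."""

    def __init__(self):
        # GreenPool.spawn_n blocks once the pool is full, so further
        # messages stay on the (durable) message queue instead of piling
        # up in memory as greenthreads.
        self._pool = eventlet.GreenPool(
            cfg.CONF.convergence.max_concurrent_resource_actions)

    def dispatch(self, task, *args, **kwargs):
        self._pool.spawn_n(task, *args, **kwargs)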

cheers,
Zane.
