Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
> Hi,
>
> I would like to discuss various approaches towards fixing bug
> https://launchpad.net/bugs/1533176
>
> When convergence is on, and if the stack is stuck, there is no way to
> cancel the existing request. This feature was not implemented in
> convergence, as the user can again issue an update on an in-progress
> stack. But if a resource worker is stuck, the new update will wait
> forever on it and the update will not be effective.
>
> The solution is to implement a cancel request. Since the work for a
> stack is distributed among heat engines, the cancel request will not
> work as it does in the legacy way. Many or all of the heat engines
> might be running worker threads to provision a stack.
>
> I could think of two options which I would like to discuss:
>
> (a) When a user-triggered cancel request is received, set the stack
> current traversal to None or something else other than the current
> traversal. With this the new check-resources/workers will never be
> triggered. This is okay as long as the worker(s) is not stuck. The
> existing workers will finish running, and no new check-resource
> (workers) will be triggered, and it will be a graceful cancel. But the
> workers that are stuck will be stuck forever till the stack times out.
> To take care of such cases, we will have to implement logic of
> "polling" the DB at regular intervals (maybe at each step() of the
> scheduler task) and bail out if the current traversal is updated.
> Basically, each worker will "poll" the DB to see if the current
> traversal is still valid and if not, stop itself. The drawback of this
> approach is that all the workers will be hitting the DB and incur a
> significant overhead. Besides, all the stack workers, irrespective of
> whether they will be cancelled or not, will keep on hitting the DB.
> The advantage is that it probably is easier to implement. Also, if the
> worker is stuck in a particular "step", then this approach will not
> work.
>
> (b) Another approach is to send a cancel message to all the heat
> engines when one receives a stack cancel request. The idea is to use
> the thread group manager in each engine to keep track of threads
> running for a stack, and stop the thread group when a cancel message
> is received. The advantage is that the messages to cancel stack
> workers are sent only when required and there is no other overhead.
> The drawback is that the cancel message is broadcast to all heat
> engines, even if they are not running any workers for the given stack,
> though in such cases it will be just a no-op for the heat-engine (the
> message will be gracefully discarded).
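
To make the shape of (b) concrete, here is a rough sketch in plain
Python. It is not Heat code: EngineThreadGroups and broadcast_cancel are
hypothetical names, the RPC fan-out is replaced by a simple loop, and
workers are signalled through a threading.Event rather than by stopping
a real thread group. It only shows that an engine with no workers for
the stack treats the message as a no-op.

```python
import threading
from collections import defaultdict


class EngineThreadGroups:
    """Hypothetical per-engine registry of cancel flags, keyed by stack id."""

    def __init__(self, engine_id):
        self.engine_id = engine_id
        self._lock = threading.Lock()
        self._events = defaultdict(list)  # stack_id -> [threading.Event]

    def register(self, stack_id):
        """Record a local worker for stack_id and return its cancel flag."""
        event = threading.Event()
        with self._lock:
            self._events[stack_id].append(event)
        return event

    def cancel(self, stack_id):
        """Signal every local worker for stack_id; return how many there were."""
        with self._lock:
            events = self._events.pop(stack_id, [])
        for event in events:
            event.set()
        return len(events)  # 0 means this engine had nothing to cancel


def broadcast_cancel(engines, stack_id):
    """Stand-in for the fan-out message: deliver the cancel to every engine."""
    return {engine.engine_id: engine.cancel(stack_id) for engine in engines}


# Assumed scenario: only engine-1 is running a worker for stack-A.
engines = [EngineThreadGroups("engine-1"), EngineThreadGroups("engine-2")]
flag = engines[0].register("stack-A")
result = broadcast_cancel(engines, "stack-A")
```

A worker in this sketch would still have to check its flag between
steps, so a worker blocked inside a single step is not helped by the
flag alone.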

Oh hah, I just sent (b) as an option to avoid (a) without really
thinking about (b) again.

I don't think the cancel broadcasts are all that much of a drawback. I
do think you need to rate limit cancels though, or you give users the
chance to DDoS the system.

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
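
For the rate limiting mentioned in the reply, one possible shape is a
minimum interval between accepted cancels. The sketch below is only an
illustration: CancelRateLimiter is a hypothetical name, keying the limit
by stack id is an assumption (it could as easily be per user or per
tenant), and the state lives in one process, so several engines or API
workers would need shared storage instead.

```python
import time


class CancelRateLimiter:
    """Hypothetical limiter: one accepted cancel per key per min_interval."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between accepted cancels
        self._clock = clock               # injectable so it can be tested
        self._last = {}                   # key -> time of last accepted cancel

    def allow(self, key):
        """Return True if a cancel for key may be broadcast now."""
        now = self._clock()
        last = self._last.get(key)
        if last is not None and now - last < self.min_interval:
            return False  # too soon: drop or reject this cancel
        self._last[key] = now
        return True
```

The caller would check allow(stack_id) before fanning the cancel out and
reject the request otherwise, so repeated cancels from one user cannot
turn into repeated broadcasts.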
