On 15/04/16 10:58, Anant Patil wrote:
On 14-Apr-16 23:09, Zane Bitter wrote:
On 11/04/16 04:51, Anant Patil wrote:
After a lot of ping-pong in my head, I have taken a different approach to
implementing stack-update-cancel when convergence is on. Polling for
traversal updates in each heat engine worker is not an efficient method,
and neither is broadcasting.

In the new implementation, when a stack-cancel-update request is
received, the heat engine worker immediately cancels the eventlets
running locally for the stack. Then it sends cancel messages only to
those heat engines that are working on the stack, one request per engine.
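To make the targeted-cancel idea concrete, here is a toy sketch using stdlib threading in place of eventlet and the RPC layer (every name here is hypothetical, not Heat's actual API): cancel local tasks for the stack first, then send one message per remote engine that has work in flight for that stack.

```python
# Minimal sketch (stdlib threading stands in for eventlet; a plain list
# stands in for the engine's RPC queue; all names are hypothetical).

import threading

class Worker:
    def __init__(self, engine_id):
        self.engine_id = engine_id
        self.cancel_flags = {}          # stack_id -> threading.Event
        self.inbox = []                 # stand-in for the engine's RPC queue

    def start_task(self, stack_id):
        self.cancel_flags.setdefault(stack_id, threading.Event())

    def cancel_local(self, stack_id):
        # Signal every local task for this stack to stop at its next step.
        if stack_id in self.cancel_flags:
            self.cancel_flags[stack_id].set()

def cancel_stack_update(local, all_workers, stack_id, engines_working):
    """Cancel locally first, then one targeted message per remote engine."""
    local.cancel_local(stack_id)
    sent = []
    for w in all_workers:
        if w is not local and w.engine_id in engines_working:
            w.inbox.append(('cancel', stack_id))   # one request per engine
            sent.append(w.engine_id)
    return sent
```

The point of the sketch is only the fan-out pattern: no polling and no broadcast, just the engines known to hold in-progress work for the stack.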

I'm concerned that this is forgetting the reason we didn't implement
this in convergence in the first place. The purpose of
stack-cancel-update is to roll the stack back to its pre-update state,
not to unwedge blocked resources.


Yes, we thought this was never needed because we consciously decided
that the concurrent-update feature would suffice for users' needs; that
is exactly why I am implementing this so late. But there were questions
about API compatibility, and what if the user really wants to cancel
the update, knowing the consequences?

Cool, we are on the same page then :)

The problem with just killing a thread is that the resource gets left in
an unknown state. (It's slightly less dangerous if you do it only during
sleeps, but still the state is indeterminate.) As a result, we mark all
such resources UPDATE_FAILED, and anything (apart from nested stacks) in
a FAILED state is liable to be replaced on the next update (straight
away in the case of a rollback). That's why in convergence we just let
resources run their course rather than cancelling them, and of course we
are able to do so because they don't block other operations on the stack
until they reach the point of needing to operate on that particular
resource.
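The replacement rule described above can be illustrated with a toy predicate (names are mine, not Heat's): anything in a FAILED state, apart from nested stacks, is liable to be replaced on the next update.

```python
# Toy illustration of the rule described above (hypothetical names):
# a FAILED resource, unless it is a nested stack, is liable to be
# replaced on the next update (straight away on rollback).

from dataclasses import dataclass

@dataclass
class Resource:
    state: str                  # e.g. 'UPDATE_FAILED', 'UPDATE_COMPLETE'
    is_nested_stack: bool = False

def liable_to_replace(res):
    return res.state.endswith('FAILED') and not res.is_nested_stack
```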


The eventlet returns after each "step", so it's not that bad, but I do
agree that the resource might not be in a state from where it can
"resume", and hence the update-replace.

Yeah, I saw you implemented it that way, and this is a *big* improvement. It will help avoid bugs like http://lists.openstack.org/pipermail/openstack-dev/2016-January/084467.html
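A toy illustration of why cancelling only at step boundaries is less dangerous than killing a thread outright (names are mine, not those of Heat's TaskRunner): the cancel takes effect between steps, so no step is interrupted midway, though the resource itself may still be left half-configured.

```python
# Hedged sketch of step-wise cancellation (hypothetical names; real Heat
# drives resource actions as task steps on eventlet greenthreads).

class StepwiseTask:
    def __init__(self, steps):
        self._steps = steps
        self.cancelled = False
        self.done_steps = []

    def cancel(self):
        # Takes effect only at the next step boundary, so a step is never
        # interrupted midway; the *resource* may still be half-configured.
        self.cancelled = True

    def run(self):
        for step in self._steps:
            if self.cancelled:
                return 'CANCELLED'      # stopped between steps
            self.done_steps.append(step())
        return 'COMPLETE'
```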

The issue is that Heat *always* moves the resource to FAILED and therefore it is *always* replaced in the future, even if it would have completed fine.

So doing some trivial change that is guaranteed to happen in-place could result in your critical resource that must never be replaced (e.g. Cinder volume) being replaced if you happen to cancel the update at just the wrong moment.

I acknowledge your concern here, but my view is that the user knows the
stack is stuck because of some unexpected failure that heat is not aware
of, and wants to cancel it.

I think there are two different use cases here: (1) just stop the update and don't start updating any more resources (and maybe roll back what has already been done); and (2) kill the update on this resource that is stuck. Using the same command for both is likely to cause trouble for people who only wanted the first one.

The other option would be to have stack-cancel-update just do (1) by default, but add a --cancel-me-harder option that also does (2).
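The split could look roughly like this (a sketch under assumed names, not Heat's real interfaces): the default path only stops scheduling and optionally rolls back, while the "harder" mode additionally kills in-flight tasks, accepting that those resources land in FAILED and may later be replaced.

```python
# Sketch of the two cancel modes (all names hypothetical):
# (1) default: stop scheduling further resources, optionally roll back;
# (2) opt-in "harder" mode: also kill in-flight resource tasks.

class FakeStack:
    def __init__(self, in_progress):
        self.in_progress = in_progress  # resources currently being updated
        self.scheduling = True
        self.rolled_back = False
        self.killed = []

    def stop_scheduling(self):
        self.scheduling = False

    def kill_task(self, res):
        self.killed.append(res)         # this resource would end up FAILED

    def rollback(self):
        self.rolled_back = True

def cancel_update(stack, rollback=True, harder=False):
    stack.stop_scheduling()             # (1) always: no new resources started
    if harder:                          # (2) only on explicit request
        for res in stack.in_progress:
            stack.kill_task(res)
    if rollback:
        stack.rollback()
```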

That leaves the problem of what to do when you _know_ a resource is
going to fail, you _want_ to replace it, and you don't want to wait for
the stack timeout. (In theory this problem will go away when Phase 2 of
convergence is fully implemented, but I agree we need a solution for
Phase 1.) Now that we have the mark-unhealthy API,[1] that seems to me
like a better candidate for the functionality to stop threads than
stack-cancel-update is, since its entire purpose in life is to set a
resource into a FAILED state so that it will get replaced on the next
stack update.

So from a user's perspective, they would issue stack-cancel-update to
start the rollback, and iff that gets stuck waiting on a resource that
is doomed to fail eventually and which they just want to replace, they
can issue resource-mark-unhealthy to just stop that resource.


I was thinking of making rollback optional when cancelling the
update. The user may want to cancel the update and issue a new one, but
not roll back.

+1, this is a good idea. I originally thought that you'd never want to leave the stack in an intermediate state, but experience with TripleO (which can't really do rollbacks) is that sometimes you really do just want to hit the panic button and stop the world :D

What do you think?


I think it is a good idea, but I see that a resource can be marked
unhealthy only after it is done.

Currently, yes. The idea would be to change that so that if it finds the resource IN_PROGRESS then it kills the thread and makes sure the resource is in a FAILED state. I imagine/hope it wouldn't require big changes to your patch, mostly just changing where it's triggered from.

The trick would be if the stack update is still running and the resource is currently IN_PROGRESS to make sure that we fail the whole stack update (rolling back if the user has enabled that).
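A sketch of that proposed behaviour (hypothetical names, not the actual mark-unhealthy code path): on an IN_PROGRESS resource it would kill the running task and fail the stack update, instead of only applying to resources whose action has finished.

```python
# Sketch of the proposed change (hypothetical names): mark-unhealthy on an
# IN_PROGRESS resource kills its task and fails the whole stack update;
# either way the resource ends FAILED, so the next update replaces it.

class Res:
    def __init__(self, status):
        self.status = status
        self.task_killed = False

    def kill_task(self):
        self.task_killed = True

def mark_unhealthy(res, stack_events):
    if res.status == 'IN_PROGRESS':
        res.kill_task()
        stack_events.append('UPDATE_FAILED')   # roll back if enabled
    res.status = 'FAILED'          # replaced on the next stack update
```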

The cancel update would take care of
in-progress resources gone bad. I really thought mark-unhealthy and
stack-cancel-update were complementary features rather than contradictory ones.

I'm relaxed about whether this is implemented as part of the mark-unhealthy or as a non-default option to cancel-update. The main thing is not to put IN_PROGRESS resources into a FAILED state by default whenever the user cancels an update.

Reusing mark-unhealthy as the trigger for this functionality seemed appealing because it already has basically the semantics we want (tell Heat to replace this resource on the next update), so there should be no surprises for users, and because it offers fine-grained control (at the resource level rather than the stack level).

cheers,
Zane.

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
