On 15/04/16 10:58, Anant Patil wrote:
On 14-Apr-16 23:09, Zane Bitter wrote:
On 11/04/16 04:51, Anant Patil wrote:
After a lot of ping-pong in my head, I have taken a different approach to
implementing stack-update-cancel when convergence is on. Polling for
traversal updates in each heat engine worker is not an efficient method,
and neither is broadcasting.

In the new implementation, when a stack-cancel-update request is
received, the heat engine worker immediately cancels the eventlets
running locally for the stack. Then it sends cancel messages only to
those heat engines that are working on the stack, one request per engine.
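To make the targeted-cancel idea concrete, here is a toy sketch using stdlib threading in place of eventlet and the RPC layer (every name here is hypothetical, not Heat's actual API): cancel local tasks for the stack first, then send one message per remote engine that has work in flight for that stack.

```python
# Minimal sketch (stdlib threading stands in for eventlet; a plain list
# stands in for the engine's RPC queue; all names are hypothetical).

import threading

class Worker:
    def __init__(self, engine_id):
        self.engine_id = engine_id
        self.cancel_flags = {}          # stack_id -> threading.Event
        self.inbox = []                 # stand-in for the engine's RPC queue

    def start_task(self, stack_id):
        self.cancel_flags.setdefault(stack_id, threading.Event())

    def cancel_local(self, stack_id):
        # Signal every local task for this stack to stop at its next step.
        if stack_id in self.cancel_flags:
            self.cancel_flags[stack_id].set()

def cancel_stack_update(local, all_workers, stack_id, engines_working):
    """Cancel locally first, then one targeted message per remote engine."""
    local.cancel_local(stack_id)
    sent = []
    for w in all_workers:
        if w is not local and w.engine_id in engines_working:
            w.inbox.append(('cancel', stack_id))   # one request per engine
            sent.append(w.engine_id)
    return sent
```

The point of the sketch is only the fan-out pattern: no polling and no broadcast, just the engines known to hold in-progress work for the stack.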

I'm concerned that this is forgetting the reason we didn't implement
this in convergence in the first place. The purpose of
stack-cancel-update is to roll the stack back to its pre-update state,
not to unwedge blocked resources.


Yes, we thought this was never needed because we consciously decided
that the concurrent-update feature would suffice for users' needs; that
is exactly why I am implementing this so late. But there were questions
about API compatibility, and what if the user really wants to cancel
the update, knowing the consequences?

Cool, we are on the same page then :)

The problem with just killing a thread is that the resource gets left in
an unknown state. (It's slightly less dangerous if you do it only during
sleeps, but still the state is indeterminate.) As a result, we mark all
such resources UPDATE_FAILED, and anything (apart from nested stacks) in
a FAILED state is liable to be replaced on the next update (straight
away in the case of a rollback). That's why in convergence we just let
resources run their course rather than cancelling them, and of course we
are able to do so because they don't block other operations on the stack
until they reach the point of needing to operate on that particular
resource.
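The replacement rule described above can be illustrated with a toy predicate (names are mine, not Heat's): anything in a FAILED state, apart from nested stacks, is liable to be replaced on the next update.

```python
# Toy illustration of the rule described above (hypothetical names):
# a FAILED resource, unless it is a nested stack, is liable to be
# replaced on the next update (straight away on rollback).

from dataclasses import dataclass

@dataclass
class Resource:
    state: str                  # e.g. 'UPDATE_FAILED', 'UPDATE_COMPLETE'
    is_nested_stack: bool = False

def liable_to_replace(res):
    return res.state.endswith('FAILED') and not res.is_nested_stack
```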


The eventlet returns after each "step", so it's not that bad, but I do
agree that the resource might not be in a state from where it can
"resume", and hence the update-replace.

Yeah, I saw you implemented it that way, and this is a *big* improvement. It will help avoid bugs like http://lists.openstack.org/pipermail/openstack-dev/2016-January/084467.html
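A toy illustration of why cancelling only at step boundaries is less dangerous than killing a thread outright (names are mine, not those of Heat's TaskRunner): the cancel takes effect between steps, so no step is interrupted midway, though the resource itself may still be left half-configured.

```python
# Hedged sketch of step-wise cancellation (hypothetical names; real Heat
# drives resource actions as task steps on eventlet greenthreads).

class StepwiseTask:
    def __init__(self, steps):
        self._steps = steps
        self.cancelled = False
        self.done_steps = []

    def cancel(self):
        # Takes effect only at the next step boundary, so a step is never
        # interrupted midway; the *resource* may still be half-configured.
        self.cancelled = True

    def run(self):
        for step in self._steps:
            if self.cancelled:
                return 'CANCELLED'      # stopped between steps
            self.done_steps.append(step())
        return 'COMPLETE'
```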

The issue is that Heat *always* moves the resource to FAILED and therefore it is *always* replaced in the future, even if it would have completed fine.

So doing some trivial change that is guaranteed to happen in-place could result in your critical resource that must never be replaced (e.g. Cinder volume) being replaced if you happen to cancel the update at just the wrong moment.

I acknowledge your concern here, but my view is that the user knows the
stack is stuck because of some unexpected failure that heat is not aware
of, and wants to cancel it.

I think there are two different use cases here: (1) just stop the update and don't start updating any more resources (and maybe roll back what has already been done); and (2) kill the update on this resource that is stuck. Using the same command for both is likely to cause trouble for people who only wanted the first one.

The other option would be to have stack-cancel-update just do (1) by default, but add a --cancel-me-harder option that also does (2).
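The split could look roughly like this (a sketch under assumed names, not Heat's real interfaces): the default path only stops scheduling and optionally rolls back, while the "harder" mode additionally kills in-flight tasks, accepting that those resources land in FAILED and may later be replaced.

```python
# Sketch of the two cancel modes (all names hypothetical):
# (1) default: stop scheduling further resources, optionally roll back;
# (2) opt-in "harder" mode: also kill in-flight resource tasks.

class FakeStack:
    def __init__(self, in_progress):
        self.in_progress = in_progress  # resources currently being updated
        self.scheduling = True
        self.rolled_back = False
        self.killed = []

    def stop_scheduling(self):
        self.scheduling = False

    def kill_task(self, res):
        self.killed.append(res)         # this resource would end up FAILED

    def rollback(self):
        self.rolled_back = True

def cancel_update(stack, rollback=True, harder=False):
    stack.stop_scheduling()             # (1) always: no new resources started
    if harder:                          # (2) only on explicit request
        for res in stack.in_progress:
            stack.kill_task(res)
    if rollback:
        stack.rollback()
```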

That leaves the problem of what to do when you _know_ a resource is
going to fail, you _want_ to replace it, and you don't want to wait for
the stack timeout. (In theory this problem will go away when Phase 2 of
convergence is fully implemented, but I agree we need a solution for
Phase 1.) Now that we have the mark-unhealthy API,[1] that seems to me
like a better candidate for the functionality to stop threads than
stack-cancel-update is, since its entire purpose in life is to set a
resource into a FAILED state so that it will get replaced on the next
stack update.

So from a user's perspective, they would issue stack-cancel-update to
start the rollback, and iff that gets stuck waiting on a resource that
is doomed to fail eventually and which they just want to replace, they
can issue resource-mark-unhealthy to just stop that resource.


I was thinking of making rollback optional when cancelling the
update. The user may want to cancel the update and issue a new one, but
not roll back.

+1, this is a good idea. I originally thought that you'd never want to leave the stack in an intermediate state, but experience with TripleO (which can't really do rollbacks) is that sometimes you really do just want to hit the panic button and stop the world :D

What do you think?


I think it is a good idea, but I see that a resource can be marked
unhealthy only after it is done.

Currently, yes. The idea would be to change that so that if it finds the resource IN_PROGRESS then it kills the thread and makes sure the resource is in a FAILED state. I imagine/hope it wouldn't require big changes to your patch, mostly just changing where it's triggered from.

The trick would be if the stack update is still running and the resource is currently IN_PROGRESS to make sure that we fail the whole stack update (rolling back if the user has enabled that).
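A sketch of that proposed behaviour (hypothetical names, not the actual mark-unhealthy code path): on an IN_PROGRESS resource it would kill the running task and fail the stack update, instead of only applying to resources whose action has finished.

```python
# Sketch of the proposed change (hypothetical names): mark-unhealthy on an
# IN_PROGRESS resource kills its task and fails the whole stack update;
# either way the resource ends FAILED, so the next update replaces it.

class Res:
    def __init__(self, status):
        self.status = status
        self.task_killed = False

    def kill_task(self):
        self.task_killed = True

def mark_unhealthy(res, stack_events):
    if res.status == 'IN_PROGRESS':
        res.kill_task()
        stack_events.append('UPDATE_FAILED')   # roll back if enabled
    res.status = 'FAILED'          # replaced on the next stack update
```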

The cancel update would take care of
in-progress resources gone bad. I really thought mark-unhealthy and
stack-cancel-update were complementary features rather than contradictory ones.

I'm relaxed about whether this is implemented as part of the mark-unhealthy or as a non-default option to cancel-update. The main thing is not to put IN_PROGRESS resources into a FAILED state by default whenever the user cancels an update.

Reusing mark-unhealthy as the trigger for this functionality seemed appealing because it already has basically the semantics we want (tell Heat to replace this resource on the next update), so there should be no surprises for users, and because it offers fine-grained control (at the resource level rather than the stack level).

cheers,
Zane.

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
