I was also thinking of using the environment to hold the breakpoint, similar to how it holds parameters. The CLI and API would then process it just like parameters.
As for the state of a stack hitting the breakpoint, leveraging the FAILED state seems to be sufficient; we just need to add enough information to differentiate between a failed resource and a resource at a breakpoint. Something like emitting an event or a message should be enough to make that distinction. Debuggers for native programs typically do the same thing: they leverage the exception handling in the OS by inserting an artificial error at the breakpoint to force the program to stop. The debugger then just remembers the addresses of these artificial errors to decode the state of the stopped program.

As for the workflow, instead of spinning in the scheduler waiting for a signal, I was thinking of moving the stack off the engine as a failed stack. So this would be an end state for the stack as Steve suggested, but without adding a new stack state. Again, this is similar to how a program being debugged is handled: it is moved off the ready queue and its context is preserved for examination. This seems to keep the implementation simple, and we don't have to worry about timeouts, performance, etc.

Continuing from the breakpoint should then be similar to a stack-update on a failed stack. We do need some additional handling, such as allowing in-progress resources to run to completion instead of aborting.

For the parallel paths in a template, I am thinking about these alternatives:

1. Stop after all the current in-progress resources complete, but do not start any new resources even if there is no dependency. This should be easier to implement, but the state of the stack would be non-deterministic.
2. Stop only the paths with the breakpoint, and continue all other parallel paths to completion. This seems harder to implement, but the stack would be in a deterministic state and easier for the user to reason about.
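To make alternative 2 concrete, here is a rough sketch in plain Python. It is not Heat code: the function name, the dependency dict and the three result lists are all made up for illustration, and it assumes a breakpoint means "stop before acting on this resource". It needs Python 3.9+ for graphlib.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def walk_with_breakpoints(deps, breakpoints):
    """Hypothetical illustration of alternative 2 (not Heat's scheduler).

    deps maps each resource name to the set of resources it depends on.
    Returns (completed, held, blocked):
      completed - resources on paths that never reach a breakpoint
      held      - resources stopped at their own breakpoint
      blocked   - resources that depend, directly or not, on a held one
    """
    completed, held, blocked = [], [], []
    stopped = set()  # everything that did not run: held or blocked
    # static_order() yields every resource after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        if name in breakpoints:
            held.append(name)
            stopped.add(name)
        elif set(deps.get(name, ())) & stopped:
            blocked.append(name)
            stopped.add(name)
        else:
            completed.append(name)
    return completed, held, blocked
```

With a breakpoint on "server" and a template where "attach" needs "server" and "volume" while "db" needs only "volume", the "db" path runs to completion, "server" is held and "attach" is blocked. The result depends only on the graph and the breakpoints, not on timing, which is what makes the stack state deterministic.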
To be compatible with convergence, I had suggested to Clint earlier to add a mode where the convergence engine does not attempt to retry so the user can debug, and I believe this was added to the blueprint.

Ton,

From: Steven Hardy <sha...@redhat.com>
To: "OpenStack Development Mailing List (not for usage questions)" <firstname.lastname@example.org>
Date: 01/12/2015 02:40 PM
Subject: Re: [openstack-dev] [Heat] Where to keep data about stack breakpoints?

On Mon, Jan 12, 2015 at 05:10:47PM -0500, Zane Bitter wrote:
> On 12/01/15 13:05, Steven Hardy wrote:
> >>>I also had a chat with Steve Hardy and he suggested adding a STOPPED state
> >>>to the stack (this isn't in the spec). While not strictly necessary to
> >>>implement the spec, this would help people figure out that the stack has
> >>>reached a breakpoint instead of just waiting on a resource that takes a long
> >>>time to finish (the heat-engine log and event-list still show that a
> >>>breakpoint was reached but I'd like to have it in stack-list and
> >>>resource-list, too).
> >>>
> >>>It makes more sense to me to call it PAUSED (we're not completely stopping
> >>>the stack creation after all, just pausing it for a bit), I'll let Steve
> >>>explain why that's not the right choice :-).
> >So, I've not got strong opinions on the name, it's more the workflow:
> >
> >1. User triggers a stack create/update
> >2. Heat walks the graph, hits a breakpoint and stops.
> >3. Heat explicitly triggers continuation of the create/update
>
> Did you mean the user rather than Heat for (3)?

Oops, yes I did.

> >My argument is that (3) is always a stack update, either a PUT or PATCH
> >update, e.g. we _are_ completely stopping stack creation, then a user can
> >choose to re-start it (either with the same or a different definition).
>
> Hmmm, ok that's interesting. I have not been thinking of it that way.
> I've always thought of it like this:
>
> http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/adding-lifecycle-hooks.html
>
> (Incidentally, this suggests an implementation where the lifecycle hook is
> actually a resource - with its own API, naturally.)
>
> So, if it's requested, before each operation we send out a notification
> (hopefully via Zaqar), and if a breakpoint is set that operation is not
> carried out until the user makes an API call acknowledging it.

I guess I was trying to keep it initially simpler than that, given that we don't have any integration with a heat-user messaging system at present.

> >So, it _is_ really an end state, as a user might never choose to update
> >from the stopped state, in which case *_STOPPED makes more sense.
>
> That makes a bit more sense now.
>
> I think this is going to be really hard to implement though. Because while
> one branch of the graph stops, other branches have to continue as far as
> they can. At what point do you change the state of the stack?

True, this is a disadvantage of specifying a single breakpoint when there may be parallel paths through the graph.

However, I was thinking we could just reuse our existing error path implementation, so it needn't be hard to implement at all, e.g.

1. Stack action started where a resource has a breakpoint set
2. Stack.stack_task.resource_action checks if resource is a breakpoint
3. If a breakpoint is set, we raise an exception.ResourceFailure subclass
4. The normal error_wait_time is respected, e.g. currently in-progress actions are given a chance to complete.

Basically, the only implementation would be raising a special new type of exception, which would enable a suitable message (and event) to be shown to the user: "Stack create aborted due to breakpoint on resource foo".

Pre/post breakpoint actions/messaging could be added later via a similar method to the stack-level lifecycle plugin hooks.
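The error-path steps above could look roughly like the following. This is a standalone sketch, not Heat code: ResourceFailure here is a simplified stand-in for heat's exception.ResourceFailure, and BreakpointReached, resource_action and run_stack are hypothetical names used only for illustration. Resources are walked sequentially, so error_wait_time handling (step 4) is not modelled.

```python
class ResourceFailure(Exception):
    """Simplified stand-in for heat's exception.ResourceFailure."""

    def __init__(self, message, resource_name, action):
        super().__init__(message)
        self.resource_name = resource_name
        self.action = action


class BreakpointReached(ResourceFailure):
    """Hypothetical subclass raised when a resource has a breakpoint set."""

    def __init__(self, resource_name, action):
        message = "Stack %s aborted due to breakpoint on resource %s" % (
            action.lower(), resource_name)
        super().__init__(message, resource_name, action)


def resource_action(resource_name, action, breakpoints):
    """Steps 2 and 3: check for a breakpoint before acting on the resource."""
    if resource_name in breakpoints:
        raise BreakpointReached(resource_name, action)
    # A real implementation would create or update the resource here.


def run_stack(resources, action, breakpoints):
    """Step 1: walk the resources, reusing the ordinary failure path."""
    events = []
    for name in resources:
        try:
            resource_action(name, action, breakpoints)
        except ResourceFailure as exc:
            # A breakpoint lands in the same *_FAILED state as a real
            # failure; only the event message tells the two apart.
            events.append((name, "%s_FAILED" % action, str(exc)))
            return "%s_FAILED" % action, events
        events.append((name, "%s_COMPLETE" % action, ""))
    return "%s_COMPLETE" % action, events
```

Calling run_stack(["net", "foo", "server"], "CREATE", {"foo"}) ends in CREATE_FAILED with the event message "Stack create aborted due to breakpoint on resource foo", and "server" is never started. No new stack state is needed in this sketch; the exception type and its message carry the distinction.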
If folks are happy with e.g. CREATE_FAILED as a post-breakpoint state, this could simplify things a lot, as we'd not need any new state or much new code at all?

> >Paused implies the same action as the PATCH update, only we trigger
> >continuation of the operation from the point we reached via some sort of
> >user signal.
> >
> >If we actually pause an in-progress action via the scheduler, we'd have to
> >start worrying about stuff like token expiry, hitting timeouts, resilience
> >to engine restarts, etc, etc. So forcing an explicit update seems simpler
> >to me.
>
> Yes, token expiry and stack timeouts are annoying things we'd have to deal
> with. (Resilience to engine restarts is not affected though.) However, I'm
> not sure your model is simpler, and in particular it sounds much harder to
> implement in the convergence architecture.

So you're advocating keeping the scheduler spinning, until a user sends a signal to the resource to clear the breakpoint?

I don't see why we couldn't do both, have an "abort_on_breakpoint" flag or something, but I'd be interested in further understanding how the error-path approach outlined above would be incompatible with convergence.

Thanks,

Steve

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev