I'd vote for option 2 for sure, it's the one that respects the meaning of
"expected state".  The code at [2] is more or less misusing the expected
state, saying "I know I should be stopped, but something went wrong, so now
I expect to be on-fire".  This is counter to the design of the relationship
between expected and actual state, so it should instead be setting a
problem indicator.

I agree this may be a bit more fiddly than option 1; perhaps this could be
simplified a bit by introducing a couple more convenience wrapper methods
in 'ServiceStateLogic', like 'commenceStart', 'commenceStop',
'markStartFailed', 'markStopFailed', which would both set the expected
state correctly and set or clear a well-known problem indicator for just
this situation.

Geoff



On Wed, 27 Sep 2017 at 15:06 Thomas Bouron <[email protected]>
wrote:

> Hi Aled.
>
> Option 1 is definitely simpler. While it let the author decide what to do,
> it means that each policy has to have an ad-hoc behaviour based on what the
> "expected state" value is, which might not reflect the real expected state.
> As you said in you example, the expected state is "on-fire" which should
> never be IMO. It should always reflect... well, the expected state so
> either (starting|started|stopping|stopped).
>
> This is why I'm leaning toward option 2. I appreciate that it can (will?)
> be fiddly to implement this change across all entities/enrichers/policies
> but I think it is worth the effort. Also, this means that we will have much
> better information about what went wrong, the UI and CLI will benefit from
> it (this obviously assume we do some work there to make those "problem
> indicators" more visible)
>
> Best.
>
> On Mon, 18 Sep 2017 at 16:07 Aled Sage <[email protected]> wrote:
>
> > Hi all,
> >
> > TL;DR: low-level discussion of error-handling, and entity state; should
> > we change what we set the entity's "expected state" to when there are
> > errors?
> >
> > ---
> > I'm trying to fix https://issues.apache.org/jira/browse/BROOKLYN-534
> > (when stopping, ensure the `ServiceRestarter` doesn't restart the
> > entity, provisioning a new machine, in the situation where stop fails
> > for some reason).
> >
> > See the failing test I've added in [1].
> >
> > The problem is that, when the `location.release(machine)` fails, it sets
> > the entity's *expected state* to "on-fire".  This means we've lost all
> > record of whether we wanted to be stopped or running. The
> > ServiceRetarter therefore (sensibly?!) responds to the "on-fire" by
> > restarting it!
> >
> > Background: software prcoess entities normally have an "expected state"
> > saying what state they want to be in, and an "actual service state"
> > saying what it really is. The latter is inferred from a combination of
> > the expected state, whether the service.isUp, and also any explicit
> > "problem indicators" that have been set.
> >
> > ---
> > Options for fixing this include:
> >
> >  1. If the expected state is "on-fire", then don't do anything in such
> >     policies - instead rely on whatever set that expected state to deal
> >     with it (e.g. if something else called stop, which failed, then it
> >     should deal with the error itself).
> >  2. Don't set expected state to on-fire. Instead keep the expected state
> >     as what we really want it to be, but include an appropriate "problem
> >     indicator" (e.g. to say that "stop" failed).
> >
> > Option (1) would probably be easy, but it's unclear where you'd hook in
> > logic to deal with such failures. It also feels wrong that we've lost
> > all knowledge of what we were trying to do.
> >
> > Option (2) feels best. However, this stuff is fiddly. We'd need to make
> > the sure the "problem indicator" is cleared correctly when an entity
> > tries again to start or stop. If it's not cleared, then the entity will
> > likely never report itself as recovered from that problem. For example,
> > see all the calls to ServiceStateLogic.setExpectedState() [3].
> >
> > Is it enough to change [2] (to set a "problem indicator", and to set the
> > expected state to "stopped"), and to clear that problem indicator in
> > start [4] and restart [5]?
> >
> > Before I go down this road, does anyone have comments, questions, advice
> > or strong opinions?
> >
> > Aled
> >
> > [1] https://github.com/apache/brooklyn-server/pull/829
> > [2]
> >
> >
> https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L869
> > (in the doStop method's catch block)
> > [3]
> >
> >
> https://github.com/apache/brooklyn-server/blob/master/core/src/main/java/org/apache/brooklyn/core/entity/lifecycle/ServiceStateLogic.java#L158
> > [4]
> >
> >
> https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L339
> > [5]
> >
> >
> https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L661
> >
> > --
>
> Thomas Bouron • Senior Software Engineer @ Cloudsoft Corporation •
> https://cloudsoft.io/
> Github: https://github.com/tbouron
> Twitter: https://twitter.com/eltibouron
>

Reply via email to