I'd vote for option 2 for sure, it's the one that respects the meaning of "expected state". The code at [2] is more or less misusing the expected state, saying "I know I should be stopped, but something went wrong, so now I expect to be on-fire". This is counter to the design of the relationship between expected and actual state, so it should instead be setting a problem indicator.
I agree this may be a bit more fiddly than option 1; perhaps this could be simplified a bit by introducing a couple more convenience wrapper methods in 'ServiceStateLogic', like 'commenceStart', 'commenceStop', 'markStartFailed', 'markStopFailed', which would both set the expected state correctly and set or clear a well-known problem indicator for just this situation. Geoff On Wed, 27 Sep 2017 at 15:06 Thomas Bouron <[email protected]> wrote: > Hi Aled. > > Option 1 is definitely simpler. While it let the author decide what to do, > it means that each policy has to have an ad-hoc behaviour based on what the > "expected state" value is, which might not reflect the real expected state. > As you said in you example, the expected state is "on-fire" which should > never be IMO. It should always reflect... well, the expected state so > either (starting|started|stopping|stopped). > > This is why I'm leaning toward option 2. I appreciate that it can (will?) > be fiddly to implement this change across all entities/enrichers/policies > but I think it is worth the effort. Also, this means that we will have much > better information about what went wrong, the UI and CLI will benefit from > it (this obviously assume we do some work there to make those "problem > indicators" more visible) > > Best. > > On Mon, 18 Sep 2017 at 16:07 Aled Sage <[email protected]> wrote: > > > Hi all, > > > > TL;DR: low-level discussion of error-handling, and entity state; should > > we change what we set the entity's "expected state" to when there are > > errors? > > > > --- > > I'm trying to fix https://issues.apache.org/jira/browse/BROOKLYN-534 > > (when stopping, ensure the `ServiceRestarter` doesn't restart the > > entity, provisioning a new machine, in the situation where stop fails > > for some reason). > > > > See the failing test I've added in [1]. > > > > The problem is that, when the `location.release(machine)` fails, it sets > > the entity's *expected state* to "on-fire". This means we've lost all > > record of whether we wanted to be stopped or running. The > > ServiceRetarter therefore (sensibly?!) responds to the "on-fire" by > > restarting it! > > > > Background: software prcoess entities normally have an "expected state" > > saying what state they want to be in, and an "actual service state" > > saying what it really is. The latter is inferred from a combination of > > the expected state, whether the service.isUp, and also any explicit > > "problem indicators" that have been set. > > > > --- > > Options for fixing this include: > > > > 1. If the expected state is "on-fire", then don't do anything in such > > policies - instead rely on whatever set that expected state to deal > > with it (e.g. if something else called stop, which failed, then it > > should deal with the error itself). > > 2. Don't set expected state to on-fire. Instead keep the expected state > > as what we really want it to be, but include an appropriate "problem > > indicator" (e.g. to say that "stop" failed). > > > > Option (1) would probably be easy, but it's unclear where you'd hook in > > logic to deal with such failures. It also feels wrong that we've lost > > all knowledge of what we were trying to do. > > > > Option (2) feels best. However, this stuff is fiddly. We'd need to make > > the sure the "problem indicator" is cleared correctly when an entity > > tries again to start or stop. If it's not cleared, then the entity will > > likely never report itself as recovered from that problem. For example, > > see all the calls to ServiceStateLogic.setExpectedState() [3]. > > > > Is it enough to change [2] (to set a "problem indicator", and to set the > > expected state to "stopped"), and to clear that problem indicator in > > start [4] and restart [5]? > > > > Before I go down this road, does anyone have comments, questions, advice > > or strong opinions? > > > > Aled > > > > [1] https://github.com/apache/brooklyn-server/pull/829 > > [2] > > > > > https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L869 > > (in the doStop method's catch block) > > [3] > > > > > https://github.com/apache/brooklyn-server/blob/master/core/src/main/java/org/apache/brooklyn/core/entity/lifecycle/ServiceStateLogic.java#L158 > > [4] > > > > > https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L339 > > [5] > > > > > https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L661 > > > > -- > > Thomas Bouron • Senior Software Engineer @ Cloudsoft Corporation • > https://cloudsoft.io/ > Github: https://github.com/tbouron > Twitter: https://twitter.com/eltibouron >
