Hi all,

TL;DR: low-level discussion of error-handling, and entity state; should we change what we set the entity's "expected state" to when there are errors?

---
I'm trying to fix https://issues.apache.org/jira/browse/BROOKLYN-534 (when stopping, ensure the `ServiceRestarter` doesn't restart the entity, provisioning a new machine, in the situation where stop fails for some reason).

See the failing test I've added in [1].

The problem is that, when the `location.release(machine)` fails, it sets the entity's *expected state* to "on-fire".  This means we've lost all record of whether we wanted to be stopped or running. The ServiceRetarter therefore (sensibly?!) responds to the "on-fire" by restarting it!

Background: software prcoess entities normally have an "expected state" saying what state they want to be in, and an "actual service state" saying what it really is. The latter is inferred from a combination of the expected state, whether the service.isUp, and also any explicit "problem indicators" that have been set.

---
Options for fixing this include:

1. If the expected state is "on-fire", then don't do anything in such
   policies - instead rely on whatever set that expected state to deal
   with it (e.g. if something else called stop, which failed, then it
   should deal with the error itself).
2. Don't set expected state to on-fire. Instead keep the expected state
   as what we really want it to be, but include an appropriate "problem
   indicator" (e.g. to say that "stop" failed).

Option (1) would probably be easy, but it's unclear where you'd hook in logic to deal with such failures. It also feels wrong that we've lost all knowledge of what we were trying to do.

Option (2) feels best. However, this stuff is fiddly. We'd need to make the sure the "problem indicator" is cleared correctly when an entity tries again to start or stop. If it's not cleared, then the entity will likely never report itself as recovered from that problem. For example, see all the calls to ServiceStateLogic.setExpectedState() [3].

Is it enough to change [2] (to set a "problem indicator", and to set the expected state to "stopped"), and to clear that problem indicator in start [4] and restart [5]?

Before I go down this road, does anyone have comments, questions, advice or strong opinions?

Aled

[1] https://github.com/apache/brooklyn-server/pull/829
[2] https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L869 (in the doStop method's catch block) [3] https://github.com/apache/brooklyn-server/blob/master/core/src/main/java/org/apache/brooklyn/core/entity/lifecycle/ServiceStateLogic.java#L158 [4] https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L339 [5] https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L661

Reply via email to