Hi all,
TL;DR: low-level discussion of error-handling, and entity state; should
we change what we set the entity's "expected state" to when there are
errors?
---
I'm trying to fix https://issues.apache.org/jira/browse/BROOKLYN-534
(when stopping, ensure the `ServiceRestarter` doesn't restart the
entity, provisioning a new machine, in the situation where stop fails
for some reason).
See the failing test I've added in [1].
The problem is that, when the `location.release(machine)` fails, it sets
the entity's *expected state* to "on-fire". This means we've lost all
record of whether we wanted to be stopped or running. The
ServiceRetarter therefore (sensibly?!) responds to the "on-fire" by
restarting it!
Background: software prcoess entities normally have an "expected state"
saying what state they want to be in, and an "actual service state"
saying what it really is. The latter is inferred from a combination of
the expected state, whether the service.isUp, and also any explicit
"problem indicators" that have been set.
---
Options for fixing this include:
1. If the expected state is "on-fire", then don't do anything in such
policies - instead rely on whatever set that expected state to deal
with it (e.g. if something else called stop, which failed, then it
should deal with the error itself).
2. Don't set expected state to on-fire. Instead keep the expected state
as what we really want it to be, but include an appropriate "problem
indicator" (e.g. to say that "stop" failed).
Option (1) would probably be easy, but it's unclear where you'd hook in
logic to deal with such failures. It also feels wrong that we've lost
all knowledge of what we were trying to do.
Option (2) feels best. However, this stuff is fiddly. We'd need to make
the sure the "problem indicator" is cleared correctly when an entity
tries again to start or stop. If it's not cleared, then the entity will
likely never report itself as recovered from that problem. For example,
see all the calls to ServiceStateLogic.setExpectedState() [3].
Is it enough to change [2] (to set a "problem indicator", and to set the
expected state to "stopped"), and to clear that problem indicator in
start [4] and restart [5]?
Before I go down this road, does anyone have comments, questions, advice
or strong opinions?
Aled
[1] https://github.com/apache/brooklyn-server/pull/829
[2]
https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L869
(in the doStop method's catch block)
[3]
https://github.com/apache/brooklyn-server/blob/master/core/src/main/java/org/apache/brooklyn/core/entity/lifecycle/ServiceStateLogic.java#L158
[4]
https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L339
[5]
https://github.com/apache/brooklyn-server/blob/master/software/base/src/main/java/org/apache/brooklyn/entity/software/base/lifecycle/MachineLifecycleEffectorTasks.java#L661