Re: How to unset the ON_FIRE state?

Alex Heneveld Thu, 31 Jul 2014 08:19:35 -0700


Hi Svet, All,

Good time to raise this. I've been wondering about a similar thing, and mentioned this at #101. I'd like to see a way that new requirements for service_up and service_state can easily be added by third parties.

Currently we explicitly attach a "computeServiceUp" computation at a few entities (e.g. to say service is up iff service_state = running AND REST to /foo returns 200). But this is ad hoc, and it does not easily allow third-party updates (in particular clearing problems cleanly, i.e. having independent problem detectors and clearers). A related ambiguity is in SERVICE_STATE, which is a combination of expected state together with being on_fire for some problems.


I'd like to suggest:

1) We add a SERVICE_NOT_UP_INDICATORS *map* sensor

2) We attach an enricher which sets SERVICE_UP based on SERVICE_NOT_UP_INDICATORS.isEmpty()

Then up-ness is controlled by effectors and policies which add and remove SERVICE_NOT_UP_INDICATORS entries, keyed by an identifier unique to them. For instance all clusters and fabrics would simply subscribe to children's UP events and add such an indicator object under the "cluster.size" key if there are not a sufficient number of UP children. (Incidentally this would solve an issue where cluster health is not always cleared appropriately when nodes come back online.) The existing isRunning checks for SoftwareProcess entities would also add such an indicator if the process is detected as not running.
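To make points 1) and 2) concrete, here is a minimal sketch in plain Java. It is not Brooklyn code: the class UpIndicatorsSketch, its method names, and the "process.running" key used in the note below are all made up for illustration; only the "cluster.size" key and the isEmpty() rule come from the proposal above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an entity; none of these names are Brooklyn API.
class UpIndicatorsSketch {
    // SERVICE_NOT_UP_INDICATORS: reporter's unique key -> reason the service is not up.
    private final Map<String, String> notUpIndicators = new LinkedHashMap<>();
    // SERVICE_UP, derived only from the map above.
    private boolean serviceUp = true;

    // Called by any policy, feed or effector that detects a problem.
    synchronized void setNotUp(String key, String reason) {
        notUpIndicators.put(key, reason);
        recomputeServiceUp();
    }

    // Called by the same party once its own problem has gone away.
    synchronized void clearNotUp(String key) {
        notUpIndicators.remove(key);
        recomputeServiceUp();
    }

    // Plays the role of the enricher: SERVICE_UP iff there are no indicators.
    private void recomputeServiceUp() {
        serviceUp = notUpIndicators.isEmpty();
    }

    synchronized boolean isServiceUp() {
        return serviceUp;
    }

    // What a cluster might do on each child UP event (assumed signature).
    void onChildrenUpCount(int upChildren, int requiredUp) {
        if (upChildren < requiredUp) {
            setNotUp("cluster.size", upChildren + " of " + requiredUp + " required children are up");
        } else {
            clearNotUp("cluster.size");
        }
    }
}
```

Because each reporter only touches its own key, a cluster clearing "cluster.size" cannot accidentally hide a separate "process.running" indicator set by someone else.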


And we do something similar for SERVICE_STATE:

Introduce a SERVICE_PROBLEMS *map* attribute and an enricher which sets SERVICE_STATE based on the problems being empty and the value of a new sensor SERVICE_STATE_EXPECTED. SERVICE_STATE_EXPECTED is set by the lifecycle tasks, and then: if a service is expected to be starting or stopping, that is shown as SERVICE_STATE; otherwise if !SERVICE_PROBLEMS.isEmpty() it is set as ON_FIRE; otherwise it is set based on SERVICE_STATE_EXPECTED and SERVICE_UP. Also we could have an enricher which puts a SERVICE_PROBLEM if `(SERVICE_STATE_EXPECTED==RUNNING && SERVICE_UP==false)`.
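The decision order above could be sketched roughly as follows. This is illustrative Java only, not Brooklyn API: the class, the enum, and the "service.not.up" key are invented here. The last branch is an assumption, since the rule for the "otherwise" case is not spelled out; the sketch simply reports the expected state and relies on the second enricher to turn "expected RUNNING but not up" into a problem.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; a sketch of the proposed rules only.
class ServiceStateSketch {
    enum Lifecycle { CREATED, STARTING, RUNNING, STOPPING, STOPPED, ON_FIRE }

    // Enricher 1: derive SERVICE_STATE from SERVICE_STATE_EXPECTED and SERVICE_PROBLEMS.
    static Lifecycle computeState(Lifecycle expected, Map<String, String> problems) {
        // A service that is expected to be starting or stopping is shown as such.
        if (expected == Lifecycle.STARTING || expected == Lifecycle.STOPPING) {
            return expected;
        }
        // Any outstanding problem puts the service on fire.
        if (!problems.isEmpty()) {
            return Lifecycle.ON_FIRE;
        }
        // Assumption: with no problems, show the expected state as-is.
        return expected;
    }

    // Enricher 2: returns a copy of the problems map with the "not up" entry
    // added when (expected == RUNNING && !serviceUp), and removed otherwise.
    static Map<String, String> withNotUpProblem(
            Lifecycle expected, boolean serviceUp, Map<String, String> problems) {
        Map<String, String> result = new HashMap<>(problems);
        if (expected == Lifecycle.RUNNING && !serviceUp) {
            result.put("service.not.up", "expected RUNNING but SERVICE_UP is false");
        } else {
            result.remove("service.not.up");
        }
        return result;
    }
}
```

Keeping the second enricher separate means SERVICE_UP never has to be consulted directly when computing the state; it only ever shows up as one more keyed problem.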

This is a touch more complicated than SERVICE_UP but I think it would be clearer and could simplify some of the "isRunning" logic checks during post-start. Where we want to wait on multiple things to determine up-ness, we can insert a SERVICE_NOT_UP_INDICATOR manually, then wait for the appropriate enricher/feed to clear it. And it could handle the case where a subscription should be responsible for the final transition to EXPECTED=RUNNING (there are a few cases where start will set RUNNING early, and a subscription comes along later and finishes the job, after sensors have been emitted). And of course it would support Svet's use case where the "abc-compliance" policy would simply add an entry { abc-compliance: "Replication violation" } to the SERVICE_PROBLEMS, and clear it if it becomes okay, so that service state is automatically updated to be ON_FIRE when there is a compliance problem.
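For Svet's use case specifically, the policy side might look something like this. Again a sketch under assumptions: CompliancePolicySketch and its methods are hypothetical, and a plain shared Map stands in for the SERVICE_PROBLEMS attribute.

```java
import java.util.Map;

// Hypothetical policy; the shared map stands in for the SERVICE_PROBLEMS attribute.
class CompliancePolicySketch {
    static final String KEY = "abc-compliance";

    private final Map<String, String> serviceProblems;

    CompliancePolicySketch(Map<String, String> serviceProblems) {
        this.serviceProblems = serviceProblems;
    }

    // Called after each compliance check; touches only this policy's own key.
    void onCheck(boolean compliant) {
        if (compliant) {
            serviceProblems.remove(KEY);
        } else {
            serviceProblems.put(KEY, "Replication violation");
        }
    }

    // Simplified view of the enricher: on fire iff any problem is outstanding
    // (ignoring the starting/stopping cases).
    static boolean isOnFire(Map<String, String> serviceProblems) {
        return !serviceProblems.isEmpty();
    }
}
```

The policy never needs to know what state to go "back" to, and it cannot clear an ON_FIRE raised by some other detector, which are the two difficulties Svet describes.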

Finally for tracking enrichers, sensor feeds, subscriptions, and policies, I suggest we add an optional "uniqueName", the presence of which blocks the addition of something of the same kind with the same uniqueName. This will better solve the problem described in #101, and it gives us a way to allow code to find and/or remove some of the enrichers above if they need to customize logic.
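One way the uniqueName rule could behave, sketched in plain Java. The registry class, the "kind" strings, and the method signatures are all assumptions for illustration, not an existing Brooklyn interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical registry of enrichers, feeds, subscriptions and policies.
class UniqueNameRegistrySketch {
    // kind (e.g. "enricher") -> uniqueName -> item
    private final Map<String, Map<String, Object>> named = new HashMap<>();
    // Items added without a uniqueName are never blocked.
    private final List<Object> unnamed = new ArrayList<>();

    // Returns false if an item of the same kind with the same uniqueName exists.
    synchronized boolean add(String kind, String uniqueName, Object item) {
        if (uniqueName == null) {
            unnamed.add(item);
            return true;
        }
        Map<String, Object> ofKind = named.computeIfAbsent(kind, k -> new HashMap<>());
        return ofKind.putIfAbsent(uniqueName, item) == null;
    }

    // Lets customizing code look up a previously added item by name.
    synchronized Optional<Object> find(String kind, String uniqueName) {
        Map<String, Object> ofKind = named.get(kind);
        return ofKind == null ? Optional.empty() : Optional.ofNullable(ofKind.get(uniqueName));
    }

    // Returns true if something was removed; after that the name is free again.
    synchronized boolean remove(String kind, String uniqueName) {
        Map<String, Object> ofKind = named.get(kind);
        return ofKind != null && ofKind.remove(uniqueName) != null;
    }
}
```

Code that wants different SERVICE_UP logic could then remove the default enricher by name and add its own under the same name.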


Best
Alex


On 31/07/2014 05:34, Svetoslav Neykov wrote:

Hi,

It seems that there is no way to unset an ON_FIRE state previously set by my
code. First it is not clear what the new state should be and second some
other code could've set the state as well meanwhile.

Here is some background. I am developing sample policies which monitor
machines for compliance with certain rules. If the rule is broken the
machine should be set ON_FIRE. So far so good. The problem is that once the
machines are back in compliant state I need to clear the error.

The ON_FIRE state in Lifecycle seems orthogonal to the rest of the states.
Logically we can have ON_FIRE while RUNNING or STARTING. It could be a
temporary error, not a final state in the state machine.

Just as an observation, we could have an entity ON_FIRE and SERVICE_UP at
the same time.

Possible solutions to the ON_FIRE issue could be:

* Forbidding manual setting of ON_FIRE state, instead creating a
mechanism to register functions returning the state. By default it would be
SERVICE_STATE == RUNNING. The downside is that it is a poll-based approach.

* Reference counting the setting of ON_FIRE. The downside is that it
requires tedious housekeeping, leading to bugs.

Perhaps a combination of both approaches would be best - use the first one
with a long poll, with the ability to trigger the check manually.

Any thoughts?

Best,

Svet.
