Re: [hackathon] health checks - response status values

Georg Henzler Fri, 12 Oct 2018 07:46:49 -0700

Hi all,

... "WARN" is always ambiguous. Coming from an "old-school" IT
operations background, I would interpret WARN as "it's still working and
can be used, but we should have a closer look at it".
* WARN is indeed confusing for the LB case - is the instance ready/alive
or not? That's why we went for GREEN/YELLOW/RED. So for us, WARN maps to
YELLOW, but the naming makes the difference clearer: YELLOW is "not ready
yet but it's a matter of time" and RED is "yeah, this isn't going to be
ready without manual intervention".

This is an interesting point. I agree HealthCheck.WARN is different from SystemReady.YELLOW today. It could make sense to introduce a new status TEMP_CRITICAL for this:


TEMP_CRITICAL -> CRITICAL with tendency to OK (system not functional)
WARN -> OK with tendency to CRITICAL (system fully functional)
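To make the distinction concrete, here is a minimal Java sketch of the four statuses with a flag saying whether the system is functional. The class and method names (StatusSketch, isFunctional) are made up for illustration and are not part of the Sling or Felix API. The flags for WARN and TEMP_CRITICAL follow the two definitions above; those for OK and CRITICAL are assumed from the names.

```java
// Hypothetical sketch; StatusSketch and its members are illustrative names,
// not the Sling or Felix health check API.
final class StatusSketch {

    enum Status {
        OK(true),              // assumed: fully functional
        WARN(true),            // OK with tendency to CRITICAL (system fully functional)
        TEMP_CRITICAL(false),  // CRITICAL with tendency to OK (system not functional)
        CRITICAL(false);       // assumed: not functional

        private final boolean functional;

        Status(boolean functional) {
            this.functional = functional;
        }

        boolean isFunctional() {
            return functional;
        }
    }

    public static void main(String[] args) {
        // Print each status together with its functional flag.
        for (Status s : Status.values()) {
            System.out.println(s + " functional=" + s.isFunctional());
        }
    }
}
```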

I created a table [1] in the wiki to make this clearer.

Regarding the response token names in general (e.g. OK vs. GREEN or CRITICAL vs. RED): in the end it's just a name; machine clients will use the HTTP response code mapped to that name (e.g. 503), so the name does not matter too much. However, I do think that CRITICAL or OK is more expressive than GREEN or RED, and changing the names would make migration harder for existing health checks. YELLOW does not clearly say "not usable, but on the way to usable"; here TEMP_CRITICAL would make that immediately clear.
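As a rough illustration of why the name matters little to machine clients, a status-to-HTTP mapping could look like the sketch below. Only the 503 code is mentioned in this thread. Returning 200 for OK and WARN, and 503 for both TEMP_CRITICAL and CRITICAL, are assumptions made for the example, and HttpMappingSketch and toHttpCode are hypothetical names.

```java
// Hypothetical sketch; not the actual servlet mapping used by Sling or Felix.
final class HttpMappingSketch {

    enum Status { OK, WARN, TEMP_CRITICAL, CRITICAL }

    static final int HTTP_OK = 200;           // assumed code for "usable"
    static final int HTTP_UNAVAILABLE = 503;  // code mentioned in the thread

    // A load balancer only looks at the code, not at the status name.
    static int toHttpCode(Status status) {
        switch (status) {
            case OK:
            case WARN:
                return HTTP_OK;
            default:
                // TEMP_CRITICAL and CRITICAL: instance must not receive traffic.
                return HTTP_UNAVAILABLE;
        }
    }

    public static void main(String[] args) {
        for (Status s : Status.values()) {
            System.out.println(s + " -> " + toHttpCode(s));
        }
    }
}
```

With such a mapping, TEMP_CRITICAL and CRITICAL are indistinguishable to a load balancer; the extra status would mainly inform humans and tools that read the status name.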

But then again, introducing another status might not be KISS. Speaking for the Kubernetes use case, it would also be enough to just implement the current SystemReady.YELLOW to return HealthCheck.CRITICAL.


That leaves us with two options:
* Just use HealthCheck.CRITICAL for the startup case in Kubernetes

* Introduce TEMP_CRITICAL to signal CRITICAL with tendency to OK (whereas WARN is OK with tendency to CRITICAL)
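The two options differ only in how the YELLOW readiness state is translated. A minimal sketch, assuming GREEN maps to OK and RED to CRITICAL (neither mapping is spelled out in this thread); ReadinessMappingSketch, map and the useTempCritical switch are hypothetical names:

```java
// Hypothetical sketch of the two options; not an existing API.
final class ReadinessMappingSketch {

    enum Readiness { GREEN, YELLOW, RED }

    enum Status { OK, WARN, TEMP_CRITICAL, CRITICAL }

    // useTempCritical == false: option 1 (YELLOW becomes CRITICAL)
    // useTempCritical == true:  option 2 (YELLOW becomes TEMP_CRITICAL)
    static Status map(Readiness readiness, boolean useTempCritical) {
        switch (readiness) {
            case GREEN:
                return Status.OK;        // assumed mapping
            case YELLOW:
                return useTempCritical ? Status.TEMP_CRITICAL : Status.CRITICAL;
            default:
                return Status.CRITICAL;  // RED, assumed mapping
        }
    }

    public static void main(String[] args) {
        System.out.println("Option 1: YELLOW -> " + map(Readiness.YELLOW, false));
        System.out.println("Option 2: YELLOW -> " + map(Readiness.YELLOW, true));
    }
}
```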

But whatever we choose, we should make sure we include a table like [1] in the official API documentation, so that all health check implementers agree as much as possible on what their response status is expected to mean.


-Georg

[1] https://cwiki.apache.org/confluence/display/SLING/Health+Check+Response+Status+Values



On 2018-09-28 20:28, Andrei Dulvac wrote:

Hi Jörg.

This is where systemready is a bit different:
* We have a disable config on our checks - this can definitely be improved
and maybe moved into the monitor, as currently it's just the way we
implemented the individual checks.

"WARN" _could_ mean that, but that's usually not what it means, at least not in any tool I've seen.

The monitoring part is something I think we need to tread carefully on:
Yes, we can feed this into a monitoring tool, but I would not make the HCs
or systemready or whatever comes of the two a tool for providing
quantitative data, just values for binary (ternary, I guess) metrics
(qualitative info).
I agree with Christian that this might be the best opportunity to review some of the design choices and (personal preference alert!) maybe split it into modules with slightly different concerns. We're going to have the Sling mapping to the new SPI in Felix anyway, so we have backward compatibility.
Yes, there's a tradeoff, but let's talk about it.

- Andrei
On Fri, Sep 28, 2018 at 7:40 PM Jörg Hoh <jhoh...@googlemail.com.invalid>
wrote:
I don't want to revive this discussion, but just wanted to share some of
the ideas I had when I initially started this with Bertrand (we happened
to do that together at an adapTo() some years ago).
* The idea was always to use the healthchecks to capture the application
state and make it usable for consumption by a load balancer or any other
external monitoring system. Implementing other checks (like checks for
security measures being implemented) is also possible, but they have never
been the primary use case.
* How the reporting states "OK", "WARN" and "CRITICAL" are
interpreted is totally up to the developer implementing the healthchecks
and the team operating the system. While "OK" and "CRITICAL" seem quite

Nevertheless, the developers of healthchecks should share the same understanding.
* The idea was always that it should be possible to change the settings
manually at runtime, either to override accidentally incorrect settings
or to handle unforeseen situations; removing a misbehaving check from the
state calculation (manually, without deployment) is definitely a use case
which should be supported.
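
As an illustration of the last point, an overall state calculation that skips manually disabled checks might be sketched as follows. AggregationSketch, aggregate and the check names are hypothetical, and taking the worst status among the remaining checks is an assumed aggregation rule, not something specified in this thread.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the Sling or Felix implementation.
final class AggregationSketch {

    // Declared in order of increasing severity so compareTo() can rank them.
    enum Status { OK, WARN, CRITICAL }

    // Returns the worst status among all checks that are not disabled.
    // With no enabled checks the result is OK (an assumption of this sketch).
    static Status aggregate(Map<String, Status> results, Set<String> disabledChecks) {
        Status worst = Status.OK;
        for (Map.Entry<String, Status> entry : results.entrySet()) {
            if (disabledChecks.contains(entry.getKey())) {
                continue; // removed from the state calculation at runtime
            }
            if (entry.getValue().compareTo(worst) > 0) {
                worst = entry.getValue();
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        Map<String, Status> results =
                Map.of("repository", Status.OK, "diskspace", Status.CRITICAL);
        System.out.println(aggregate(results, Set.of()));
        System.out.println(aggregate(results, Set.of("diskspace")));
    }
}
```

In a real system the set of disabled checks would come from runtime configuration rather than a literal, so an operator can change it without a deployment.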

Jörg

On Thu, 13 Sep 2018 at 19:03, Stefan Seifert <sseif...@pro-vision.de> wrote:

> - currently there is some overlap between sling health checks and the new
> felix system readiness framework presented [1]
> - the idea is to bring this together within felix
> - provide a facade for the sling healthcheck API for backwards
> compatibility
>
> stefan
>
> [1]
>
> https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
>
>
>

--
Cheers,
Jörg Hoh,

http://cqdump.wordpress.com
Twitter: @joerghoh
