Hi all,

... "WARN" is always ambiguous. Coming from an "old-school" IT
operations background, I would interpret WARN as "it's still working and
can be used, but we should have a closer look at it".
* WARN is indeed confusing for the LB case: is the instance ready/alive or not? That's why we went for GREEN/YELLOW/RED. So for us, WARN maps to YELLOW, but the naming makes the difference clearer: YELLOW is "not ready
yet but it's a matter of time" and RED is "yeah, this isn't going to be
ready without manual intervention".

This is an interesting point. I agree HealthCheck.WARN is different from SystemReady.YELLOW today. It could make sense to introduce a new status TEMP_CRITICAL for this:

TEMP_CRITICAL -> CRITICAL with tendency to OK (system not functional)
WARN -> OK with tendency to CRITICAL (system fully functional)

I created a table [1] in the wiki to make this clearer.

Regarding the response token names in general (e.g. OK vs. GREEN or CRITICAL vs. RED): in the end it's just a name. Machine clients will use the HTTP response code that the name is mapped to (e.g. 503), so the name does not matter too much. However, I do think that CRITICAL or OK is more expressive than GREEN or RED, and changing the names would make migration harder for existing health checks. YELLOW does not clearly say "not usable, on the way to usable"; here TEMP_CRITICAL would make that immediately clear.
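
To illustrate why the token name matters little to machine clients, here is a minimal Java sketch of a name-to-HTTP-code mapping. The type and method names are hypothetical (not the Sling or Felix API), TEMP_CRITICAL is only the status proposed in this mail, and which names map to 200 versus 503 is an assumption:

```java
// Hypothetical sketch, not the Sling/Felix API. TEMP_CRITICAL is only
// a proposal from this thread; the 200/503 split is an assumption.
enum ResponseStatus { OK, WARN, TEMP_CRITICAL, CRITICAL }

final class HttpStatusMapping {
    static final int HTTP_OK = 200;
    static final int HTTP_SERVICE_UNAVAILABLE = 503;

    /** A load balancer only looks at this code, never at the status name. */
    static int toHttpCode(ResponseStatus status) {
        switch (status) {
            case OK:
            case WARN:
                // WARN is "fully functional", so the instance stays in rotation.
                return HTTP_OK;
            default:
                // TEMP_CRITICAL and CRITICAL both mean "not functional".
                return HTTP_SERVICE_UNAVAILABLE;
        }
    }
}
```

With such a mapping, TEMP_CRITICAL and CRITICAL are indistinguishable to a load balancer; the extra name would only help humans reading the result.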

But then again, introducing another status might not be KISS. Speaking for the Kubernetes use case, it would also be enough to just implement the current SystemReady.YELLOW to return HealthCheck.CRITICAL.

That leaves us with two options:
* Just use HealthCheck.CRITICAL for the startup case in Kubernetes
* Introduce TEMP_CRITICAL to signal CRITICAL with tendency to OK (whereas WARN is OK with tendency to CRITICAL)
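
The two options can be sketched side by side in Java. All type and method names below are hypothetical stand-ins for the SystemReady states and health check statuses discussed here, not existing API:

```java
// Hypothetical stand-ins for SystemReady states and HealthCheck statuses.
enum ReadyState { GREEN, YELLOW, RED }
enum CheckStatus { OK, WARN, TEMP_CRITICAL, CRITICAL }

final class ReadyStateMapping {
    /** Option 1: no new status; a starting (YELLOW) system reports CRITICAL. */
    static CheckStatus withExistingStatuses(ReadyState state) {
        return state == ReadyState.GREEN ? CheckStatus.OK : CheckStatus.CRITICAL;
    }

    /** Option 2: YELLOW gets its own status, CRITICAL with tendency to OK. */
    static CheckStatus withTempCritical(ReadyState state) {
        switch (state) {
            case GREEN:
                return CheckStatus.OK;
            case YELLOW:
                return CheckStatus.TEMP_CRITICAL;
            default:
                return CheckStatus.CRITICAL;
        }
    }
}
```

Neither mapping ever produces WARN, since none of the three SystemReady states means "fully functional, but look closer".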

But whatever we choose, we should make sure we include a table like [1] in the official API documentation, so that all health check implementers share, as far as possible, the same expectations of what their response status means.

-Georg

[1] https://cwiki.apache.org/confluence/display/SLING/Health+Check+Response+Status+Values


On 2018-09-28 20:28, Andrei Dulvac wrote:
Hi Jörg.

This is where systemready is a bit different:
* We have a disable config on our checks - this can definitely be improved
and maybe have that in the monitor as currently it's just a way we
implemented the individual checks.
 "WARN" _could_ mean that, but that's
usually not what it means, at least not in any tool I've seen.

The monitoring part is something I think needs to be treated carefully:
Yes, we can feed this into a monitoring tool, but I would not make the HCs
or systemready, or whatever comes of the two, a tool for providing
quantitative data, just values for binary (ternary, I guess) metrics
(qualitative info).

I agree with Christian that this might be the best opportunity to review some of the design choices and (personal preference alert!) maybe split it into modules with slightly different concerns. We're going to have the Sling mapping to the new SPI in Felix anyway, so we have backward compatibility.

Yes, there's a tradeoff, but let's talk about it.

- Andrei

On Fri, Sep 28, 2018 at 7:40 PM Jörg Hoh <jhoh...@googlemail.com.invalid>
wrote:

I don't want to revive this discussion, but just wanted to share some of the ideas I had when I initially started this with Bertrand (coincidentally, we
did that together at an adaptTo() some years ago).

* The idea was always to use the healthchecks to capture the application state and make it available for consumption by a load balancer or any other
external monitoring system. Implementing other checks (like checks for
security measures being in place) is also possible, but they have never
been the primary use case.
* How the reporting states "OK", "WARN" and "CRITICAL" are
interpreted is totally up to the developer implementing the healthchecks and the team operating the system. While "OK" and "CRITICAL" seem quite
Nevertheless, the
developer of healthchecks should have the same understanding.
* The idea was always that it should be possible to change the settings manually at runtime, either to override accidentally incorrect settings or to handle unforeseen situations; removing a misbehaving check from the state calculation (manually, without deployment) is definitely a use case
which should be supported.
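
A minimal Java sketch of that last point, excluding a check from the state calculation at runtime without a deployment. The class and method names are hypothetical and this is not how Sling health checks are actually configured; it only illustrates the idea under the assumption that the overall state is the worst state among the enabled checks:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical types; declared in order of increasing severity.
enum State { OK, WARN, CRITICAL }

final class AggregatedState {
    // Names of checks excluded from the calculation; changeable at runtime.
    private final Set<String> disabled = ConcurrentHashMap.newKeySet();

    void disable(String checkName) { disabled.add(checkName); }

    void enable(String checkName) { disabled.remove(checkName); }

    /** Worst state among all checks that are not disabled; OK if none remain. */
    State aggregate(Map<String, State> results) {
        State worst = State.OK;
        for (Map.Entry<String, State> entry : results.entrySet()) {
            if (disabled.contains(entry.getKey())) {
                continue; // misbehaving check removed by an operator
            }
            if (entry.getValue().compareTo(worst) > 0) {
                worst = entry.getValue();
            }
        }
        return worst;
    }
}
```

An operator would call disable("some-check") through whatever management interface is available, and the next aggregate() call would ignore that check.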

Jörg

On Thu, 13 Sep 2018 at 19:03, Stefan Seifert <
sseif...@pro-vision.de> wrote:

> - currently there is some overlap between Sling health checks and the new
> Felix system readiness framework presented [1]
> - the idea is to bring this together within felix
> - provide a facade for the sling healthcheck API for backwards
> compatibility
>
> stefan
>
> [1] https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
>
>
>

--
Cheers,
Jörg Hoh,

http://cqdump.wordpress.com
Twitter: @joerghoh
