Hi all,
... "WARN" is always ambiguous. Coming from an "old-school" IT
operations background, I would interpret WARN as "it's still working and
can be used, but we should have a closer look at it".
* WARN is indeed confusing for the LB case - is the instance ready/alive
or not? That's why we went for GREEN/YELLOW/RED. So for us, WARN maps to
YELLOW, but the naming makes the difference clearer: YELLOW is "not ready
yet, but it's a matter of time" and RED is "yeah, this isn't going to be
ready without manual intervention".
This is an interesting point. I agree HealthCheck.WARN is different from
SystemReady.YELLOW today. It could make sense to introduce a new
status TEMP_CRITICAL for this:
TEMP_CRITICAL -> CRITICAL with tendency to OK (system not functional)
WARN -> OK with tendency to CRITICAL (system fully functional)
I created a table [1] in the wiki to make this clearer.
Regarding the response token names in general (e.g. OK vs. GREEN or
CRITICAL vs. RED): in the end it's just a name; machine clients will use
the HTTP response code mapped to that name (e.g. 503), so the name does
not matter too much. However, I do think that CRITICAL or OK is more
expressive than GREEN or RED, and changing the names would make migration
harder for existing health checks. YELLOW does not clearly convey "not
usable, on the way to usable"; here TEMP_CRITICAL would make that
immediately clear.
But then again, introducing another status might not be KISS. For the
Kubernetes use case it would also be enough to just implement the
current SystemReady.YELLOW to return HealthCheck.CRITICAL.
That leaves us with two options:
* Just use HealthCheck.CRITICAL for the startup case in Kubernetes
* Introduce TEMP_CRITICAL to signal CRITICAL with tendency to OK
(whereas WARN is OK with tendency to CRITICAL)
But whatever we choose, we should make sure we include a table like [1]
in the official API documentation, so that all health check implementers
are aligned as much as possible on what each response status is expected
to mean.
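
To make the two options a bit more concrete, here is a rough Java sketch. All type and method names (Status, ReadyState, StatusMapping, fromReadyState, worstOf) are hypothetical and not part of the Sling or Felix APIs. TEMP_CRITICAL exists only as a proposal in this thread, and the 200 codes for OK/WARN are an assumption for illustration; only the 503 for the critical case is taken from the discussion above.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical status model, declared in order of increasing severity. */
enum Status {
    OK(200, true),
    WARN(200, true),            // OK with tendency to CRITICAL (system fully functional)
    TEMP_CRITICAL(503, false),  // proposed: CRITICAL with tendency to OK (system not functional)
    CRITICAL(503, false);

    final int httpCode;         // assumed mapping, for illustration only
    final boolean functional;

    Status(int httpCode, boolean functional) {
        this.httpCode = httpCode;
        this.functional = functional;
    }

    /** Returns the most severe status by declaration order; OK if none are given. */
    static Status worstOf(Status... statuses) {
        return Arrays.stream(statuses).max(Comparator.naturalOrder()).orElse(OK);
    }
}

/** Hypothetical stand-in for the system readiness states discussed above. */
enum ReadyState { GREEN, YELLOW, RED }

final class StatusMapping {
    private StatusMapping() { }

    /**
     * Option 1 (useTempCritical == false): YELLOW is reported as CRITICAL.
     * Option 2 (useTempCritical == true): YELLOW is reported as TEMP_CRITICAL.
     */
    static Status fromReadyState(ReadyState state, boolean useTempCritical) {
        switch (state) {
            case GREEN:
                return Status.OK;
            case YELLOW:
                return useTempCritical ? Status.TEMP_CRITICAL : Status.CRITICAL;
            default:
                return Status.CRITICAL;
        }
    }
}
```

In this sketch both options produce the same HTTP code for a machine client such as a Kubernetes probe; the difference is only in the name a human reader sees.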
-Georg
[1]
https://cwiki.apache.org/confluence/display/SLING/Health+Check+Response+Status+Values
On 2018-09-28 20:28, Andrei Dulvac wrote:
Hi Jörg.
This is where systemready is a bit different:
* We have a disable config on our checks. This can definitely be
improved, and maybe moved into the monitor, as currently it's just the
way we implemented the individual checks.
"WARN" _could_ mean that, but that's
usually not what it means, at least not in any tool I've seen.
The monitoring part is something where I think we need to tread
carefully: yes, we can feed this into a monitoring tool, but I would not
make the HCs or systemready (or whatever comes of the two) a tool for
providing quantitative data, just values for binary (ternary, I guess)
metrics (qualitative info).
I agree with Christian that this might be the best opportunity to review
some of the design choices and (personal preference alert!) maybe split
it into modules with slightly different concerns. We're going to have
the Sling mapping to the new SPI in Felix anyway, so we have backward
compatibility.
Yes, there's a tradeoff, but let's talk about it.
- Andrei
On Fri, Sep 28, 2018 at 7:40 PM Jörg Hoh <jhoh...@googlemail.com.invalid> wrote:
I don't want to revive this discussion, but just wanted to share some of
my thoughts from when I initially started this with Bertrand
(coincidentally, we did that together at an adaptTo() some years ago).
* the idea was always to use the health checks to capture the
application state and make it usable for consumption by a load balancer
or any other external monitoring system. Implementing other checks (like
checks for security measures being implemented) is also possible, but
that has never been the primary use case.
* how the reporting states "OK", "WARN" and "CRITICAL" are interpreted
is entirely up to the developer implementing the health checks and the
team operating the system. While "OK" and "CRITICAL" seem quite [...]
Nevertheless, the developers of health checks should share the same
understanding.
* The idea was always that it should be possible to change the settings
manually at runtime, either to override accidentally incorrect settings
or to handle unforeseen situations. Removing a misbehaving check from
the state calculation (manually, without a deployment) is definitely a
use case which should be supported.
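
As an illustration of that last point, here is a minimal Java sketch of an aggregate state that lets an operator exclude a single check at runtime. The names (Level, ToggleableAggregator, register, disable, enable, overall) are hypothetical and do not correspond to the Sling or Felix health check APIs; it only shows the idea, under the assumption that each check can be modelled as a supplier of a status.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical status levels, declared in order of increasing severity. */
enum Level { OK, WARN, CRITICAL }

/** Aggregates named checks; individual checks can be excluded at runtime. */
final class ToggleableAggregator {
    private final Map<String, Supplier<Level>> checks = new ConcurrentHashMap<>();
    private final Set<String> disabled = ConcurrentHashMap.newKeySet();

    void register(String name, Supplier<Level> check) {
        checks.put(name, check);
    }

    /** Excludes a check from the state calculation without removing it. */
    void disable(String name) {
        disabled.add(name);
    }

    /** Includes a previously disabled check again. */
    void enable(String name) {
        disabled.remove(name);
    }

    /** Returns the most severe level among all enabled checks; OK if there are none. */
    Level overall() {
        Level worst = Level.OK;
        for (Map.Entry<String, Supplier<Level>> entry : checks.entrySet()) {
            if (disabled.contains(entry.getKey())) {
                continue; // manually switched off by an operator
            }
            Level level = entry.getValue().get();
            if (level.compareTo(worst) > 0) {
                worst = level;
            }
        }
        return worst;
    }
}
```

A misbehaving check could then be switched off with something like `aggregator.disable("flaky")` and re-enabled later, without any redeployment.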
Jörg
On Thu, 13 Sep 2018 at 19:03, Stefan Seifert <sseif...@pro-vision.de> wrote:
> - currently there is some overlap between sling health checks and the new
> felix system readiness framework presented [1]
> - the idea is to bring this together within felix
> - provide a facade for the sling healthcheck API for backwards
> compatibility
>
> stefan
>
> [1]
> https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
--
Cheers,
Jörg Hoh,
http://cqdump.wordpress.com
Twitter: @joerghoh