Re: [hackathon] health checks

Georg Henzler Sun, 23 Sep 2018 02:45:13 -0700

Hi,

the official health check documentation [1] clearly states "Loadbalancers can query the health of a sling instance and decide to take it out or back into the list of used backends automatically" in its use cases. This was designed like that in 2013 (see [2] for the design on the wiki page) and works in many production deployments today. Health checks are highly optimised for exactly the load balancer use case, see e.g. [3].

Now systemready as it stands today is missing the optimisation that the health checks went through (see [4] and slides 24-28 in the adaptTo() 2014 presentation [5]). There is no parallel execution and no timeout handling. Systemready currently forces async and sequential execution [6]. There is no separate API bundle yet and there are no safeguards against badly tested custom checks that might take down the whole system.
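
To make the two missing pieces concrete, here is a minimal sketch of parallel execution with a shared timeout, using only java.util.concurrent. It is an illustration under assumptions, not the Sling HC executor and not systemready code: the names ParallelCheckRunner, Status and runAll are hypothetical, and a check is modelled as a plain Callable<Boolean>.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names and types are assumptions, not an existing API.
class ParallelCheckRunner {

    enum Status { OK, CRITICAL, TIMED_OUT }

    /** Runs all checks in parallel; a check still running after timeoutMs is reported as TIMED_OUT. */
    static Map<String, Status> runAll(Map<String, Callable<Boolean>> checks, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, checks.size()));
        try {
            List<String> names = new ArrayList<>();
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (Map.Entry<String, Callable<Boolean>> e : checks.entrySet()) {
                names.add(e.getKey());
                tasks.add(e.getValue());
            }
            // invokeAll returns once all tasks are done or the timeout expires;
            // tasks that have not completed by then are cancelled.
            List<Future<Boolean>> futures = pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
            Map<String, Status> results = new LinkedHashMap<>();
            for (int i = 0; i < names.size(); i++) {
                Future<Boolean> future = futures.get(i);
                Status status;
                if (future.isCancelled()) {
                    status = Status.TIMED_OUT;
                } else {
                    try {
                        status = future.get() ? Status.OK : Status.CRITICAL;
                    } catch (ExecutionException e) {
                        // A check that throws must not take down the caller.
                        status = Status.CRITICAL;
                    }
                }
                results.put(names.get(i), status);
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while running checks", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In this sketch a slow or hanging check ends up as TIMED_OUT instead of blocking the others, which is the kind of safeguard against badly tested custom checks meant above.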


If we don't move health checks to Felix:

* developers for the platform will be confused about the two SPI interfaces HealthCheck [7] and SystemReadyCheck [8], and there will be many unnecessary discussions around when to use which one
* to make systemready ready for production usage, a lot of code would have to be copied from Sling Health Checks (or, if not copied, reinvented)


There were two suggestions around the move:

* bridging the two worlds: generally this is not KISS (unnecessary code and bugs). For the direction HC->systemready it does not make sense (HCs are much more feature rich, no need to bridge to other frameworks); for the direction systemready->HC it would mean having a dependency from Felix to Sling (the lower level framework should not depend on the higher level framework)
* HCs have some dependencies on other sling modules: As Oliver correctly confirmed, this is really easy to remove as both Scheduler and Threads are *Commons* modules

So given all the above, we decided during the hackathon (~12 committers of both projects were present) that the only reasonable solution is to move Sling HCs to Felix (leaving a simple bridge on Sling side to keep supporting the SPI interface [7]).
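
As a rough illustration of what such a simple bridge could look like, here is an adapter sketch. Everything in it is an assumption for illustration: LegacyCheck and MovedCheck are hypothetical stand-ins, not the real interface from [7] and not any existing Felix API, and results are reduced to plain strings.

```java
// Hypothetical sketch: both interfaces are simplified stand-ins, not real SPI types.
class HealthCheckBridge {

    /** Stand-in for the existing Sling-side SPI that must keep working. */
    interface LegacyCheck {
        boolean isOk();
    }

    /** Stand-in for the SPI after the move to Felix. */
    interface MovedCheck {
        String execute(); // "OK" or "CRITICAL" in this sketch
    }

    /** Wraps a legacy check so that it can be executed through the new SPI. */
    static MovedCheck adapt(LegacyCheck legacy) {
        return () -> {
            try {
                return legacy.isOk() ? "OK" : "CRITICAL";
            } catch (RuntimeException e) {
                // A misbehaving legacy check is reported, not propagated.
                return "CRITICAL";
            }
        };
    }
}
```

The point of the sketch is only the direction of the dependency: the adapter would live on the Sling side and depend on the lower level SPI, never the other way around.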

Andrei and Christian, let's join forces to make Health Checks better as soon as they are moved to Felix (instead of wasting time getting systemready to the same maturity by solving the same problems again)!


Best Regards
Georg

[1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html
[2] https://cwiki.apache.org/confluence/display/SLING/Health+Checks+Executor+Design#HealthChecksExecutorDesign-C)HTTPfront-end,machineclient
[3] https://issues.apache.org/jira/browse/SLING-5874 - "Health Check Executor unnecessarily wastes 50ms"
[4] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#configuring-the-health-check-executor
[5] https://adapt.to/content/dam/adaptto/production/presentations/2014/adaptTo2014-Sling-Health-Checks-New-And-Noteworthy-Georg-Henzler.pdf/_jcr_content/renditions/original.media_file.download_attachment.file/adaptTo2014-Sling-Health-Checks-New-And-Noteworthy-Georg-Henzler.pdf
[6] https://github.com/apache/felix/blob/783666338c69ab1395f6ca62810d029aa48e70d5/systemready/src/main/java/org/apache/felix/systemready/impl/SystemReadyMonitorImpl.java#L93
    https://github.com/apache/felix/blob/783666338c69ab1395f6ca62810d029aa48e70d5/systemready/src/main/java/org/apache/felix/systemready/impl/SystemReadyMonitorImpl.java#L128
[7] https://github.com/apache/sling-org-apache-sling-hc-api/blob/6349d1fe285659dbeb741ab4f1bff664661e1b6f/src/main/java/org/apache/sling/hc/api/HealthCheck.java#L35
[8] https://github.com/apache/felix/blob/783666338c69ab1395f6ca62810d029aa48e70d5/systemready/src/main/java/org/apache/felix/systemready/SystemReadyCheck.java#L30



On 2018-09-21 13:41, Oliver Lietz wrote:

On Wednesday 19 September 2018 10:36:24 Andrei Dulvac wrote:

Hi guys.

Hi,

So first of all I acknowledge that conceptually there is some overlap - the
concepts of health, readiness, liveness themselves overlap.
When we wrote systemready, we did know of the Sling HCs (at least I did) and
how they're used. And that was one of the reasons why we decided not to
use them.

They're currently used, as Justin put it, for a much broader scope. A
system can fail a HC and it doesn't mean it's not ready. In one of
Bertrand's adapt.to presentations from 2013 [0], a security checklist is
mentioned explicitly - which we use in AEM. It's one of those things that
requires manual input to turn healthy. The docu also mentions a lot of
stuff, including using them as server-side JUnit tests [1]. And all those
things are great.

> Now the system readiness framework was mostly created to have something
> on Felix level and the capabilities of the Sling Health Checks weren’t
> known.

Not entirely accurate. We knew of the Sling HCs and initially we wanted to
donate systemready to Sling; but it's definitely good it went into Felix.

> The dependencies of Sling HC to Sling are minimal today already: It’s
> Sling thread pool (a Felix counterpart or just a plain java one can be used)
> and Sling Scheduler (also this can easily be replaced by the standard java
> mechanism).

In my opinion, that's A LOT. And they're prefixed by "Sling-". Systemready
has two dependencies: javax.servlet and the OSGi API. And it can
technically run on any framework. The deps were another reason why we didn't
use the HCs. Of course, those might grow as it becomes more mature.

Both Scheduler and Threads are *Commons* modules and can run without Sling.
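
For comparison, the "standard java mechanism" mentioned in the quoted text could look roughly like the following sketch of a periodically executed check with a cached result, built on ScheduledExecutorService only. The class AsyncCheck and its methods are hypothetical names for illustration; this is not how Sling HC or the Commons modules implement it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: a check that runs periodically in the background and caches its last result.
class AsyncCheck {

    private final AtomicReference<Boolean> lastResult = new AtomicReference<>();
    private final CountDownLatch firstRun = new CountDownLatch(1);
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "async-check");
                t.setDaemon(true); // do not keep the JVM alive because of the check
                return t;
            });

    AsyncCheck(BooleanSupplier probe, long periodMs) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                lastResult.set(probe.getAsBoolean());
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all further runs, so record a failure instead.
                lastResult.set(false);
            } finally {
                firstRun.countDown();
            }
        }, 0, periodMs, TimeUnit.MILLISECONDS);
    }

    /** Waits until the first execution has finished; returns false on timeout or interrupt. */
    boolean awaitFirstResult(long timeoutMs) {
        try {
            return firstRun.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns the cached result without executing the probe; null until the first run is done. */
    Boolean cachedResult() {
        return lastResult.get();
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

A caller such as a load balancer endpoint would only read cachedResult(), so an expensive probe never runs on the request thread.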


Regards,
O.

> What would make sense is a bridge where a subset of health checks could
> be fed into the readiness framework (i.e. if these X health checks pass,
> the system is considered "ready" and/or "alive").

> (you just create two tags for readiness and liveness each).

These don't seem to contradict each other.

Stefan, did you mean that the SystemReady checks would also become some
tagged HCs or the other way around? That some tagged HCs would be fed into
systemready?
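
To make the second reading concrete (tagged HCs fed into a readiness decision), here is a minimal sketch. All names in it (TaggedChecks, Check, allPass, the tag strings and the example check names) are hypothetical and only illustrate the idea; none of them are taken from the Sling HC or systemready APIs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: readiness and liveness derived from the subset of checks carrying a tag.
class TaggedChecks {

    static final class Check {
        final String name;
        final Set<String> tags;
        final BooleanSupplier probe;

        Check(String name, BooleanSupplier probe, String... tags) {
            this.name = name;
            this.probe = probe;
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    /**
     * Returns true if every check carrying the given tag passes.
     * Note: if no check carries the tag, this is vacuously true.
     */
    static boolean allPass(List<Check> checks, String tag) {
        return checks.stream()
                .filter(c -> c.tags.contains(tag))
                .allMatch(c -> c.probe.getAsBoolean());
    }
}
```

With one passing check tagged "readiness" and one failing check tagged only "security", allPass(checks, "readiness") is true while allPass(checks, "security") is false, i.e. the system counts as ready although a health check fails.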

So I'm game for unifying a bit at the Felix level and hopefully we don't go
overboard. I alone just don't have a solution yet that I can say I love
100%.

BTW, sorry I couldn't make it to the hackathon, it would have been great to
be part of the discussion.

- Andrei




---
[0] https://adapt.to/2013/en/schedule/18_healthcheck.html
[1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#health-checks-as-server-side-junit-tests

On Wed, Sep 19, 2018 at 1:15 AM Justin Edelson <jus...@justinedelson.com> wrote:
> Hi Georg,
> Great. It looks like I misread Stefan's notes as being more dramatic than
> they actually were intended to be :)
>
> Regards,
> Justin
>
> On Tue, Sep 18, 2018 at 4:48 PM Georg Henzler <slin...@ghenzler.de> wrote:
> > Hi Justin,
> >
> > there was quite some discussion at adaptTo() around this topic already. So
> > as it stands all requirements to run Sling-based applications in Kubernetes
> > are met already by Sling Health Checks (you just create two tags for
> > readiness and liveness each). HCs were developed from the first day with
> > the goal to have them used by load balancers (and not only manually). Also
> > Sling HCs are more mature in terms of parallel execution, timeout handling,
> > response customizing and special handling like asynchronous checks.
> >
> > Now the system readiness framework was mostly created to have something on
> > Felix level and the capabilities of the Sling Health Checks weren’t known.
> > I do agree that it would make sense to have it on Felix level though (more
> > visible to the non-Sling world, as a low level mechanism maybe best located
> > at the lowest framework level). The dependencies of Sling HC to Sling are
> > minimal today already: It’s Sling thread pool (a Felix counterpart or just a
> > plain java one can be used) and Sling Scheduler (also this can easily be
> > replaced by the standard java mechanism).
> >
> > > It might make more sense to invert this and identify what the readiness
> > > framework does (mostly in its OOTB checks and servlets)
> > > and merge that functionality into Sling Health Checks and then move Sling
> > > Health Checks (or solid chunks of it) to Felix.
> >
> > This was the intention, but let’s wait for the feedback from Andrei and
> > Christian.
> >
> > -Georg
> >
> > Sent from my iPhone
> >
> > > On 18. Sep 2018, at 16:31, Justin Edelson <jus...@justinedelson.com> wrote:
> > > Hi,
> > > After reviewing the presentation, this seems like kind of a stretch to me.
> > > IIUC, the System Readiness Framework is (as its name would suggest) solely
> > > concerned with "readiness" and "liveness" (as seen in the example use
> > > cases on slide 3) and the API is explicitly designed for this purpose
> > > without any opportunity for namespace extension (i.e. you can extend how
> > > "readiness" and "liveness" are determined but you can't add new
> > > categories). Sling Health Checks is concerned with a broader concept of
> > > "health" with no restrictions on namespacing. There are all kinds of
> > > reasons why a system may be considered "ready" but still fail specific
> > > health checks. In other words, I'm doubtful that there is an overlap here
> > > at a framework level. What would make sense is a bridge where a subset of
> > > health checks could be fed into the readiness framework (i.e. if these X
> > > health checks pass, the system is considered "ready" and/or "alive"). But
> > > I'd strongly suggest that the gamut of expression possible with the health
> > > check framework goes far beyond the scope of what the readiness framework
> > > is designed to do. It might make more sense to invert this and identify
> > > what the readiness framework does (mostly in its OOTB checks and servlets)
> > > and merge that functionality into Sling Health Checks and then move Sling
> > > Health Checks (or solid chunks of it) to Felix.
> > >
> > > Or perhaps I've misunderstood the intention of this email/F2F discussion.
> > > But the way this looks is that we are going to take something with a decent
> > > install base and replace it with something a few months old and with a much
> > > smaller functional scope. Just doesn't make sense to me.
> > >
> > > Regards,
> > > Justin
> > >
> > > On Thu, Sep 13, 2018 at 1:03 PM Stefan Seifert <sseif...@pro-vision.de> wrote:
> > >> - currently there is some overlap between sling health checks and the new
> > >> felix system readiness framework presented [1]
> > >> - the idea is to bring this together within felix
> > >> - provide a facade for the sling healthcheck API for backwards
> > >> compatibility
> > >>
> > >> stefan
> > >>
> > >> [1] https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
