Hi,

the official health check documentation [1] clearly states "Load balancers can query the health of a sling instance and decide to take it out or back into the list of used backends automatically" in use cases. This was designed like that in 2013 (see [2] for the design on the wiki page) and works in many production deployments today. Health checks are highly optimised for exactly the load balancer use case, see e.g. [3].

Now systemready as it stands today is missing the optimisation that the health checks went through (see [4] and slides 24-28 in the adaptTo 2014 presentation [5]). There is no parallel execution and no timeout handling. At the moment, systemready forces async and sequential execution [6]. There is no separate API bundle yet, and there are no safeguards against badly tested custom checks that might take down the whole system.
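
To make the point about parallel execution and timeout handling concrete, here is a rough sketch in plain Java. It is not the code of the Sling Health Check executor; the Check interface, the Status enum and the class name are made up for illustration, and it only uses java.util.concurrent. The idea is that all checks run in parallel and a slow or hanging check is cancelled after a timeout instead of blocking the others.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ParallelCheckSketch {

    /** Hypothetical stand-in for a check SPI; NOT the Sling or Felix interface. */
    interface Check {
        String name();
        boolean isOk() throws Exception;
    }

    enum Status { OK, FAILED, TIMED_OUT }

    /** Runs all checks in parallel and waits at most timeoutMs for the whole batch. */
    static Map<String, Status> runAll(List<Check> checks, long timeoutMs) {
        Map<String, Status> results = new LinkedHashMap<>();
        if (checks.isEmpty()) {
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(checks.size());
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (Check check : checks) {
                tasks.add(check::isOk);
            }
            // invokeAll cancels every task that has not finished when the timeout expires.
            List<Future<Boolean>> futures = pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
            for (int i = 0; i < futures.size(); i++) {
                results.put(checks.get(i).name(), toStatus(futures.get(i)));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for checks", e);
        } finally {
            pool.shutdownNow();
        }
        return results;
    }

    private static Status toStatus(Future<Boolean> future) throws InterruptedException {
        if (future.isCancelled()) {
            return Status.TIMED_OUT;
        }
        try {
            return future.get() ? Status.OK : Status.FAILED;
        } catch (ExecutionException e) {
            return Status.FAILED; // a check that throws counts as failed
        }
    }

    /** Demo helper: a check that sleeps for sleepMs and then reports the given outcome. */
    static Check check(String name, long sleepMs, boolean ok) {
        return new Check() {
            @Override public String name() { return name; }
            @Override public boolean isOk() throws InterruptedException {
                Thread.sleep(sleepMs);
                return ok;
            }
        };
    }

    public static void main(String[] args) {
        List<Check> checks = List.of(
                check("fast", 0, true),
                check("broken", 0, false),
                check("slow", 5_000, true));
        System.out.println(runAll(checks, 300));
    }
}
```

With a sequential loop and no timeout, the "slow" check above would hold back the whole response for five seconds, which is exactly what a load balancer probe cannot afford.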

If we don't move health checks to Felix
* developers for the platform will be confused about the two SPI interfaces HealthCheck [7] and SystemReadyCheck [8], there will be many unnecessary discussions around when to use which one
* to make systemready ready for production usage a lot of code would have to be copied from Sling Health Checks (or if not copied, reinvented)

There were two suggestions around the move:
* bridging the two worlds: generally this is not KISS (unnecessary code and bugs). For the direction HC->systemready it does not make sense (HCs are much more feature rich, no need to bridge to other frameworks); for the direction systemready->HC it would mean having a dependency from Felix to Sling (the lower level framework should not depend on the higher level framework)
* HCs have some dependencies on other Sling modules: as Oliver correctly confirmed, these are really easy to remove as both Scheduler and Threads are *Commons* modules

So given all the above, we decided during the hackathon (~12 committers of both projects were present) that the only reasonable solution is to move Sling HCs to Felix (leaving a simple bridge on the Sling side to keep supporting the SPI interface [7]).
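
To illustrate how small such a bridge on the Sling side could be, here is a sketch. All types in it are simplified stand-ins that I made up for this mail: SlingStyleCheck only mimics the shape of the existing SPI [7] (the real Result type is richer), and FelixStyleCheck is a hypothetical shape for the interface after the move, not an existing Felix API.

```java
class HealthCheckBridgeSketch {

    /** Simplified stand-in for the existing Sling-side SPI [7]; the real Result type is richer. */
    interface SlingStyleCheck {
        SlingStyleResult execute();
    }

    static final class SlingStyleResult {
        final boolean ok;
        final String message;

        SlingStyleResult(boolean ok, String message) {
            this.ok = ok;
            this.message = message;
        }
    }

    /** Hypothetical shape of the SPI after the move to Felix; all names are assumptions. */
    interface FelixStyleCheck {
        FelixStyleResult execute();
    }

    enum FelixStyleStatus { OK, CRITICAL }

    static final class FelixStyleResult {
        final FelixStyleStatus status;
        final String message;

        FelixStyleResult(FelixStyleStatus status, String message) {
            this.status = status;
            this.message = message;
        }
    }

    /** Wraps a check written against the old SPI so it can be used through the new one. */
    static FelixStyleCheck bridge(SlingStyleCheck legacy) {
        return () -> {
            SlingStyleResult r = legacy.execute();
            FelixStyleStatus status = r.ok ? FelixStyleStatus.OK : FelixStyleStatus.CRITICAL;
            return new FelixStyleResult(status, r.message);
        };
    }

    public static void main(String[] args) {
        SlingStyleCheck legacy = () -> new SlingStyleResult(false, "repository not reachable");
        FelixStyleResult result = bridge(legacy).execute();
        System.out.println(result.status + ": " + result.message);
    }
}
```

The dependency only points from the Sling-side bridge to the moved API, so existing checks keep working without Felix having to know anything about Sling.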

Andrei and Christian, let's join forces to make Health Checks better as soon as they are moved to Felix (instead of wasting time getting systemready to the same maturity by solving the same problems again)!

Best Regards
Georg

[1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html
[2] https://cwiki.apache.org/confluence/display/SLING/Health+Checks+Executor+Design#HealthChecksExecutorDesign-C)HTTPfront-end,machineclient
[3] https://issues.apache.org/jira/browse/SLING-5874 - "Health Check Executor unnecessarily wastes 50ms"
[4] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#configuring-the-health-check-executor
[5] https://adapt.to/content/dam/adaptto/production/presentations/2014/adaptTo2014-Sling-Health-Checks-New-And-Noteworthy-Georg-Henzler.pdf/_jcr_content/renditions/original.media_file.download_attachment.file/adaptTo2014-Sling-Health-Checks-New-And-Noteworthy-Georg-Henzler.pdf
[6] https://github.com/apache/felix/blob/783666338c69ab1395f6ca62810d029aa48e70d5/systemready/src/main/java/org/apache/felix/systemready/impl/SystemReadyMonitorImpl.java#L93
    https://github.com/apache/felix/blob/783666338c69ab1395f6ca62810d029aa48e70d5/systemready/src/main/java/org/apache/felix/systemready/impl/SystemReadyMonitorImpl.java#L128
[7] https://github.com/apache/sling-org-apache-sling-hc-api/blob/6349d1fe285659dbeb741ab4f1bff664661e1b6f/src/main/java/org/apache/sling/hc/api/HealthCheck.java#L35
[8] https://github.com/apache/felix/blob/783666338c69ab1395f6ca62810d029aa48e70d5/systemready/src/main/java/org/apache/felix/systemready/SystemReadyCheck.java#L30


On 2018-09-21 13:41, Oliver Lietz wrote:
On Wednesday 19 September 2018 10:36:24 Andrei Dulvac wrote:
Hi guys.

Hi,

So first of all I acknowledge that conceptually there is some overlap - the
concepts of health, readiness, liveness themselves overlap.
When we wrote systemready, we did know of the Sling HCs (at least I did) and how they're used. And that was one of the reasons why we decided not to
use them.

They're currently used, as Justin put it, for a much broader scope. A
system can fail a HC and it doesn't mean it's not ready. In one of
Bertrand's adapt.to presentations from 2013 [0], a security checklist is mentioned explicitly - which we use in AEM. It's one of those things that
requires manual input to turn healthy. The documentation also mentions a lot of
stuff, including using them as server-side JUnit tests [1]. And all those
things are great.

> Now the system readyness framework was mostly created to have something
> on Felix level and the capabilities of the Sling Health Checks weren’t
> known.

Not entirely accurate. We knew of the sling HCs and initially we wanted to donate systemready to sling; but it's definitely good it went into felix.

> The dependencies of Sling HC to Sling are minimal today already: It’s
> Sling thread pool (a felix pendant or just a plain java one can be used)
> and Sling Scheduler (also this can easily be replaced by the standard java
> mechanism).

In my opinion, that's A LOT. And they're prefixed by "Sling-". Systemready
has two dependencies: javax.servlet and the osgi API. And it can
technically run on any framework. The deps were another reason why we didn't
use the HCs. Of course, those might grow as it becomes more mature.

both Scheduler and Threads are *Commons* modules and can run without Sling.

Regards,
O.

> What would make sense is a bridge where a subset of health checks could
> be fed into the readyness framework (i.e. if these X health checks pass,
> the system is considered "ready" and/or "alive").

> (you just create two tags for readiness and liveness each).

These don't seem to contradict each other.
Stefan, did you mean that the SystemReady checks would also become some tagged HCs or the other way around? That some tagged HCs would be fed into
systemready?

So I'm game for unifying a bit at the felix level and hopefully we don't go overboard. I alone just don't have a solution yet that I can say I love
100%.

BTW, Sorry I couldn't make it to the hackathon, it would have been great to
be part of the discussion.

- Andrei




---
[0] https://adapt.to/2013/en/schedule/18_healthcheck.html
[1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#health-checks-as-server-side-junit-tests


On Wed, Sep 19, 2018 at 1:15 AM Justin Edelson <jus...@justinedelson.com> wrote:
> Hi Georg,
> Great. It looks like I misread Stefan's notes as being more dramatic than
> they actually were intended to be :)
>
> Regards,
> Justin
>
> On Tue, Sep 18, 2018 at 4:48 PM Georg Henzler <slin...@ghenzler.de> wrote:
> > Hi Justin,
> >
> > there was quite some discussion at adaptTo() around this topic already. So
> > as it stands all requirements to run Sling-based applications in Kubernetes
> > are met already by Sling Health Checks (you just create two tags for
> > readiness and liveness each). HCs were developed from the first day with
> > the goal to have them used by load balancers (and not only manual). Also
> > Sling HCs are more mature in terms of parallel execution, timeout handling,
> > response customizing and special handling like asynchronous checks.
> >
> >
> > Now the system readyness framework was mostly created to have something on
> > Felix level and the capabilities of the Sling Health Checks weren’t known.
> > I do agree that it would make sense to have it on Felix level though (more
> > visible to the non-Sling world, as a low level mechanism maybe best located
> > at the lowest framework level). The dependencies of Sling HC to Sling are
> > minimal today already: It’s Sling thread pool (a felix pendant or just a
> > plain java one can be used) and Sling Scheduler (also this can easily be
> > replaced by the standard java mechanism).
> >
> > > It might make more sense to invert this and identify what the readyness
> > > framework does (mostly in its OOTB checks and servlets)
> > > and merge that functionality into Sling Health Checks and then move Sling
> > > Health Checks (or solid chunks of it) to Felix.
> >
> > This was the intention, but let’s wait for the feedback from Andrei and
> > Christian.
> >
> > -Georg
> >
> > Sent from my iPhone
> >
> > > On 18. Sep 2018, at 16:31, Justin Edelson <jus...@justinedelson.com> wrote:
> > > Hi,
> > > After reviewing the presentation, this seems like kind of a stretch to me.
> > > IIUC, the System Readyness Framework is (as its name would suggest) solely
> > > concerned with "readyness" and "liveness" (as seen in the example use
> > > cases on slide 3) and the API is explicitly designed for this purpose
> > > without any opportunity for namespace extension (i.e. you can extend how
> > > "readyness" and "liveness" are determined but you can't add new
> > > categories). Sling Health Checks is concerned with a broader concept of
> > > "health" with no restrictions on namespacing. There are all kinds of
> > > reasons why a system may be considered "ready" but still fails specific
> > > health checks. In other words, I'm doubtful that there is an overlap here
> > > at a framework level. What would make sense is a bridge where a subset of
> > > health checks could be fed into the readyness framework (i.e. if these X
> > > health checks pass, the system is considered "ready" and/or "alive"). But
> > > I'd strongly suggest that the gamut of expression possible with the health
> > > check framework goes far beyond the scope of what the readyness framework
> > > is designed to do. It might make more sense to invert this and identify
> > > what the readyness framework does (mostly in its OOTB checks and servlets)
> > > and merge that functionality into Sling Health Checks and then move Sling
> > > Health Checks (or solid chunks of it) to Felix.
> > >
> > > Or perhaps I've misunderstood the intention of this email/F2F discussion.
> > > But the way this looks is that we are going to take something with a decent
> > > install base and replace it with something a few months old and a much
> > > smaller functional scope. Just doesn't make sense to me.
> > >
> > > Regards,
> > > Justin
> > >
> > > On Thu, Sep 13, 2018 at 1:03 PM Stefan Seifert <sseif...@pro-vision.de>
> > > wrote:
> > >> - currently there is some overlap between sling health checks and the new
> > >> felix system readyness framework presented [1]
> > >> - the idea is to bring this together within felix
> > >> - provide a facade for the sling healthcheck API for backwards
> > >> compatibility
> > >>
> > >> stefan
> > >>
> > >> [1] https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
