Re: [hackathon] health checks

Andrei Dulvac Mon, 24 Sep 2018 03:24:22 -0700

Hi Georg, all.

Please check my answers inline.


On Sun, Sep 23, 2018 at 11:45 AM Georg Henzler <slin...@ghenzler.de> wrote:

> Hi,
>
> the official health check documentation [1] clearly states "Load
> balancers can query the health of a sling instance and decide to take it
> out or back into the list of used backends automatically" in use cases.
>

My problem is it's not really used like that at all, at least not in any
AEM deployment that I know of. _Because_ it does more than just that. For
example, there are ootb health checks in AEM [9] and they are NOT used, to my
knowledge, for the load balancer use case.


> This was designed like that in 2013 (see [2] for the design on the wiki
> page) and works in many production deployments today.


This is not about AEM, but I don't know of other production systems,
definitely don't know of any where sling HCs are used for the LB directing
traffic from or away from an instance where the checks fail. You say there
are many. Could you share some examples? This would help me appreciate the
issue more.


> Health checks are
> highly optimised for exactly the load balancer use case, see e.g. [3].
>

I understand the optimization part, and while systemready might clearly
need some optimizations, I personally don't see it as a reasonable concern.
Kubernetes, for example, retries the liveness and readiness checks a few
times before deciding to act. Do 50ms actually matter here?
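
To illustrate what I mean by retrying before acting, here is a toy sketch
of a consecutive-failure threshold, in the spirit of the failureThreshold
setting on Kubernetes probes. FailureThreshold and record() are names I
made up for this mail; this is not Kubernetes, systemready or HC code.

```java
// Hypothetical sketch: the class and method names are made up for illustration.
final class FailureThreshold {
    private final int threshold;
    private int consecutiveFailures;

    FailureThreshold(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records one probe result; returns true once the caller should act. */
    boolean record(boolean probeSucceeded) {
        // A single success resets the counter, so only an unbroken run of
        // failures reaches the threshold.
        consecutiveFailures = probeSucceeded ? 0 : consecutiveFailures + 1;
        return consecutiveFailures >= threshold;
    }
}
```

With a threshold of 3 and a probe every few seconds, the time to react is
dominated by the probe period, not by a few tens of milliseconds inside a
single check.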


>
> Now systemready as it stands today is missing the optimisation that the
> health checks went through (see [4] and slides 24-28 in adaptTo
> presentation 2014 [5]). There is no parallel execution, there is no
> timeout handling.


That is true and I agree those optimizations might be needed. Timeouts are
trivial; parallel execution not necessarily so. It is clear that HCs are
way ahead here.
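
For what it's worth, a minimal sketch of what "parallel with a timeout"
could look like using only java.util.concurrent. ParallelChecks and
allPass() are hypothetical names; this is neither the systemready code nor
the HC executor, and it assumes a check can be modelled as a
Callable<Boolean>.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names exist in systemready or the Sling HC API.
final class ParallelChecks {

    /** Runs all checks in parallel; true only if every check returned true in time. */
    static boolean allPass(List<Callable<Boolean>> checks, long timeoutMillis)
            throws InterruptedException {
        if (checks.isEmpty()) {
            return true;
        }
        ExecutorService pool = Executors.newFixedThreadPool(checks.size());
        try {
            // invokeAll blocks until all tasks finish or the timeout expires;
            // tasks that are still running at that point are cancelled.
            List<Future<Boolean>> results =
                    pool.invokeAll(checks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (Future<Boolean> result : results) {
                if (result.isCancelled()) {
                    return false; // the check timed out
                }
                try {
                    if (!Boolean.TRUE.equals(result.get())) {
                        return false; // the check reported a failure
                    }
                } catch (ExecutionException e) {
                    return false; // the check threw an exception
                }
            }
            return true;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

This only covers the easy part: cancellation merely interrupts the thread,
so a check that ignores interruption keeps running, and one thread per
check does not scale. That is the "not necessarily trivial" bit.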


> System ready ATM forces async and sequential execution
> [6].


Async is not an issue, IMO; sequential could be a limitation (same as above).


> There is no separate api bundle yet


True. Somebody needs to explain to me why that is a big deal for a very
small tool (I'm not that experienced with that matter)


> and no safeguards against badly
> tested custom checks that might take down the whole system.
>

Sigh. There is no safeguard anywhere around any custom bundle doing
System.exit(0) or any other thing taking down a whole system. Whoever is
packaging a top-level app or customizing a deployment needs to own and take
responsibility over what checks go in. Like any other bundle...

Yes, timeouts can make it easier to not _accidentally_ do something bad,
but then we're opening a new dimension of complexity. What happens if a
check times out? A. Do we ignore it? B. Do we fail it the same way as a
_normal_ failure? C. Do we fail it in a different way? And timeouts can be
sorted out easily - and I assume those are the safeguards you have in mind,
which you've mentioned already above - but if you mean other safeguards,
those increase the complexity.
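
To make options A, B and C concrete, a rough sketch with plain
java.util.concurrent. TimedCheck, Result and TimeoutPolicy are names I
invented for this mail, not anything in systemready or the Sling HC API,
and modelling a check as a Callable<Boolean> is my assumption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not the systemready or Sling HC API.
final class TimedCheck {
    enum Result { OK, FAILED, TIMED_OUT }

    // A: ignore the timeout, B: treat it as a normal failure, C: report it separately.
    enum TimeoutPolicy { IGNORE, FAIL, REPORT_SEPARATELY }

    static Result run(Callable<Boolean> check, long timeoutMillis, TimeoutPolicy policy)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> future = executor.submit(check);
        try {
            boolean passed = Boolean.TRUE.equals(
                    future.get(timeoutMillis, TimeUnit.MILLISECONDS));
            return passed ? Result.OK : Result.FAILED;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the check that ran too long
            switch (policy) {
                case IGNORE: return Result.OK;
                case FAIL:   return Result.FAILED;
                default:     return Result.TIMED_OUT;
            }
        } catch (ExecutionException e) {
            return Result.FAILED; // the check itself threw
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The code is short either way; the point is that somebody has to pick the
policy, and every consumer of the result then has to know about it.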

>
> If we don't move health checks to Felix
> * developers for the platform will be confused about the two SPI
> interfaces HealthCheck [7] and SystemReadyCheck [8], there will be many
> unnecessary discussions around when to use which one
>

I don't really agree with the reasoning. If we do make a bridge, they are
layered and we can keep options open. But I actually agree about moving
them to Felix, for slightly different reasons: exposure and decoupling.


> * to make systemready ready for production usage a lot of code would
> have to be copied from Sling Health Checks (or if not copied,
> reinvented)
>

It's being used in AEM already (alpha, beta). What dangerously complex
production setup do you have in mind where we would need to write a lot of
code and we absolutely can't use a simple tool, where its limitations are
arguably not important? And again, my question stands: what are the many
production envs where HCs are used _for the LB usecase_?

>
> There were two suggestions around the move:
> * bridging the two worlds: generally this is not KISS (unnecessary code
> and bugs) and for the direction HC->systemready it does not make sense
> (HCs are much more feature rich, no need to bridge to other frameworks),
>

I respectfully disagree about the KISS part - if anything systemready is
KISS - as simple as possible, disregarding limitations that don't matter
for the single usecase it covers. But I actually agree a bridge per se is
not an ideal solution.


> for the direction systemready->HC this would mean having a dependency
> from Felix to Sling (the lower level framework should not depend on the
> higher level framework)
>

Clearly. But we'd be moving it to felix, right? Anyways, I agree that a
more constrained tool shouldn't depend on a more complex one if simplicity
is a desirable state.


> * HCs have some dependencies on other sling modules: As Oliver correctly
> confirmed, this is really easy to remove as both Scheduler and Threads
> are *Commons* modules
>

Cool! :) I think that's a great thing to decouple.

>
> So given all the above, we decided during the hackathon (~12 committers
> of both projects were present)


I so wish I'd made it! Argh :) I assume you meant that the consensus _in
the hackathon_ was to propose this to the list as a good solution and reach
a decision *on the list*. Right? ;)


> that the only reasonable solution is to
> move Sling HCs to Felix (leaving a simple bridge on Sling side to keep
> supporting the SPI interface [7]).
>

I fully support that! But what Stefan was saying doesn't match what you're
proposing and what you're proposing is not part of the -decision- consensus
you reached during the hackathon. Or did I misunderstand?


>
> Andrei and Christian, let's join forces to make Health Checks better as
> soon as they are moved to Felix!
> (instead of wasting time getting systemready to the same maturity
> solving the same problems again)
>

I dislike the conviction that systemready is not system ready (see what I
did there? ;) ) and that we need to solve the same problems as with HCs.
BTW, I'm not naive. I know there is some overlap and some things that were
solved in HCs would need to be solved in systemready. But the more
important thing is that some things that were solved in HCs do NOT need to
be solved in systemready.

That being said, I personally support moving Sling HCs to felix and I do
agree it'd be slightly confusing if we had two similar mechanisms (sorry to
ask 3 times, but I'm really curious about the LB usecase for HCs) so I agree
about *discussing* the best way to merge. And this will sound weird, but
wouldn't it make more sense to have the Sling HCs codebase *extend*
systemready? The move to felix comes as a new module, effectively - there
are clearly dependencies and namespace changes. We in AEM already use
systemready (small and not fully mature as it may be). Can we carefully
discuss what goodies from HCs would fit how and maybe even keep them as
different bundles under the same umbrella and common API? A bridge at the
felix level, as you said, wouldn't make much sense. There will be a bridge
already between what goes into felix and the Sling HCs in sling.

I don't really care about the ownership and the naming (for any other
reasons than what makes sense); we wrote systemready because we wanted to
solve one problem in a very simple way. And all the features and the ways
that Sling HCs could be used meant for me it wasn't usable for the LB
usecase, in particular. It certainly doesn't work in AEM without changes -
as it was discussed, labelling Sling HCs.

Can we make sure we bring those together in a way that keeps the simplicity
for the original usecase of systemready? Please? Whatever the parameters of
the merge. And can we seriously discuss the idea of having HCs as an
extension of systemready?

Christian, what's your input?

Best,
- Andrei

>
> Best Regards
> Georg
>


[9] https://helpx.adobe.com/experience-manager/6-3/sites/administering/using/operations-dashboard.html



>
> [1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html
> [2] https://cwiki.apache.org/confluence/display/SLING/Health+Checks+Executor+Design#HealthChecksExecutorDesign-C)HTTPfront-end,machineclient
> [3] https://issues.apache.org/jira/browse/SLING-5874 - "Health Check
>     Executor unnecessarily wastes 50ms"
> [4] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#configuring-the-health-check-executor
> [5] https://adapt.to/content/dam/adaptto/production/presentations/2014/adaptTo2014-Sling-Health-Checks-New-And-Noteworthy-Georg-Henzler.pdf/_jcr_content/renditions/original.media_file.download_attachment.file/adaptTo2014-Sling-Health-Checks-New-And-Noteworthy-Georg-Henzler.pdf
> [6] https://github.com/apache/felix/blob/783666338c69ab1395f6ca62810d029aa48e70d5/systemready/src/main/java/org/apache/felix/systemready/impl/SystemReadyMonitorImpl.java#L93
>     https://github.com/apache/felix/blob/783666338c69ab1395f6ca62810d029aa48e70d5/systemready/src/main/java/org/apache/felix/systemready/impl/SystemReadyMonitorImpl.java#L128
> [7] https://github.com/apache/sling-org-apache-sling-hc-api/blob/6349d1fe285659dbeb741ab4f1bff664661e1b6f/src/main/java/org/apache/sling/hc/api/HealthCheck.java#L35
> [8] https://github.com/apache/felix/blob/783666338c69ab1395f6ca62810d029aa48e70d5/systemready/src/main/java/org/apache/felix/systemready/SystemReadyCheck.java#L30
>
>
> On 2018-09-21 13:41, Oliver Lietz wrote:
> > On Wednesday 19 September 2018 10:36:24 Andrei Dulvac wrote:
> >> Hi guys.
> >
> > Hi,
> >
> >> So first of all I acknowledge that conceptually there is some overlap -
> >> the concepts of health, readiness, liveness themselves overlap.
> >> When we wrote systemready, we did know of the Sling HCs (at least I did)
> >> and how they're used. And that was one of the reasons why we decided not
> >> to use them.
> >>
> >> They're currently used, as Justin put it, for a much broader scope. A
> >> system can fail a HC and it doesn't mean it's not ready. In one of
> >> Bertrand's adapt.to presentations from 2013 [0], a security checklist is
> >> mentioned explicitly - which we use in AEM. It's one of those things that
> >> requires manual input to turn healthy. The docu also mentions a lot of
> >> stuff, including using them as serverside Junit tests [1]. And all those
> >> things are great.
> >>
> >> > Now the system readyness framework was mostly created to have something
> >> > on Felix level and the capabilities of the Sling Health Checks weren’t
> >> > known.
> >>
> >> Not entirely accurate. We knew of the sling HCs and initially we wanted
> >> to donate systemready to sling; but it's definitely good it went into
> >> felix.
> >>
> >> > The dependencies of Sling HC to Sling are minimal today already: It’s
> >> > Sling thread pool (a felix pendant or just a plain java one can be used)
> >> > and Sling Scheduler (also this can easily be replaced by the standard
> >> > java mechanism).
> >>
> >> In my opinion, that's A LOT. And they're prefixed by "Sling-". Systemready
> >> has two dependencies: javax.servlet and the osgi API. And it can
> >> technically run on any framework. The deps were another reason why we
> >> didn't use the HCs. Of course, those might grow as it becomes more mature.
> >
> > both Scheduler and Threads are *Commons* modules and can run without
> > Sling.
> >
> > Regards,
> > O.
> >
> >> > What would make sense is a bridge where a subset of health checks could
> >> > be fed into the readyness framework (i.e. if these X health checks pass,
> >> > the system is considered "ready" and/or "alive").
> >>
> >> > (you just create two tags for readiness and liveness each).
> >>
> >> These don't seem to contradict each other.
> >> Stefan, did you mean that the SystemReady checks would also become some
> >> tagged HCs or the other way around? That some tagged HCs would be fed
> >> into systemready?
> >>
> >> So I'm game for unifying a bit at the felix level and hopefully we don't
> >> go overboard. I alone just don't have a solution yet that I can say I
> >> love 100%.
> >>
> >> BTW, Sorry I couldn't make it to the hackathon, it would have been great
> >> to be part of the discussion.
> >>
> >> - Andrei
> >>
> >>
> >>
> >>
> >> ---
> >> [0] https://adapt.to/2013/en/schedule/18_healthcheck.html
> >> [1]
> >> https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#health-checks-as-server-side-junit-tests
> >>
> >>
> >> On Wed, Sep 19, 2018 at 1:15 AM Justin Edelson
> >> <jus...@justinedelson.com> wrote:
> >> > Hi Georg,
> >> > Great. It looks like I misread Stefan's notes as being more dramatic
> >> > than they actually were intended to be :)
> >> >
> >> > Regards,
> >> > Justin
> >> >
> >> > On Tue, Sep 18, 2018 at 4:48 PM Georg Henzler <slin...@ghenzler.de> wrote:
> >> > > Hi Justin,
> >> > >
> >> > > there was quite some discussion at adaptTo() around this topic already.
> >> > > So as it stands all requirements to run Sling-based applications in
> >> > > Kubernetes are met already by Sling Health Checks (you just create two
> >> > > tags for readiness and liveness each). HCs were developed from the first
> >> > > day with the goal to have them used by load balancers (and not only
> >> > > manual). Also Sling HCs are more mature in terms of parallel execution,
> >> > > timeout handling, response customizing and special handling like
> >> > > asynchronous checks.
> >> > >
> >> > >
> >> > > Now the system readyness framework was mostly created to have something
> >> > > on Felix level and the capabilities of the Sling Health Checks weren’t
> >> > > known. I do agree that it would make sense to have it on Felix level
> >> > > though (more visible to the non-Sling world, as a low level mechanism
> >> > > maybe best located at the lowest framework level). The dependencies of
> >> > > Sling HC to Sling are minimal today already: It’s Sling thread pool (a
> >> > > felix pendant or just a plain java one can be used) and Sling Scheduler
> >> > > (also this can easily be replaced by the standard java mechanism).
> >> > >
> >> > > > It might make more sense to invert this and identify what the
> >> > > > readyness framework does (mostly in its OOTB checks and servlets)
> >> > > > and merge that functionality into Sling Health Checks and then move
> >> > > > Sling Health Checks (or solid chunks of it) to Felix.
> >> > >
> >> > > This was the intention, but let’s wait for the feedback from Andrei and
> >> > > Christian.
> >> > >
> >> > > -Georg
> >> > >
> >> > > Sent from my iPhone
> >> > >
> >> > > > On 18. Sep 2018, at 16:31, Justin Edelson <jus...@justinedelson.com>
> >> > > > wrote:
> >> > > > Hi,
> >> > > > After reviewing the presentation, this seems like kind of a stretch to
> >> > > > me. IIUC, the System Readyness Framework is (as its name would suggest)
> >> > > > solely concerned with "readyness" and "liveness" (as seen in the
> >> > > > example use cases on slide 3) and the API is explicitly designed for
> >> > > > this purpose without any opportunity for namespace extension (i.e. you
> >> > > > can extend how "readyness" and "liveness" are determined but you can't
> >> > > > add new categories). Sling Health Checks is concerned with a broader
> >> > > > concept of "health" with no restrictions on namespacing. There are all
> >> > > > kinds of reasons why a system may be considered "ready" but still fails
> >> > > > specific health checks. In other words, I'm doubtful that there is an
> >> > > > overlap here at a framework level. What would make sense is a bridge
> >> > > > where a subset of health checks could be fed into the readyness
> >> > > > framework (i.e. if these X health checks pass, the system is considered
> >> > > > "ready" and/or "alive"). But I'd strongly suggest that the gamut of
> >> > > > expression possible with the health check framework goes far beyond the
> >> > > > scope of what the readyness framework is designed to do. It might make
> >> > > > more sense to invert this and identify what the readyness framework
> >> > > > does (mostly in its OOTB checks and servlets) and merge that
> >> > > > functionality into Sling Health Checks and then move Sling Health
> >> > > > Checks (or solid chunks of it) to Felix.
> >> > > >
> >> > > > Or perhaps I've misunderstood the intention of this email/F2F
> >> > > > discussion. But the way this looks is that we are going to take
> >> > > > something with a decent install base and replace it with something a
> >> > > > few months old and a much smaller functional scope. Just doesn't make
> >> > > > sense to me.
> >> > > >
> >> > > > Regards,
> >> > > > Justin
> >> > > >
> >> > > > On Thu, Sep 13, 2018 at 1:03 PM Stefan Seifert <sseif...@pro-vision.de>
> >> > > > wrote:
> >> > > >> - currently there is some overlap between sling health checks and the
> >> > > >> new felix system readyness framework presented [1]
> >> > > >> - the idea is to bring this together within felix
> >> > > >> - provide a facade for the sling healthcheck API for backwards
> >> > > >> compatibility
> >> > > >>
> >> > > >> stefan
> >> > > >>
> >> > > >> [1]
> >> > > >> https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
>
