Re: [hackathon] health checks

Georg Henzler Mon, 24 Sep 2018 06:42:42 -0700

Hi Christian, hi Andrei,

after reading through the comments, the most important points (as a summary) first:

* Health Checks are already used by many deployments for load balancers, so that LBs do not have to be reconfigured manually during production deployments (I will not post a list of blue chip companies on an open source mailing list, though).

* I sense agreement to take Health Checks to Felix, this is good :). HCs are a proven technology that covers the exact same use case as systemready and is more mature (having been around for 5 years).

* HCs today are ready to be used with Kubernetes and ootb AEM: just configure the HC servlet [1] and define a tag (e.g. "systemready") by adding it to InactiveBundlesHealthCheck and any other checks you need for this. When using a composite nodestore setup with Docker, just add the OSGi configs for the servlet and the configs for the "tag amendments" (using prop "hc.tags") to the provisioning model - done. To ensure you get a 5xx response, just configure Kubernetes probes [2] with HTTP URLs like /system/health/systemready.txt?httpStatus=CRITICAL:503 (note that passing in query parameters for Kubernetes did not always work, but since 2016 it does [3]).

* We really have to make sure that we end up with exactly one SPI interface to provide checks. The current HC interface was discussed at length when we introduced it. There is a good reason why we don't have a getName() method and rather use the OSGi property "hc.name" (reconfigurability).

* Having had a close look at systemready and knowing HCs very well (having written a fair share of the code), I absolutely think it is necessary to start with the health check as a base and merge in ideas from systemready (and not the other way round) - this was also Justin Edelson's initial response to this thread.
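
To illustrate the httpStatus=CRITICAL:503 parameter in the probe URL above, here is a minimal Java sketch of how such a mapping could be evaluated. It is not the servlet's actual code. It assumes the statuses are ordered OK < WARN < CRITICAL < HEALTH_CHECK_ERROR and that a status without an explicit mapping inherits the code of the nearest less severe mapped status (200 if there is none). The class and method names are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

class HttpStatusMapping {
    // Assumed ordering, from least to most severe.
    enum Status { OK, WARN, CRITICAL, HEALTH_CHECK_ERROR }

    // Parses a value such as "CRITICAL:503" or "WARN:418,CRITICAL:503".
    static Map<Status, Integer> parse(String param) {
        Map<Status, Integer> explicit = new EnumMap<>(Status.class);
        if (param == null || param.trim().isEmpty()) {
            return explicit;
        }
        for (String pair : param.split(",")) {
            String[] parts = pair.trim().split(":");
            explicit.put(Status.valueOf(parts[0].trim()), Integer.parseInt(parts[1].trim()));
        }
        return explicit;
    }

    // Walks the statuses from least to most severe and keeps the last
    // explicitly mapped code seen up to and including the given status.
    static int httpCodeFor(Status status, Map<Status, Integer> explicit) {
        int code = 200; // assumed default when nothing is mapped
        for (Status s : Status.values()) {
            Integer mapped = explicit.get(s);
            if (mapped != null) {
                code = mapped;
            }
            if (s == status) {
                break;
            }
        }
        return code;
    }
}
```

Under these assumptions a WARN result still answers with 200, so a Kubernetes probe only fails once the aggregated result for the tag is CRITICAL or worse.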


I will answer all other questions below [4].

Best Regards
Georg

[1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#health-check-servlet

[2] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

[3] https://github.com/kubernetes/kubernetes/pull/25064


[4]

Sling health checks support the concept of tags which allows to configure the special meaning of readiness and liveness as tags. So I think technically the HC framework should be able to cover our case too.


exactly

So I would like to extend the felix systemready project to learn from sling hc and add some of the features there too. I think the most important things are tags and a solid model for executors. I would be happy about any help with this from the sling community side.

We really have to start with the Sling HC code and merge in system ready aspects

Another question is if we want to add felix systemready to the sling distro at some point. Would the sling community be interested in this?

yes, the felix health check should be added to the distro in the same way as the Sling HC is today

... there are ootb healthchecks in AEM [9] and they are NOT used, to my knowledge, for the load balancer use case.

you rarely run all checks, you always run the checks for a particular tag you are interested in
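
As a rough illustration of running only the checks for one tag, the following Java sketch filters a set of checks by tag. It is a simplified model, not the Sling HC API: the map stands in for the "hc.tags" property of each registered check, and the class name, method name and example tag values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TagSelection {
    // Returns the names of all checks whose tag list contains the requested tag,
    // in the iteration order of the given map.
    static List<String> selectByTag(Map<String, List<String>> tagsByCheck, String tag) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : tagsByCheck.entrySet()) {
            if (entry.getValue().contains(tag)) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}
```

A request for the tag "systemready" would then execute only the checks carrying that tag and leave all others untouched.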

sling HCs are used for the LB directing traffic... You say there are many. Could you share some examples?

Commonly used for production deployments (I work for an integration partner, we use this for all projects across all clients, but others use it as well, as there were many talks at conferences about it)

I understand the optimization part, and while systemready might clearly need some optimizations, I personally don't see it as a reasonable concern. Kubernetes, for example, retries the liveness and readiness checks a few times before deciding to act. Do 50ms actually matter here?

yes, 50ms matter (I created this issue after an operations department refused to use this because of its bad performance; since it was fixed they have been happy :)

parallel execution is not necessarily

From my 5 years of experience: parallel execution is absolutely necessary, otherwise response times get too long.
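
A minimal Java sketch of the idea, using only java.util.concurrent: all checks are submitted at once, so the response time is bounded by the slowest check rather than by the sum of all checks. This is not the Sling HC executor. The class, the reduced Status enum and the method names are hypothetical, and a check that throws simply propagates its exception here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelChecks {
    // Assumed ordering, from least to most severe.
    enum Status { OK, WARN, CRITICAL }

    // Runs all checks concurrently and returns name -> status in input order.
    static Map<String, Status> runAll(Map<String, Callable<Status>> checks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, checks.size()));
        try {
            List<String> names = new ArrayList<>(checks.keySet());
            // invokeAll blocks until every check has completed and returns
            // the futures in the same order as the submitted collection.
            List<Future<Status>> futures = pool.invokeAll(checks.values());
            Map<String, Status> results = new LinkedHashMap<>();
            for (int i = 0; i < names.size(); i++) {
                results.put(names.get(i), futures.get(i).get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // The overall status is the most severe individual status (OK if empty).
    static Status aggregate(Map<String, Status> results) {
        Status worst = Status.OK;
        for (Status s : results.values()) {
            if (s.ordinal() > worst.ordinal()) {
                worst = s;
            }
        }
        return worst;
    }
}
```

With, say, ten checks of 50ms each, sequential execution would need about 500ms, while this sketch returns after roughly 50ms plus scheduling overhead.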

Async is not an issue

I don't think async is ideal for what you are doing at the moment (the current systemready impl with default config possibly delays the correct result for 5 sec, this is not a good idea IMHO)

There is no separate api bundle yet
True. Somebody needs to explain to me why that is a big deal for a very small tool (I'm not that experienced with that matter)


See SLING-6773

Yes, timeouts can make it easier to not _accidentally_ do something bad, but then we're opening a new dimension of complexity. What happens if a check times out?...

We have discussed this in detail some years ago, we have a good solution (WARN by default, CRITICAL after a configurable time). Note: In the HC world you don't take instances offline for WARN, only for CRITICAL.
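
The policy can be sketched in a few lines of Java. This is only an illustration of the rule stated above, not the executor's implementation: the class name, the method names and the parameter criticalAfterMs (standing in for the configurable time) are hypothetical.

```java
class TimeoutPolicy {
    enum Status { OK, WARN, CRITICAL }

    // Status reported for a check that has timed out and is still running
    // after elapsedMs. criticalAfterMs is a hypothetical configurable threshold.
    static Status statusForTimedOutCheck(long elapsedMs, long criticalAfterMs) {
        return elapsedMs >= criticalAfterMs ? Status.CRITICAL : Status.WARN;
    }

    // Only CRITICAL takes an instance offline; WARN keeps it in rotation.
    static boolean takeOffline(Status status) {
        return status == Status.CRITICAL;
    }
}
```

A check that is merely slow therefore does not remove the instance from the load balancer right away; only a check that stays stuck beyond the configured time does.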

...developers for the platform will be confused about the two SPI interfaces HealthCheck [7] and SystemReadyCheck [8], there will be many unnecessary discussions around when to use which one
I don't really agree about the reasoning. If we do make a bridge, they are layered and we can keep options open.

Please no bridge and no duplicate SPI interface! What option would you keep open? I cannot think of anything. Please note that the functional scope of systemready is fully covered by HCs. The AEM platform has suffered numerous times from the "too many options problem" - I work at a service provider and know exactly how much time is completely wasted by people discussing all these different options. Please note the problem is at scale: It will affect thousands of developers!

But I actually agree about moving them to Felix, for slightly different reasons, which is exposure and decoupling.


great :)

It's being used in AEM already (alpha, beta).

I think you should try using ootb health checks as described at the top of this email.

I respectfully disagree about the KISS part - if anything systemready is KISS - as simple as possible, disregarding limitations that don't matter for the single usecase it covers. But I actually agree a bridge per se is not an ideal solution.

Bridges are not KISS but ugly (extra code, hard to understand/troubleshoot, extra bugs). For systemready being KISS: yes it's easy, but it does not help being KISS while disregarding some important parts. HCs are KISS in a way that they solve the problem in the easiest possible way (I believe).

But what Stefan was saying doesn't match what you're proposing and what you're proposing is not part of the -decision- consensus you reached during the hackathon. Or did I misunderstand?

Stefan's wording maybe wasn't perfect. But the agreement at the Hackathon was to move Sling HC to Felix and merge useful things from systemready in using Sling HCs as base.

wouldn't it make more sense to have the Sling HCs codebase *extend* systemready?

This won't work. The health check executor is the heart of it (with all the handling we've discussed) and needs to be taken as base.

there will be a bridge already between what goes into felix and the Sling HCs in sling

only a temporary bridge with very simple impl and a deprecated SPI. Responsibility will be clearly moved to the felix health check module.




On 2018-09-24 12:05, Christian Schneider wrote:

I discussed with Stefan and Georg at adaptto about sling hc and felix
systemready.

For me the main advantage of systemready being at felix is that it attracts
a lot more people / projects than a sling subproject. People outside the
sling community simply do not use parts of sling for other purposes.
One example of this is that Kai Kreuzer from Openhab approached me to
discuss how systemready could fit for openhab. We will also discuss with
Peter Kriens at Eclipsecon how the aggregate state service overlaps with
systemready. So I think actually sling hc would have been a good case for
bringing to felix from the start.
So I would like to extend the felix systemready project to learn from sling
hc and add some of the features there too. I think the most important things
are tags and a solid model for executors. I would be happy about any help
with this from the sling community side.

As some people already use sling hc with load balancers I think it also
makes sense to allow to reuse sling health checks in system ready.
Another question is if we want to add felix systemready to the sling distro
at some point. Would the sling community be interested in this?

Christian


On Thu, 13 Sep 2018 at 19:03, Stefan Seifert <
[email protected]> wrote:
- currently there is some overlap between sling health checks and the new
felix system readiness framework presented [1]
- the idea is to bring this together within felix
- provide a facade for the sling healthcheck API for backwards
compatibility

stefan

[1]
https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
--
