Hi Christian, hi Andrei,

after reading through the comments, the most important points (as a summary) first:

* Health Checks are already used by many deployments for load balancers, so that LBs do not have to be reconfigured manually during production deployments (I will not post a list of blue-chip companies on an open source mailing list, though).

* I sense agreement to take Health Checks to Felix, this is good :). HCs are a proven technology that covers the exact same use case as systemready and is more mature (having been around for 5 years).

* HCs today are ready to be used with Kubernetes and ootb AEM: just configure the HC servlet [1] and define a tag (e.g. "systemready") by adding it to InactiveBundlesHealthCheck and to any other checks you need for this purpose. When using a composite nodestore setup with Docker, just add the OSGi configs for the servlet and the configs for the "tag amendments" (using prop "hc.tags") to the provisioning model - done. To ensure you get a 5xx response, just configure Kubernetes probes [2] with HTTP URLs like /system/health/systemready.txt?httpStatus=CRITICAL:503 (note that passing in query parameters for Kubernetes did not always work, but since 2016 it does [3]).

* We really have to make sure that we end up with exactly one SPI interface to provide checks. The current HC interface was discussed at length when we introduced it. There is a good reason why we don't have a getName() method and rather use the OSGi property "hc.name" (reconfigurability).

* Having had a close look at systemready and knowing HCs very well (having written a fair share of the code), I absolutely think it is necessary to start with the health check as a base and merge in ideas from systemready (and not the other way round) - this was also Justin Edelson's initial response to this thread.
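
To illustrate the probe setup from the third point, here is a minimal Java sketch of how a client could build the servlet URL and interpret the response. The class name, method names and the host are hypothetical and not part of Sling, Felix or Kubernetes; only the URL pattern is taken from the text above, and the 200-399 success range is how Kubernetes HTTP probes treat status codes.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of any Sling or Kubernetes API.
final class ReadinessProbeSketch {

    // Builds the HC servlet URL for a tag and asks for HTTP 503 when the result is CRITICAL.
    static URI probeUri(String baseUrl, String tag) {
        return URI.create(baseUrl + "/system/health/" + tag + ".txt?httpStatus=CRITICAL:503");
    }

    // An HTTP probe passes for status codes from 200 up to (but excluding) 400.
    static boolean isReady(int httpStatus) {
        return httpStatus >= 200 && httpStatus < 400;
    }
}
```

With the "systemready" tag, a CRITICAL result is then reported as 503 and the probe fails, while any non-CRITICAL result keeps the instance in rotation.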

I will answer all other questions below [4].

Best Regards
Georg

[1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#health-check-servlet
[2] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
[3] https://github.com/kubernetes/kubernetes/pull/25064


[4]

Sling health checks support the concept of tags which allows to configure the special meaning of readiness and liveness as tags. So I think technically the HC framework should be able to cover our case too.

exactly

So I would like to extend to felix systemready project to learn from sling hc and add some of the features there too. I think the most important thing are tags and a solid model for executors. I would be happy about any help with this from the sling community side.

We really have to start with the Sling HC code and merge in system ready aspects

Another question is if we want to add felix systemready to the sling distro at some point. Would the sling community be interested in this?

yes, the felix health check should be added to the distro in the same way as the Sling HC is today

... there are ootb healthchecks in AEM [9] and they are NOT used, to my knowledge, for the load balancer use case.

you rarely run all checks, you always run the checks for a particular tag you are interested in

sling HCs are used for the LB directing traffic... You say there are many. Could you share some examples?

Commonly used for production deployments (I work for an integration partner, we use this for all projects across all clients, but others use it as well, as there were many talks at conferences about it)

I understand the optimization part, and while systemready might clearly need some optimizations, I personally don't see it as a reasonable concern. Kubernetes, for example, retries the liveness and readiness checks a few times before deciding to act. Do 50ms actually matter here?

yes, 50ms matter (I created this issue after an operations department refused to use this because of its bad performance; since it has been fixed they are happy :)

parallel execution is not necessarily

From my 5 years of experience: parallel execution is absolutely necessary, otherwise response times get too long.
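
A minimal Java sketch of the idea, assuming a simplified model: only the checks carrying the requested tag are run, they run in parallel, and the worst status wins. All types and names here are hypothetical stand-ins and not the Sling HC executor or its API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical sketch; none of these types are the Sling HC or Felix API.
final class ParallelChecksSketch {

    // Declared from best to worst so that compareTo() orders by severity.
    enum Status { OK, WARN, CRITICAL }

    // Stand-in for a check registered with a set of tags (cf. "hc.tags").
    static final class TaggedCheck {
        final Set<String> tags;
        final Supplier<Status> body;

        TaggedCheck(Set<String> tags, Supplier<Status> body) {
            this.tags = tags;
            this.body = body;
        }
    }

    // Runs only the checks carrying the given tag, in parallel, and returns the worst status.
    static Status runTag(List<TaggedCheck> checks, String tag) {
        List<Callable<Status>> tasks = new ArrayList<>();
        for (TaggedCheck check : checks) {
            if (check.tags.contains(tag)) {
                tasks.add(() -> check.body.get());
            }
        }
        if (tasks.isEmpty()) {
            return Status.OK; // assumption of this sketch: no matching checks, nothing to report
        }
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size());
        try {
            Status worst = Status.OK;
            // invokeAll blocks until every task has finished.
            for (Future<Status> future : pool.invokeAll(tasks)) {
                Status status = future.get();
                if (status.compareTo(worst) > 0) {
                    worst = status;
                }
            }
            return worst;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("check execution failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With one thread per selected check, the response time is roughly that of the slowest check rather than the sum of all checks, which is the point of running them in parallel.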

Async is not an issue

I don't think async is ideal for what you are doing at the moment (the current systemready impl with default config possibly delays the correct result for 5 sec, this is not a good idea IMHO)

There is no separate api bundle yet
True. Somebody needs to explain to me why that is a big deal for a very small tool (I'm not that experienced with that matter)

See SLING-6773

Yes, timeouts can make it easier to not _accidentally_ do something bad, but then we're opening a new dimension of complexity. What happens if a check times out?...

We have discussed this in detail some years ago, we have a good solution (WARN by default, CRITICAL after a configurable time). Note: In the HC world you don't take instances offline for WARN, only for CRITICAL.
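
The rule can be sketched in a few lines of Java. This is only an illustration of the policy described above, with hypothetical names and thresholds; it is not the actual executor code.

```java
import java.time.Duration;

// Hypothetical sketch of the timeout policy; names and thresholds are illustrative.
final class TimeoutPolicySketch {

    enum Status { OK, WARN, CRITICAL }

    // A check that has timed out is WARN by default and CRITICAL once it has
    // been running for at least the configured "critical after" time.
    static Status statusForTimedOutCheck(Duration runningFor, Duration criticalAfter) {
        return runningFor.compareTo(criticalAfter) >= 0 ? Status.CRITICAL : Status.WARN;
    }

    // Instances are taken offline for CRITICAL only, never for WARN.
    static boolean takeOffline(Status status) {
        return status == Status.CRITICAL;
    }
}
```

For example, with a "critical after" time of 10 seconds, a check that has been hanging for 3 seconds yields WARN and the instance stays online; after 10 seconds it yields CRITICAL and the instance is taken out of rotation.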

...developers for the platform will be confused about the two SPI interfaces HealthCheck [7] and SystemReadyCheck [8], there will be many unnecessary discussions around when to use which one
I don't really agree about the reasoning. If we do make a bridge, they are layered and we can keep options open.

Please no bridge and no duplicate SPI interface! What option would you keep open? I cannot think of any. Please note that the functional scope of systemready is fully covered by HCs. The AEM platform has suffered numerous times from the "too many options" problem - I work at a service provider and know exactly how much time is completely wasted by people discussing all these different options. Please note that the problem is one of scale: it will affect thousands of developers!

But I actually agree about moving them to Felix, for slightly different reasons, which is exposure and decoupling.

great :)

It's being used in AEM already (alpha, beta).

I think you should try using ootb health checks as described at the top of this email.

I respectfully disagree about the KISS part - if anything systemready is KISS - as simple as possible, disregarding limitations that don't matter for the single usecase it covers. But I actually agree a bridge per se is not an ideal solution.

Bridges are not KISS but ugly (extra code, extra bugs, hard to understand/troubleshoot). As for systemready being KISS: yes, it's simple, but being KISS does not help if it disregards some important parts. HCs are KISS in the sense that they solve the problem in the easiest possible way (I believe).

But what Stefan was saying doesn't match what you're proposing and what you're proposing is not part of the -decision- consensus you reached during the hackathon. Or did I misunderstand?

Stefan's wording maybe wasn't perfect. But the agreement at the Hackathon was to move Sling HC to Felix and merge useful things from systemready in using Sling HCs as base.

wouldn't it make more sense to have the Sling HCs codebase *extend* systemready?

This won't work. The health check executor is the heart of it (with all the handling we've discussed) and needs to be taken as base.

there will be a bridge already between what goes into felix and the Sling HCs in sling

only a temporary bridge with a very simple impl and a deprecated SPI. Responsibility will be clearly moved to the felix health check module.



On 2018-09-24 12:05, Christian Schneider wrote:
I discussed with Stefan and Georg at adaptto about sling hc and felix
systemready.



For me the main advantage of systemready being at felix is that it attracts a lot more people / projects than a sling subproject. People outside the sling community simply do not use parts of sling for other purposes. One example of this is that Kai Kreuzer from Openhab approached me to discuss how systemready could fit for openhab. We will also discuss with Peter Kriens at Eclipsecon how the aggregate state service overlaps with systemready. So I think actually sling hc would have been a good case for bringing to felix from the start.

So I would like to extend to felix systemready project to learn from sling hc and add some of the features there too. I think the most important thing are tags and a solid model for executors. I would be happy about any help
with this from the sling community side.

As some people already use sling hc with load balancers I think it also
makes sense to allow to reuse sling health checks in system ready.

Another question is if we want to add felix systemready to the sling distro
at some point. Would the sling community be interested in this?

Christian


On Thu, 13 Sep 2018 at 19:03, Stefan Seifert <sseif...@pro-vision.de> wrote:

- currently there is some overlap between sling health checks and the new
felix system readiness framework presented [1]
- the idea is to bring this together within felix
- provide a facade for the sling healthcheck API for backwards
compatibility

stefan

[1]
https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html



