Hi Christian, hi Andrei,
after reading through the comments, the most important points (as a
summary) first:
* Health Checks are already used by many deployments for load balancers
so that LBs do not have to be reconfigured manually during production
deployments (I will not post a list of blue chip companies on an open
source mailing list though).
* I sense agreement to take Health Checks to Felix, this is good :). HCs
are a proven technology that covers the exact same use case as
systemready and is more mature (having been around for 5 years).
* HCs today are ready to be used with Kubernetes and ootb AEM: just
configure the HC servlet [1] and define a tag (e.g. "systemready") by
adding it to InactiveBundlesHealthCheck and to any other checks you need
for this purpose. When using a composite nodestore setup with Docker,
just add the OSGi configs for the servlet and the configs for the "tag
amendments" (using prop "hc.tags") to the provisioning model - done. To
ensure you get a 5xx response, configure the Kubernetes probes [2] with
HTTP URLs like /system/health/systemready.txt?httpStatus=CRITICAL:503
(note that passing in query parameters for Kubernetes did not always
work, but since 2016 it does [3]).
* We really have to make sure that we end up with exactly one SPI
interface to provide checks. The current HC interface was discussed at
length when we introduced it. There is a good reason why we don't have
a getName() method and rather use the OSGi property "hc.name"
(reconfigurability).
* Having had a close look at systemready and knowing HCs very well
(having written a fair share of the code), I absolutely think it is
necessary to start with the health check as a base and merge in ideas
from systemready (and not the other way round) - this was also Justin
Edelson's initial response to this thread.
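As a rough sketch of the probe setup described in the third point, a
Kubernetes container spec fragment could look like the one below. Only
the URL path is taken from the text above; the port and the timing
values are assumptions for illustration. Kubernetes treats any HTTP
status of 200 or higher and below 400 as success, so the 503 returned
for CRITICAL marks the probe as failed.

```yaml
# Sketch only: port and timing values are assumptions, adjust to your setup.
readinessProbe:
  httpGet:
    # Runs all checks tagged "systemready"; CRITICAL is mapped to HTTP 503.
    path: /system/health/systemready.txt?httpStatus=CRITICAL:503
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```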
I will answer all other questions below [4].
Best Regards
Georg
[1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#health-check-servlet
[2] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
[3] https://github.com/kubernetes/kubernetes/pull/25064
[4]
Sling health checks support the concept of tags, which allows
configuring the special meaning of readiness and liveness as tags. So I
think technically the HC framework should be able to cover our case
too.
exactly
So I would like to extend the felix systemready project to learn from
sling hc and add some of the features there too. I think the most
important things are tags and a solid model for executors. I would be
happy about any help with this from the sling community side.
We really have to start with the Sling HC code and merge in systemready
aspects.
Another question is if we want to add felix systemready to the sling
distro at some point. Would the sling community be interested in this?
yes, the felix health check should be added to the distro in the same
way as the Sling HC is today
... there are ootb healthchecks in AEM [9] and they are NOT used, to my
knowledge, for the load balancer use case.
you rarely run all checks, you always run the checks for a particular
tag you are interested in
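To illustrate what running only the checks for one tag means, here is a
minimal sketch. The classes TaggedCheck and TagSelector are hypothetical
stand-ins and not the Sling HC API; in the real framework the tags come
from the "hc.tags" service property.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a registered check; not the Sling HC API. */
final class TaggedCheck {
    final String name;
    final Set<String> tags;

    TaggedCheck(String name, String... tags) {
        this.name = name;
        this.tags = new HashSet<>(Arrays.asList(tags));
    }
}

final class TagSelector {
    /** Returns the names of all checks carrying the given tag, in list order. */
    static List<String> select(List<TaggedCheck> checks, String tag) {
        return checks.stream()
                .filter(c -> c.tags.contains(tag))
                .map(c -> c.name)
                .collect(Collectors.toList());
    }
}
```

A readiness probe would then only trigger the checks returned for
"systemready", leaving e.g. operations-only checks untouched.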
sling HCs are used for the LB directing traffic... You say there are
many. Could you share some examples?
Commonly used for production deployments (I work for an integration
partner, we use this for all projects across all clients, but others
use it as well, as there were many talks at conferences about it).
I understand the optimization part, and while systemready might clearly
need some optimizations, I personally don't see it as a reasonable
concern. Kubernetes, for example, retries the liveness and readiness
checks a few times before deciding to act. Do 50ms actually matter
here?
yes, 50ms matter (I created this issue after an operations department
refused to use this because of its bad performance; since it was fixed
they have been happy :) )
parallel execution is not necessarily
In my 5 years of experience, parallel execution is absolutely
necessary, otherwise response times get too long.
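A minimal sketch of why parallel execution keeps response times short,
using only java.util.concurrent. ParallelChecks is a hypothetical helper
and not the Sling HC executor; checks are modelled as plain
Callable<String> for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelChecks {
    /**
     * Runs all checks concurrently and returns their results in input order.
     * Wall time is roughly that of the slowest check, not the sum of all.
     */
    static List<String> runAll(List<Callable<String>> checks) {
        // One thread per check; a real executor would use a bounded pool.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, checks.size()));
        try {
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(checks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running checks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a check threw an exception", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With four checks of about 200ms each, a sequential run needs about
800ms, while this sketch returns after roughly the duration of one
check.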
Async is not an issue
I don't think async is ideal for what you are doing at the moment (the
current systemready impl with default config possibly delays the correct
result for 5 sec, this is not a good idea IMHO)
There is no separate api bundle yet
True. Somebody needs to explain to me why that is a big deal for a very
small tool (I'm not that experienced with that matter)
See SLING-6773
Yes, timeouts can make it easier to not _accidentally_ do something
bad, but then we're opening a new dimension of complexity. What happens
if a check times out?...
We have discussed this in detail some years ago, we have a good solution
(WARN by default, CRITICAL after a configurable time). Note: In the HC
world you don't take instances offline for WARN, only for CRITICAL.
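The timeout rule above can be sketched as follows. TimeoutStatus and its
threshold parameter are hypothetical names for illustration, not the
actual Sling HC executor code: a check that has timed out reports WARN
until it has been running longer than a configurable limit, after which
it reports CRITICAL, and only CRITICAL takes an instance offline.

```java
final class TimeoutStatus {
    enum Status { OK, WARN, CRITICAL }

    /**
     * Status for a check that has exceeded its timeout and is still running.
     * criticalAfterMs is an assumed, configurable escalation threshold.
     */
    static Status onTimeout(long runningForMs, long criticalAfterMs) {
        return runningForMs >= criticalAfterMs ? Status.CRITICAL : Status.WARN;
    }

    /** Only CRITICAL removes an instance from the load balancer. */
    static boolean shouldTakeOffline(Status status) {
        return status == Status.CRITICAL;
    }
}
```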
...developers for the platform will be confused about the two SPI
interfaces HealthCheck [7] and SystemReadyCheck [8], there will be
many unnecessary discussions around when to use which one
I don't really agree about the reasoning. If we do make a bridge, they
are layered and we can keep options open.
Please no bridge and no duplicate SPI interface! What option would you
keep open? I cannot think of anything. Please note that the functional
scope of systemready is fully covered by HCs. The AEM platform has
suffered numerous times from the "too many options problem" - I work at
a service provider and know exactly how much time is completely wasted
by people discussing all these different options. Please note the
problem is at scale: It will affect thousands of developers!
But I actually agree about moving them to Felix, for slightly different
reasons, which is exposure and decoupling.
great :)
It's being used in AEM already (alpha, beta).
I think you should try using ootb health checks as described at the top
of this email.
I respectfully disagree about the KISS part - if anything systemready
is KISS - as simple as possible, disregarding limitations that don't
matter for the single usecase it covers. But I actually agree a bridge
per se is not an ideal solution.
Bridges are not KISS but ugly (extra code, hard to
understand/troubleshoot, extra bugs). As for systemready being KISS:
yes it's easy, but it does not help being KISS while disregarding some
important parts. HCs are KISS in a way that they solve the problem in
the easiest possible way (I believe).
But what Stefan was saying doesn't match what you're proposing and what
you're proposing is not part of the -decision- consensus you reached
during the hackathon. Or did I misunderstand?
Stefan's wording maybe wasn't perfect. But the agreement at the
Hackathon was to move Sling HC to Felix and merge in useful things from
systemready, using Sling HCs as the base.
wouldn't it make more sense to have the Sling HCs codebase *extend*
systemready?
This won't work. The health check executor is the heart of it (with all
the handling we've discussed) and needs to be taken as base.
there will be a bridge already between what goes into felix and the
Sling HCs in sling
only a temporary bridge with a very simple impl and a deprecated SPI.
Responsibility will be clearly moved to the felix health check module.
On 2018-09-24 12:05, Christian Schneider wrote:
I discussed with Stefan and Georg at adaptto about sling hc and felix
systemready.
For me the main advantage of systemready being at felix is that it
attracts a lot more people / projects than a sling subproject. People
outside the sling community simply do not use parts of sling for other
purposes.
One example of this is that Kai Kreuzer from Openhab approached me to
discuss how systemready could fit for openhab. We will also discuss with
Peter Kriens at Eclipsecon how the aggregate state service overlaps with
systemready. So I think actually sling hc would have been a good case
for bringing to felix from the start.
So I would like to extend the felix systemready project to learn from
sling hc and add some of the features there too. I think the most
important things are tags and a solid model for executors. I would be
happy about any help with this from the sling community side.
As some people already use sling hc with load balancers I think it also
makes sense to allow reusing sling health checks in system ready.
Another question is if we want to add felix systemready to the sling
distro at some point. Would the sling community be interested in this?
Christian
On Thu, 13 Sep 2018 at 19:03, Stefan Seifert <sseif...@pro-vision.de> wrote:
- currently there is some overlap between sling health checks and the
  new felix system readiness framework presented [1]
- the idea is to bring this together within felix
- provide a facade for the sling healthcheck API for backwards
  compatibility
stefan
[1] https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html