Hi all,
I implemented a health check infrastructure plus a whole set of checks a
year ago (in production now). This year I found out that there is a
similar effort on the Sling side [1]. I just had a closer look at the
Sling code: I like some of the concepts but believe a few other things
could be improved. I'd be happy to contribute some parts from my
side if the Sling community is interested. Here is a rough overview
of the two approaches:
****** "My HC Infrastructure"
* Health Checks are OSGi Components that implement an interface, almost
exactly in line with org.apache.sling.hc.api.HealthCheck
* There is an emphasis on getting the overall status of the system:
There is a Web Console Plugin and a whiteboard servlet (not dependent
on Sling) to retrieve an aggregated result of all health checks
registered as services
* The result of an individual health check can be RED, AMBER or GREEN -
the overall result is the worst result found
* The servlet returns the results of all checks as HTML, JSON or
JSONP (contains the overall result + the result of each check in a
structured, machine-readable format)
* There are custom checks for the project to make sure a few SOAP and
REST services are available - if they fail they return AMBER. AMBER
means someone should pay attention, but the system itself is still
stable.
* All individual checks are executed in parallel by a class
HealthCheckRunner (using Futures/ExecutorService under the hood). The
advantage is that the overall result can always be calculated quickly
(the latency of the SOAP/REST checks in particular required this!). The
HealthCheckRunner makes sure that the number of threads used for this is
limited to the number of registered health checks. If one check hangs,
the runner handles it correctly using timeout settings that are
configurable in the OSGi console (it took a while to get rid of all
problems of parallel execution, but now it's rock-solid, the only
downside being some extra threads/memory required).
* The health check is used by a monitoring system on customer side
(similar to Nagios)
* The health check servlet has a parameter to return HTTP 500 if the
overall result is RED, this is used by the load balancer of the
publisher servers to automatically take out failing instances (this is
only possible because of the parallel execution)
* I had a Jenkins plugin in place to show an overview page of 10+ CQ
instances using the JSON results (DEV/TEST/INT AUTHOR/PUBLISH etc.)
* No JMX integration
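
To make the parallel execution and the "worst result wins" aggregation
more concrete, here is a minimal, self-contained sketch. It is not the
production code: the class name ParallelHealthCheckSketch, the use of
plain Callables in place of OSGi services, and the rule that a hanging
or failing check counts as RED are all assumptions made for
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and the RED-on-timeout rule are assumptions.
class ParallelHealthCheckSketch {

    /** Ordered from best to worst, so the worst status compares highest. */
    enum Status { GREEN, AMBER, RED }

    /** Runs all checks in parallel, one thread per check, with a shared timeout. */
    static Map<String, Status> runAll(Map<String, Callable<Status>> checks, long timeoutMs)
            throws InterruptedException {
        Map<String, Status> results = new LinkedHashMap<>();
        if (checks.isEmpty()) {
            return results;
        }
        // Thread count is bounded by the number of registered checks.
        ExecutorService pool = Executors.newFixedThreadPool(checks.size());
        try {
            List<String> names = new ArrayList<>(checks.keySet());
            List<Callable<Status>> tasks = new ArrayList<>(checks.values());
            // invokeAll returns once all tasks are done or the timeout expires;
            // tasks still running at that point are cancelled.
            List<Future<Status>> futures =
                    pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
            for (int i = 0; i < names.size(); i++) {
                Status status;
                try {
                    status = futures.get(i).get();
                } catch (CancellationException | ExecutionException e) {
                    // Assumption: a timed-out or throwing check is reported as RED.
                    status = Status.RED;
                }
                results.put(names.get(i), status);
            }
        } finally {
            pool.shutdownNow(); // interrupt anything still hanging
        }
        return results;
    }

    /** The overall result is the worst individual result (GREEN if there are none). */
    static Status overall(Map<String, Status> results) {
        Status worst = Status.GREEN;
        for (Status s : results.values()) {
            if (s.compareTo(worst) > 0) {
                worst = s;
            }
        }
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Callable<Status>> checks = new LinkedHashMap<>();
        checks.put("repository", () -> Status.GREEN);
        checks.put("soap-backend", () -> Status.AMBER);
        checks.put("hanging", () -> { Thread.sleep(5_000); return Status.GREEN; });
        Map<String, Status> results = runAll(checks, 500);
        System.out.println(results + " overall=" + overall(results));
    }
}
```

In this sketch the hanging check is cut off at the timeout, so the
overall result is available after roughly the timeout rather than after
the slowest check. Creating a pool per run keeps the example short; a
long-lived pool would be the more likely choice in a real bundle.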
****** SLING Health Check (as of today)
* Core defines API and some utility classes. The result contains log
entries for each check.
* The core itself is not able to run checks (if I got that right)
* Tags are used to be able to run a group of checks (I quite like
this!)
* The Web Console Plugin gives humans a nice interface to run all
checks for a given tag (or all checks if the tags are omitted).
Execution is sequential and can potentially take a long time (this
really depends on the checks)
* JMX exposes the status of an individual health check, but it is not
possible to retrieve an overall status via JMX (if I got that right)
* There is no way to retrieve an overall result in JSON (if I got that
right)
* There is an example for async execution (AsyncHealthCheckSample) -
however, this aspect has to be implemented again for every check that
needs async execution
As a first step, I would like to propose the following:
* Introduce HealthCheckRunner to hc-core with the following signature:
    List<Result> HealthCheckRunner.runAllForTags(String... tags)
  The returned list is sorted so that failed checks always come first.
* The HealthCheckRunner would use the existing class HealthCheckFilter
to retrieve the service references
* The Web Console would be adjusted to use HealthCheckRunner
* I would add getExecutionTimeInMs() to org.apache.sling.hc.api.Result
* Add parameter format=json to /system/console/healthcheck to provide
the result in JSON format (to avoid an extra servlet, I think it is
possible for console urls to return JSON but I would have to check)
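
A rough sketch of what the proposed runner could look like follows. To
keep it self-contained it does not use the Sling classes: Check and
Result below are hypothetical stand-ins (not org.apache.sling.hc.api),
"ok / not ok" replaces the real status values, and selecting a check
when it carries any of the requested tags is an assumption of this
sketch, not a statement about how HealthCheckFilter matches. Parallel
execution is left out here; only the signature, the sort order and the
execution time are shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the proposed runner; not the Sling API.
class HealthCheckRunnerSketch {

    /** Stand-in for a registered health check service: name, tags, logic. */
    static final class Check {
        final String name;
        final Set<String> tags;
        final BooleanSupplier logic;

        Check(String name, BooleanSupplier logic, String... tags) {
            this.name = name;
            this.logic = logic;
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    /** Stand-in for Result, extended with the proposed execution time. */
    static final class Result {
        final String checkName;
        final boolean ok;
        final long executionTimeInMs;

        Result(String checkName, boolean ok, long executionTimeInMs) {
            this.checkName = checkName;
            this.ok = ok;
            this.executionTimeInMs = executionTimeInMs;
        }

        boolean isOk() { return ok; }
        long getExecutionTimeInMs() { return executionTimeInMs; }
    }

    private final List<Check> registered;

    HealthCheckRunnerSketch(List<Check> registered) {
        this.registered = registered;
    }

    /** Runs every check matching one of the tags; no tags means all checks. */
    List<Result> runAllForTags(String... tags) {
        Set<String> wanted = new HashSet<>(Arrays.asList(tags));
        List<Result> results = new ArrayList<>();
        for (Check check : registered) {
            boolean selected = wanted.isEmpty()
                    || check.tags.stream().anyMatch(wanted::contains);
            if (!selected) {
                continue;
            }
            long start = System.nanoTime();
            boolean ok;
            try {
                ok = check.logic.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // a throwing check counts as failed
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            results.add(new Result(check.name, ok, elapsedMs));
        }
        // Failed results first (false sorts before true); the sort is stable,
        // so registration order is preserved within each group.
        results.sort(Comparator.comparing(Result::isOk));
        return results;
    }
}
```

A caller such as the Web Console plugin could then render the list
top-down and show the failures first without any sorting of its own.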
Let me know what you think - as everything is there already, I could
provide a patch for this fairly quickly (but I would only make the
effort to create one if you think it's valuable).
Regards
Georg
[1]
http://www.slideshare.net/bdelacretaz/slinghc-bdelacretazadaptto2013
http://sling.apache.org/documentation/bundles/sling-health-check-tool.html
https://issues.apache.org/jira/browse/SLING/component/12320832