Hi all,

I implemented a health check infrastructure plus a whole set of checks a year ago (it is in production now). This year I found out that a similar effort is under way on the Sling side [1]. I just had a closer look at the Sling code; I like some of the concepts but believe some other things could be improved. I'd be happy to contribute some parts from my side if the Sling community is interested. Here is a rough overview of the two approaches:

****** "My HC Infrastructure"
* Health Checks are OSGi Components that implement an interface, almost exactly in line with org.apache.sling.hc.api.HealthCheck
* There is an emphasis on getting the overall status of the system: There is a Web Console Plugin and a whiteboard servlet (not dependent on Sling) to retrieve an aggregated result of all health checks registered as services
* The result of an individual health check can be RED, AMBER or GREEN - the overall result is the worst result found
* The servlet can return the result of all checks in html, json and jsonp (contains the overall result + the result of each check in a structured, machine-readable format)
* There are custom checks for the project to make sure a few SOAP and REST services are available - if they fail they return AMBER. AMBER means someone should pay attention, but the system itself is still stable.
* All individual checks are executed in parallel by a class HealthCheckRunner (using Futures/ExecutorService under the hood). The advantage is that the overall result can always be calculated quickly (the latency of the SOAP/REST checks in particular required this!). The HealthCheckRunner makes sure that the number of threads used for this is limited to the number of registered health checks, and if one check hangs it handles this correctly using timeout settings configurable in the OSGi console (it took a while to get rid of all the problems of parallel execution, but now it's rock-solid, the only downside being some extra threads/memory required).
* The health check is used by a monitoring system on the customer side (similar to Nagios)
* The health check servlet has a parameter to return HTTP 500 if the overall result is RED; this is used by the load balancer of the publisher servers to automatically take out failing instances (this is only possible because of the parallel execution)
* I had a Jenkins plugin in place to show an overview page of 10+ CQ instances using the JSON results (DEV/TEST/INT AUTHOR/PUBLISH etc.)
* No JMX integration
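
To illustrate the parallel execution and the "worst result wins" aggregation described above, here is a minimal sketch. It is not my actual HealthCheckRunner: the class name ParallelHealthCheckSketch, the Status enum and the method names are made up for this mail, and treating a timed-out or failing check as RED is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Sling.
class ParallelHealthCheckSketch {

    /** Declared from best to worst so that ordinal() can be compared. */
    public enum Status { GREEN, AMBER, RED }

    /**
     * Runs all checks in parallel and waits at most timeoutMs for the whole batch.
     * Assumption: a check that throws or is still running at the timeout counts as RED.
     */
    static Map<String, Status> runAll(Map<String, Callable<Status>> checks, long timeoutMs)
            throws InterruptedException {
        Map<String, Status> results = new LinkedHashMap<>();
        if (checks.isEmpty()) {
            return results;
        }
        // One thread per check, so the pool size is bounded by the number of checks.
        ExecutorService pool = Executors.newFixedThreadPool(checks.size());
        try {
            List<String> names = new ArrayList<>(checks.keySet());
            List<Callable<Status>> tasks = new ArrayList<>(checks.values());
            // invokeAll returns the futures in task order and cancels unfinished tasks at the timeout.
            List<Future<Status>> futures = pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
            for (int i = 0; i < names.size(); i++) {
                Status status;
                try {
                    status = futures.get(i).get();
                } catch (CancellationException | ExecutionException e) {
                    status = Status.RED; // timed out or threw an exception
                }
                results.put(names.get(i), status);
            }
        } finally {
            pool.shutdownNow(); // interrupts checks that are still hanging
        }
        return results;
    }

    /** The overall result is the worst individual result; no results at all means GREEN. */
    static Status worst(Collection<Status> statuses) {
        Status overall = Status.GREEN;
        for (Status s : statuses) {
            if (s.ordinal() > overall.ordinal()) {
                overall = s;
            }
        }
        return overall;
    }
}
```

With this shape, the time to compute the overall result is bounded by the timeout rather than by the sum of all check durations, which is what makes it usable for a load balancer probe.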

****** SLING Health Check (as of today)
* Core defines API and some utility classes. The result contains log entries for each check.
* The core itself is not able to run checks (if I got that right)
* Tags are used to be able to run a group of checks (I quite like this!)
* The Web Console Plugin gives humans a nice interface to run all checks for a given tag (or all checks if the tags are omitted). Execution is sequential and can potentially take a long time (depending on the checks)
* JMX allows getting the status of a certain health check, but it is not possible to retrieve an overall status via JMX (if I got that right)
* There is no way to retrieve an overall result in JSON (if I got that right)
* There is an example for async execution (AsyncHealthCheckSample) - however, this aspect has to be implemented again for every check that needs async execution

As a first step, I would like to propose the following:
* Introduce HealthCheckRunner to hc-core with the following signature:
  List<Result> HealthCheckRunner.runAllForTags(String... tags) // the list is sorted to put failed ones always on top
* The HealthCheckRunner would use the existing class HealthCheckFilter to retrieve the service references
* The Web Console would be adjusted to use HealthCheckRunner
* I would add getExecutionTimeInMs() to org.apache.sling.hc.api.Result
* Add parameter format=json to /system/console/healthcheck to provide the result in JSON format (to avoid an extra servlet, I think it is possible for console urls to return JSON but I would have to check)
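
To make the proposed contract a bit more concrete, here is a rough sketch of what runAllForTags could do: select checks by tag (all checks if no tags are given), record the execution time of each, and sort failed results to the top. It deliberately uses hypothetical stand-in types (Check, TimedResult) instead of the real org.apache.sling.hc.api classes and HealthCheckFilter, it runs sequentially for brevity, and matching a check when it carries any of the given tags is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; the types below are stand-ins, not Sling API.
class HealthCheckRunnerSketch {

    /** Stand-in for a registered health check: a name, its tags, and the check logic. */
    static final class Check {
        final String name;
        final Set<String> tags;
        final BooleanSupplier body; // returns true if the check passed

        Check(String name, Set<String> tags, BooleanSupplier body) {
            this.name = name;
            this.tags = tags;
            this.body = body;
        }
    }

    /** Stand-in for a Result extended with the proposed execution time. */
    static final class TimedResult {
        final String name;
        final boolean ok;
        final long executionTimeInMs;

        TimedResult(String name, boolean ok, long executionTimeInMs) {
            this.name = name;
            this.ok = ok;
            this.executionTimeInMs = executionTimeInMs;
        }
    }

    /** Runs every check that carries at least one of the tags, or all checks if no tags are given. */
    static List<TimedResult> runAllForTags(List<Check> checks, String... tags) {
        List<String> wanted = Arrays.asList(tags);
        List<TimedResult> results = new ArrayList<>();
        for (Check check : checks) {
            boolean selected = wanted.isEmpty() || wanted.stream().anyMatch(check.tags::contains);
            if (!selected) {
                continue;
            }
            long start = System.nanoTime();
            boolean ok;
            try {
                ok = check.body.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // a check that throws is reported as failed
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            results.add(new TimedResult(check.name, ok, elapsedMs));
        }
        // false sorts before true, so failed checks come first; the sort is stable,
        // so registration order is kept within the failed and passed groups.
        results.sort(Comparator.comparing((TimedResult r) -> r.ok));
        return results;
    }
}
```

Both the Web Console and a JSON rendering could then iterate over the same sorted list.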

Let me know what you think - as everything is there already, I could provide a patch for this fairly quickly (but I will only make the effort to create one if you think it's valuable).

Regards
Georg

[1]
http://www.slideshare.net/bdelacretaz/slinghc-bdelacretazadaptto2013
http://sling.apache.org/documentation/bundles/sling-health-check-tool.html
https://issues.apache.org/jira/browse/SLING/component/12320832
