Joerg Hoh created FELIX-6663:
--------------------------------
Summary: Warn if healthcheck execution takes too long
Key: FELIX-6663
URL: https://issues.apache.org/jira/browse/FELIX-6663
Project: Felix
Issue Type: Task
Components: Health Checks
Affects Versions: healthcheck.core 2.2.0
Reporter: Joerg Hoh
We monitor our system using Felix Healthchecks and require that some
healthchecks are reported OK at least every 5 seconds. For this we configured
the timeout in the HealthCheckOptions to 5 seconds.
On rare occasions, however, the system goes unhealthy without a healthcheck
being executed. It even seems that none of the required healthchecks is
executed during that time at all.
I have already ruled out a few obvious causes (full GC, maxed-out CPU), but I
still have a few cases which I cannot explain yet. While checking the code, I
also found that on every invocation of HealthCheckExecutor.execute() all
metadata for the healthchecks is collected, which requires access to the OSGi
service registry. My application also has situations where the service
registry is accessed heavily, and these accesses can suffer from lock
contention under load. The time spent there is not included in the timeout
calculation of the healthchecks.
As a first step I would like to add some more logging in case the overall
execution of the healthchecks exceeds the configured timeout.
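
A minimal sketch of the kind of check I have in mind, not the actual change.
The class TimedExecution and the method runTimed are hypothetical names and are
not part of healthcheck.core. The idea is to measure the whole call, including
the metadata lookup in the service registry, and to warn if the elapsed time
exceeds the configured timeout:

```java
import java.time.Duration;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper, not part of the Felix Health Check API.
final class TimedExecution {

    private TimedExecution() {
    }

    /**
     * Runs the task, measures its wall-clock duration and passes a message
     * to the warn callback if the duration exceeds the given timeout.
     * The task's result (or exception) is passed through unchanged.
     */
    static <T> T runTimed(Supplier<T> task, Duration timeout, Consumer<String> warn) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            long elapsedNanos = System.nanoTime() - start;
            if (elapsedNanos > timeout.toNanos()) {
                warn.accept(String.format(
                        "Healthcheck execution took %d ms, exceeding the configured timeout of %d ms",
                        elapsedNanos / 1_000_000, timeout.toMillis()));
            }
        }
    }
}
```

In the executor, the task would presumably wrap the complete execution
(metadata collection plus running the checks), and the warn callback would
forward to the existing logger. Measuring in a finally block means the warning
is also emitted when the execution fails with an exception.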
--
This message was sent by Atlassian Jira
(v8.20.10#820010)