[
https://issues.apache.org/jira/browse/FELIX-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781032#comment-17781032
]
Georg Henzler commented on FELIX-6663:
--------------------------------------
[~joerghoh] Access to the OSGi service registry is synchronous (and expected to
be very quick!). If there is congestion there, the configured timeouts would not
make a difference; the request to the HC servlet would just return later
(the timeout takes
[HealthCheckExecutorImpl.java#L383|https://github.com/apache/felix-dev/blob/5b9162d13ffa750d86a240d4c9b41645d511c72c/healthcheck/core/src/main/java/org/apache/felix/hc/core/impl/executor/HealthCheckExecutorImpl.java#L383]
as its starting point).
So if we assume congestion in the OSGi service registry lasting, say, 10 sec
and an HC timeout of 5 sec, the executor (and the servlet calling it) would
return after 15 sec and, most importantly, the HCs would still be called. But
maybe there is also a timeout on the caller side that makes it look like an
"empty result". I agree the PR will be useful for further analysis; I have
just merged it.
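
To illustrate the arithmetic above, here is a minimal, self-contained sketch. It is not the actual HealthCheckExecutorImpl code: the class TimeoutStartSketch, its run() method and the sleep-based "registry lookup" are hypothetical stand-ins, and the durations are scaled down from seconds to milliseconds. It only shows that when the timeout clock starts after a synchronous lookup, the caller waits roughly lookup time plus timeout, and the check is still submitted.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not part of Apache Felix Health Checks.
final class TimeoutStartSketch {

    static final class Result {
        final String status;
        final long totalMillis;

        Result(String status, long totalMillis) {
            this.status = status;
            this.totalMillis = totalMillis;
        }
    }

    static Result run(long congestionMillis, long checkMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        long requestStart = System.nanoTime();
        String status;
        try {
            // Stand-in for the synchronous metadata lookup in the service
            // registry. It happens before the timeout clock starts.
            Thread.sleep(congestionMillis);

            // The check is submitted only now; the timeout is measured from here.
            Future<String> check = pool.submit(() -> {
                Thread.sleep(checkMillis);
                return "OK";
            });
            try {
                status = check.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                check.cancel(true);
                status = "TIMED_OUT";
            } catch (ExecutionException e) {
                status = "FAILED";
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            status = "INTERRUPTED";
        } finally {
            pool.shutdownNow();
        }
        long totalMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - requestStart);
        return new Result(status, totalMillis);
    }

    public static void main(String[] args) {
        // Scaled-down version of the example: 10 s congestion, 5 s timeout.
        Result r = run(100, 1_000, 50);
        System.out.println(r.status + " after ~" + r.totalMillis + " ms");
    }
}
```

With these inputs the caller should see a timed-out check only after about the sum of both delays, which is why a shorter timeout on the caller side could look like an empty result.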
> Warn if healthcheck execution takes too long
> --------------------------------------------
>
> Key: FELIX-6663
> URL: https://issues.apache.org/jira/browse/FELIX-6663
> Project: Felix
> Issue Type: Task
> Components: Health Checks
> Affects Versions: healthcheck.core 2.2.0
> Reporter: Joerg Hoh
> Priority: Major
>
> We monitor our system using Felix Healthchecks and require that some
> healthchecks are reported OK at least every 5 seconds. For this we configured
> the timeout in the HealthCheckOptions to 5 seconds.
> But on rare occasions we face the situation that the system goes unhealthy
> without a healthcheck being executed. It even seems that none of the required
> healthchecks are executed during that time at all.
> I already ruled out a few obvious cases (full GC, maxed out CPU), but I still
> have a few cases which I cannot explain yet. Also, while checking the code, I
> found that on every invocation of HealthcheckExecutor.execute() all
> metadata for the healthchecks is collected, which requires access to the OSGi
> service registry. My application also has situations where the service
> registry is accessed heavily, which can suffer from lock contention under
> load, and that time is not included in the timeout calculation of the
> healthchecks.
> As a first step I would like to add some more logging in case the overall
> execution of the healthchecks exceeds the configured timeout.