[ https://issues.apache.org/jira/browse/FELIX-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781032#comment-17781032 ]
Georg Henzler commented on FELIX-6663:
--------------------------------------

[~joerghoh] Access to the OSGi service registry is synchronous (and expected
to be very quick!). If there is congestion there, the configured timeouts
would not make a difference; the request to the HC servlet would just return
later (the timeout takes
[HealthCheckExecutorImpl.java#L383|https://github.com/apache/felix-dev/blob/5b9162d13ffa750d86a240d4c9b41645d511c72c/healthcheck/core/src/main/java/org/apache/felix/hc/core/impl/executor/HealthCheckExecutorImpl.java#L383]
as its starting point).

So if we assume congestion in the OSGi service registry for, let's say,
10 sec and the timeout of the HC is 5 sec, the executor (and the servlet
calling it) would return after 15 sec and, most importantly, the HCs would
still be called. But maybe there is also a timeout on the caller side that
makes it look like an "empty result".

I agree the PR will be useful for further analysis; I have just merged it.

> Warn if healthcheck execution takes too long
> --------------------------------------------
>
>                 Key: FELIX-6663
>                 URL: https://issues.apache.org/jira/browse/FELIX-6663
>             Project: Felix
>          Issue Type: Task
>          Components: Health Checks
>    Affects Versions: healthcheck.core 2.2.0
>            Reporter: Joerg Hoh
>            Priority: Major
>
> We monitor our system using Felix Healthchecks and require that some
> healthchecks are reported OK at least every 5 seconds. For this we configured
> the timeout in the HealthCheckOptions to 5 seconds.
> But on rare occasions the system goes unhealthy without a healthcheck being
> executed. It even seems that none of the required healthchecks is executed
> during that time at all.
> I already ruled out a few obvious cases (full GC, maxed out CPU), but I still
> have a few cases which I cannot explain yet.
> Also while checking the code, I found that on every invocation of
> HealthcheckExecutor.execute() all metadata for the healthchecks is
> collected, which requires access to the OSGi service registry. My
> application also has situations where a lot of access to the service
> registry happens, which can suffer from lock contention under load, and that
> is not included in the timeout calculation of the healthchecks.
> As a first step I would like to add some more logging in case the overall
> execution of the healthchecks exceeds the configured timeout.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
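
The timing argument in the comment above (10 sec of registry congestion plus a 5 sec HC timeout giving a response after 15 sec) can be sketched as a small model. This is only an illustration under the assumption stated in the comment, namely that the timeout clock starts after the synchronous service registry lookup; the class and method names are hypothetical and are not part of Apache Felix.

```java
/**
 * Illustrative model only; this is not code from Apache Felix.
 * Assumption: the HC timeout clock starts after the synchronous
 * service registry lookup has completed.
 */
final class HcTimeoutModel {

    /**
     * Seconds until the executor (and the servlet calling it) returns.
     *
     * @param registryDelaySec time spent waiting on the service registry
     * @param hcRuntimeSec     time the health check itself needs
     * @param hcTimeoutSec     configured health check timeout
     */
    static long secondsUntilResponse(long registryDelaySec,
                                     long hcRuntimeSec,
                                     long hcTimeoutSec) {
        // The registry lookup is synchronous, so its delay is paid in full
        // before the timeout starts counting.
        long waitedForHcSec = Math.min(hcRuntimeSec, hcTimeoutSec);
        return registryDelaySec + waitedForHcSec;
    }

    public static void main(String[] args) {
        // Congested registry (10 sec), slow HC, 5 sec timeout.
        System.out.println(secondsUntilResponse(10, 60, 5)); // prints 15
        // No congestion, fast HC: the timeout is never reached.
        System.out.println(secondsUntilResponse(0, 1, 5));   // prints 1
    }
}
```

In this model the registry delay only postpones the response; it never suppresses the health check call itself, which matches the point that an "empty result" would more likely come from a timeout on the caller side.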