[ 
https://issues.apache.org/jira/browse/FELIX-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781032#comment-17781032
 ] 

Georg Henzler commented on FELIX-6663:
--------------------------------------

[~joerghoh] Access to the OSGi service registry is synchronous (and expected to 
be very quick!). If there is congestion there, the configured timeouts would not 
make a difference; the request to the HC servlet would just return later 
(the timeout takes 
[HealthCheckExecutorImpl.java#L383|https://github.com/apache/felix-dev/blob/5b9162d13ffa750d86a240d4c9b41645d511c72c/healthcheck/core/src/main/java/org/apache/felix/hc/core/impl/executor/HealthCheckExecutorImpl.java#L383]
 as its starting point). 

So if we assume congestion in the OSGi service registry lasting, say, 10 sec 
and an HC timeout of 5 sec, the executor (and the servlet calling it) would 
return after 15 sec and, most importantly, the HCs would still be called. But 
maybe there is also a timeout on the caller side that makes it look like an 
"empty result". Either way, I agree the PR will be useful for further 
analysis; I have just merged it.
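
To make the arithmetic above concrete, here is a minimal, self-contained sketch. It is not the Felix implementation: the class TimeoutStartSketch, its Outcome type and the execute() parameters are all hypothetical, and the registry congestion is simulated with a plain sleep. It only illustrates that a timeout measured from after the synchronous lookup does not shorten the overall call, and that the check is still invoked. The durations are scaled down (100 ms for the 10 sec congestion, 50 ms for the 5 sec timeout).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not Felix code: shows when the timeout clock starts.
public class TimeoutStartSketch {

    /** What one simulated executor call observed. */
    public static final class Outcome {
        public final long elapsedMs;       // wall-clock time seen by the caller
        public final boolean checkInvoked; // was the health check called at all?
        public final boolean timedOut;     // did the per-check timeout fire?

        Outcome(long elapsedMs, boolean checkInvoked, boolean timedOut) {
            this.elapsedMs = elapsedMs;
            this.checkInvoked = checkInvoked;
            this.timedOut = timedOut;
        }
    }

    public static Outcome execute(long registryDelayMs, long checkDurationMs,
                                  long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicBoolean invoked = new AtomicBoolean(false);
        long start = System.nanoTime();
        try {
            // Step 1: synchronous "registry lookup" (assumed congested).
            // The timeout does not cover this part.
            Thread.sleep(registryDelayMs);

            // Step 2: only now is the check submitted and the timeout applied.
            Future<Void> result = pool.submit(() -> {
                invoked.set(true);
                Thread.sleep(checkDurationMs); // a slow health check
                return null;
            });

            boolean timedOut = false;
            try {
                result.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                timedOut = true;
            }

            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            return new Outcome(elapsedMs, invoked.get(), timedOut);
        } finally {
            pool.shutdownNow(); // interrupt the simulated check
        }
    }

    public static void main(String[] args) throws Exception {
        Outcome o = execute(100, 1_000, 50);
        System.out.println("elapsed ms: " + o.elapsedMs
                + ", check invoked: " + o.checkInvoked
                + ", timed out: " + o.timedOut);
    }
}
```

In this sketch the caller waits for roughly the lookup delay plus the timeout, and the check is still invoked, which matches the 10 sec + 5 sec = 15 sec reasoning above. An "empty result" would therefore have to come from somewhere else, for example a caller-side timeout shorter than that sum.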



> Warn if healthcheck execution takes too long
> --------------------------------------------
>
>                 Key: FELIX-6663
>                 URL: https://issues.apache.org/jira/browse/FELIX-6663
>             Project: Felix
>          Issue Type: Task
>          Components: Health Checks
>    Affects Versions: healthcheck.core 2.2.0
>            Reporter: Joerg Hoh
>            Priority: Major
>
> We monitor our system using Felix Healthchecks and require that some 
> healthchecks are reported OK at least every 5 seconds. For this we configured 
> the timeout in the HealthCheckOptions to 5 seconds.
> But on rare occasions the system goes unhealthy without a 
> healthcheck being executed. It even seems that none of the required 
> healthchecks is executed during that time at all.
> I already ruled out a few obvious cases (full GC, maxed out CPU), but I still 
> have a few cases which I cannot explain yet. Also, while checking the code, I 
> found that on every invocation of HealthcheckExecutor.execute() all 
> metadata for the healthchecks is collected, which requires access to the OSGi 
> service registry. My application also has situations where a lot of access to 
> the service registry happens, which can suffer from lock contention under 
> load, and that is not included in the timeout calculation of the 
> healthchecks.
> As a first step I would like to add some more logging in case the overall 
> execution of the healthchecks exceeds the configured timeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
