Re: Long running health checks, jmx registration and concurrent invocation

Bertrand Delacretaz Tue, 22 Oct 2013 05:40:42 -0700

Hi,

On Tue, Oct 22, 2013 at 1:06 PM, Carsten Ziegeler <[email protected]> wrote:
> ...According to the API health checks are considered to execute quickly -
> which is fine. However there is no prevention against it. I'm not sure if
> we should do this, but e.g. the EventAdmin blacklists long running health
> checks after their first invocation...


I agree that preventing slow HealthCheck.execute() methods is a good idea.

> ...This gets even more tricky as health checks are registered as mbeans with
> only attributes and no methods...

Right, enforcing fast execution looks like the right thing to do.

> ...All of this can be solved easily, if we stick to "health check execution
> should be fast and not expensive". In that case we might add black listing.
> Things like a progress bar etc. have to be done through whatever mechanism
> is used to execute the hc asynchronously....

Instead of permanent blacklisting I'd suggest returning a normal
Result but with a TIMEOUT status.

For example, a health check that triggers lots of initialization (maybe
because it's called right after Sling startup) might be quite slow on
the first call and fast afterwards. Returning TIMEOUT on the first call
and actual results (maybe computed asynchronously) later makes sense in
that case, whereas permanent blacklisting would get in the way.
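
A rough sketch of the "TIMEOUT now, actual result later" idea, in plain
Java. Everything in it (LateResultSketch, LateResultCheck, the Status
values, the Result class) is hypothetical and only illustrates the idea;
none of it is existing Sling health check API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch, not the Sling API. */
class LateResultSketch {

    enum Status { OK, ERROR, TIMEOUT }

    static final class Result {
        final Status status;
        final String message;
        Result(Status status, String message) {
            this.status = status;
            this.message = message;
        }
    }

    /** Returns TIMEOUT for a slow run, but lets that run finish in the background. */
    static final class LateResultCheck {
        private final Callable<Result> check;
        private final long timeoutMs;
        private final ExecutorService pool;
        private Future<Result> pending; // run still in flight after a timeout, if any

        LateResultCheck(Callable<Result> check, long timeoutMs, ExecutorService pool) {
            this.check = check;
            this.timeoutMs = timeoutMs;
            this.pool = pool;
        }

        // synchronized: concurrent callers are serialized instead of starting parallel runs
        synchronized Result execute() {
            // Reuse the run left over from an earlier timeout instead of starting a new one.
            Future<Result> future = (pending != null) ? pending : pool.submit(check);
            pending = null;
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                pending = future; // not cancelled: a later call picks up its result
                return new Result(Status.TIMEOUT, "no result within " + timeoutMs + " ms");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return new Result(Status.ERROR, "interrupted while waiting");
            } catch (ExecutionException e) {
                return new Result(Status.ERROR, String.valueOf(e.getCause()));
            }
        }
    }
}
```

With a check that needs 300 ms once and a 200 ms timeout, the first
execute() call would report TIMEOUT and the second would return the
result of that same run, without invoking the check again.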

I suggest implementing a timeout around the HealthCheck.execute() call
(not sure how, maybe in a HealthCheckExecutor service): when it expires,
the caller gets a Result with a TIMEOUT state that indicates how long
the timeout was. We could also add short-term blacklisting of the
HealthCheck, during which it returns a BLACKLISTED state result.
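
To make the proposal concrete, here is a minimal sketch in plain Java of
a wrapper that enforces the timeout and then blacklists the check for a
short period. All names (TimeoutSketch, TimedCheck, the TIMEOUT and
BLACKLISTED values, the Result class) are assumptions made up for
illustration; this is not the existing Sling API, and how it would fit
into a HealthCheckExecutor service is left open.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch, not the Sling API. */
class TimeoutSketch {

    enum Status { OK, ERROR, TIMEOUT, BLACKLISTED }

    static final class Result {
        final Status status;
        final String message;
        Result(Status status, String message) {
            this.status = status;
            this.message = message;
        }
    }

    /** Runs a check with a timeout; after a timeout the check is skipped for a while. */
    static final class TimedCheck {
        private final Callable<Result> check;
        private final long timeoutMs;
        private final long blacklistMs;
        private final ExecutorService pool;
        private long blacklistedUntil = 0; // epoch millis; 0 means not blacklisted

        TimedCheck(Callable<Result> check, long timeoutMs, long blacklistMs,
                   ExecutorService pool) {
            this.check = check;
            this.timeoutMs = timeoutMs;
            this.blacklistMs = blacklistMs;
            this.pool = pool;
        }

        // synchronized: concurrent callers are serialized, one run at a time
        synchronized Result execute() {
            long now = System.currentTimeMillis();
            if (now < blacklistedUntil) {
                // Short-term blacklist: answer immediately without running the check.
                return new Result(Status.BLACKLISTED,
                        "blacklisted for another " + (blacklistedUntil - now) + " ms");
            }
            Future<Result> future = pool.submit(check);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the slow run
                blacklistedUntil = System.currentTimeMillis() + blacklistMs;
                // The message says how long the timeout was, as suggested above.
                return new Result(Status.TIMEOUT, "no result within " + timeoutMs + " ms");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                future.cancel(true);
                return new Result(Status.ERROR, "interrupted while waiting");
            } catch (ExecutionException e) {
                return new Result(Status.ERROR, String.valueOf(e.getCause()));
            }
        }
    }
}
```

Because the blacklist expires on its own, a check that was only slow
right after startup gets another chance later, which permanent
blacklisting would not allow.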

WDYT?

-Bertrand
