[jira] [Comment Edited] (SLING-6855) Create ResultRegistry to provide health check behavior for executing code that does not want a HealthCheck

Clinton H Goudie-Nice (JIRA) Mon, 15 May 2017 08:34:33 -0700

    [ 
https://issues.apache.org/jira/browse/SLING-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010708#comment-16010708
 ]


Clinton H Goudie-Nice edited comment on SLING-6855 at 5/15/17 3:33 PM:
-----------------------------------------------------------------------

[~bdelacretaz]
> I'm missing a way to select the results stored in the registry by tags, how 
> do you see that working?

This is a really good point that I haven't considered. I will take a look today.

[~henzlerg]

> Could it be easier/better to just record metrics? 
I want to approach the notion of mixing metrics and health checks with caution. 

The driving use case for this is a thread about unindexed JCR queries. In the 
event a node traversal is encountered, the system should warn about it for 20 
minutes (or so) to ensure a human looks at the unindexed query and builds an 
index, or changes the query itself.

This might sound like a metric, but it's actually a system performance damaging 
event.

This service could:
1) Implement a health check of its own.
2) Keep a timestamp of the last time an unindexed query was tripped, along with 
some descriptive information.
3) Report a failure while that timestamp is still within the warning window.
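
To make the amount of boilerplate concrete, here is a rough sketch of what those three steps could look like if a service rolled its own check. Everything in it is hypothetical: the class name, the Status enum, and the method names are stand-ins, and a real version would implement the Sling HealthCheck interface and return a Result instead of a plain enum.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical hand-rolled check; names are illustrative only.
class UnindexedQueryCheck {
    enum Status { OK, WARN }

    private final Duration warnWindow;
    private Instant lastTripped;   // step 2: when the last traversal was seen
    private String lastQuery;      // step 2: descriptive information

    UnindexedQueryCheck(Duration warnWindow) {
        this.warnWindow = warnWindow;
    }

    // Called by whatever code detects the traversal.
    synchronized void recordTraversal(String query, Instant when) {
        this.lastQuery = query;
        this.lastTripped = when;
    }

    // Step 3: warn while the last traversal is still inside the window.
    synchronized Status execute(Instant now) {
        if (lastTripped != null && now.isBefore(lastTripped.plus(warnWindow))) {
            return Status.WARN;
        }
        return Status.OK;
    }

    synchronized String lastQuery() {
        return lastQuery;
    }
}
```

Every service that wants this behavior would have to repeat roughly this much state handling, which is the duplication the registry is meant to remove.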

It could have its own implementation of this ResultRegistry, but then this 
boilerplate would be duplicated many, many times across OAK, Sling, etc.

I find this a clearly generalizable pattern. With the registry, all I need is to 
@Reference ResultRegistry health; and then I can simply call:

Calendar c = Calendar.getInstance();
c.add(Calendar.HOUR, 1);
health.put(this.getClass().getName() + ":myMethod",
    new Result(Result.Status.WARN, "Unindexed query {somequery} encountered"), c);


An additional use case: if the event queue for Sling or OAK overflows, we 
experience data loss, and performance degrades greatly.

With the result registry, the use case is a two-liner instead of many lines:

@Reference ResultRegistry health;
health.put(this.getClass().getName() + ":eventProcessing",
    new Result(Result.Status.CRITICAL, "Event pool overflowing. Please identify the "
        + "cause and restart this JVM as soon as possible"), null);
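
For illustration only, a registry along these lines could be sketched as follows. This is not the proposed implementation: the class name, the Status enum, and the put signature are assumptions, including the idea that a null expiration means the entry stays until it is removed or the JVM restarts.

```java
import java.util.Calendar;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a registry with optional expiration; not the real API.
class SimpleResultRegistry {
    enum Status { OK, WARN, CRITICAL }

    private static final class Entry {
        final Status status;
        final String message;
        final Calendar expiration; // null: keep until removed or JVM restart

        Entry(Status status, String message, Calendar expiration) {
            this.status = status;
            this.message = message;
            // Copy so later changes to the caller's Calendar do not affect us.
            this.expiration = expiration == null ? null : (Calendar) expiration.clone();
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    void put(String key, Status status, String message, Calendar expiration) {
        entries.put(key, new Entry(status, message, expiration));
    }

    void remove(String key) {
        entries.remove(key);
    }

    // Drops entries that have expired as of "now" and returns the worst remaining status.
    Status overallStatus(Calendar now) {
        entries.values().removeIf(e -> e.expiration != null && !e.expiration.after(now));
        Status worst = Status.OK;
        for (Entry e : entries.values()) {
            if (e.status.compareTo(worst) > 0) {
                worst = e.status;
            }
        }
        return worst;
    }
}
```

A single health check backed by something like this could then report overallStatus on behalf of every caller, so the callers themselves only need the put call.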

Neither of these two examples is a metric; both are failing health checks. 
Each could be implemented with its own boilerplate, much like what the 
ResultRegistry provides.


My goal here is to require as little boilerplate as possible, and to lower the 
bar for engineers who have clear in-flight events that need to fail, so that 
they can report them quickly.



> Create ResultRegistry to provide health check behavior for executing code 
> that does not want a HealthCheck
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SLING-6855
>                 URL: https://issues.apache.org/jira/browse/SLING-6855
>             Project: Sling
>          Issue Type: New Feature
>          Components: Health Check
>            Reporter: Clinton H Goudie-Nice
>
> I want to provide a Registry service that can be leveraged to provide health 
> check results.
> These results can be for a period of time through an expiration, until the 
> JVM is restarted, or added and later removed.
> This can be useful when code observes a specific (possibly bad) state, and 
> wants to alert through the health check API that this state has taken place.
>  Some examples: 
>  An event pool has filled, and some events will be thrown away.
>   This is a failure case that requires a restart of the instance.
>   It would be appropriate to trigger a permanent failure.
>    
>  A quota has been tripped. This quota may immediately recover, but it is 
> sensible to alert for 30 minutes that the quota has been tripped.
>  If you expect the failure will clear itself within a certain window, setting 
> the expiration to that window can be ideal.
> GHPR to follow



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
