[
https://issues.apache.org/jira/browse/SLING-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010708#comment-16010708
]
Clinton H Goudie-Nice commented on SLING-6855:
----------------------------------------------
[~bdelacretaz]
> I'm missing a way to select the results stored in the registry by tags, how
> do you see that working?
[~henzlerg]
This is a really good point that I haven't considered. I will take a look today.
> Could it be easier/better to just record metrics?
I want to approach the notion of mixing metrics and health checks with caution.
The driving use case for this is a thread about unindexed JCR queries. In the
event a node traversal is encountered, the system should warn about it for 20
minutes (or so) to ensure a human looks at the unindexed query and builds an
index, or changes the query itself.
This might sound like a metric, but it's actually an event that damages system
performance.
This service could:
1) Implement a health check of its own.
2) Keep a timestamp of the last time an unindexed query was tripped, with some
descriptive information.
3) Report a failure while that timestamp is still within the warning window
(the 20 minutes or so mentioned above).
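As a rough illustration of the boilerplate those three steps imply for each
service, here is a minimal sketch. It is self-contained and deliberately does
not use the Sling HealthCheck API: the class name UnindexedQueryCheck, its
Status enum, and the recordTraversal hook are all hypothetical, and the
20-minute window is simply passed in by the caller.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for the hand-written check; not part of Sling or Oak.
class UnindexedQueryCheck {
    enum Status { OK, WARN }

    private final Clock clock;
    private final Duration warnWindow;
    private Instant lastSeen;   // when a traversal was last observed, null if never
    private String lastQuery;   // descriptive information for the human reader

    UnindexedQueryCheck(Clock clock, Duration warnWindow) {
        this.clock = clock;
        this.warnWindow = warnWindow;
    }

    // Assumed to be called by whatever code detects the node traversal.
    synchronized void recordTraversal(String query) {
        lastSeen = clock.instant();
        lastQuery = query;
    }

    // WARN while the last traversal is younger than the window, OK otherwise.
    synchronized Status status() {
        if (lastSeen == null) {
            return Status.OK;
        }
        boolean stillWarning = clock.instant().isBefore(lastSeen.plus(warnWindow));
        return stillWarning ? Status.WARN : Status.OK;
    }

    synchronized String message() {
        return status() == Status.WARN
            ? "Unindexed query " + lastQuery + " encountered at " + lastSeen
            : "No recent unindexed queries";
    }
}
```

Every service that wants this behavior would need its own copy of roughly this
much state and time handling, before any wiring into a real health check.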
It could have its own implementation of what this ResultRegistry does, but that
boilerplate would be duplicated many, many times across OAK, Sling, etc.
I find this a clearly generalizable pattern. With the registry, all I need is
@Reference ResultRegistry health;
and then I am able to easily call
Calendar c = Calendar.getInstance();
c.add(Calendar.HOUR, 1);
health.put(this.getClass().getName() + ":myMethod",
    new Result(Result.Status.WARN, "Unindexed query {somequery} encountered"));
An additional use case: if the event queue for Sling or OAK overflows, we
experience data loss, and performance greatly degrades.
With the result registry, the use case is a two-liner instead of many lines:
@Reference ResultRegistry health;
health.put(this.getClass().getName() + ":eventProcessing",
    new Result(Result.Status.CRITICAL, "Event pool overflowing. Please identify the "
        + "cause and restart this JVM as soon as possible", null));
Neither of these two examples is a metric; both are failing health checks.
Each could be implemented with its own boilerplate that looks much like the
ResultRegistry.
My goal here is to require as little boilerplate as possible, and to lower the
bar for engineers who have clear in-flight events that need to fail, so that
they can report them quickly.
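To make the idea concrete, here is a minimal sketch of what such a registry
could look like. It is an illustration only and not the API proposed in
SLING-6855: the class name SketchResultRegistry, its Status enum, the put and
remove signatures, and the overallStatus method are all assumptions, and it
uses java.time instead of Calendar. A null expiration is assumed to mean "until
removed or until the JVM restarts".

```java
import java.time.Clock;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names and signatures are assumptions, not Sling API.
class SketchResultRegistry {
    // Declared from least to most severe so ordinal order reflects severity.
    enum Status { OK, WARN, CRITICAL }

    // One reported result; a null expiresAt means it never expires on its own.
    private static final class Entry {
        final Status status;
        final String message;
        final Instant expiresAt;

        Entry(Status status, String message, Instant expiresAt) {
            this.status = status;
            this.message = message;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Clock clock;

    SketchResultRegistry(Clock clock) {
        this.clock = clock;
    }

    // Adds or replaces the result stored under the given key.
    void put(String key, Status status, String message, Instant expiresAt) {
        entries.put(key, new Entry(status, message, expiresAt));
    }

    // Supports the "added and later removed" case.
    void remove(String key) {
        entries.remove(key);
    }

    // Drops expired entries, then returns the worst remaining status.
    Status overallStatus() {
        Instant now = clock.instant();
        entries.values().removeIf(e -> e.expiresAt != null && !now.isBefore(e.expiresAt));
        Status worst = Status.OK;
        for (Entry e : entries.values()) {
            if (e.status.compareTo(worst) > 0) {
                worst = e.status;
            }
        }
        return worst;
    }
}
```

With something of this shape, calling code only needs one put per observed
event, and a single health check can report on the registry as a whole.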
> Create ResultRegistry to provide health check behavior for executing code
> that does not want a HealthCheck
> ----------------------------------------------------------------------------------------------------------
>
> Key: SLING-6855
> URL: https://issues.apache.org/jira/browse/SLING-6855
> Project: Sling
> Issue Type: New Feature
> Components: Health Check
> Reporter: Clinton H Goudie-Nice
>
> I want to provide a Registry service that can be leveraged to provide health
> check results.
> These results can be for a period of time through an expiration, until the
> JVM is restarted, or added and later removed.
> This can be useful when code observes a specific (possibly bad) state, and
> wants to alert through the health check API that this state has taken place.
> Some examples:
> An event pool has filled, and some events will be thrown away.
> This is a failure case that requires a restart of the instance.
> It would be appropriate to trigger a permanent failure.
>
> A quota has been tripped. This quota may immediately recover, but it is
> sensible to alert for 30 minutes that the quota has been tripped.
> If you expect the failure will clear itself within a certain window, setting
> the expiration to that window can be ideal.
> GHPR to follow