[ https://issues.apache.org/jira/browse/SLING-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010708#comment-16010708 ]
Clinton H Goudie-Nice edited comment on SLING-6855 at 5/15/17 3:33 PM:
-----------------------------------------------------------------------

[~bdelacretaz]
> I'm missing a way to select the results stored in the registry by tags, how
> do you see that working?

This is a really good point that I haven't considered. I will take a look today.

[~henzlerg]
> Could it be easier/better to just record metrics?

I want to approach the notion of mixing metrics and health checks with caution.

The driving use case for this is a thread about unindexed JCR queries. In the event a node traversal is encountered, the system should warn about it for 20 minutes (or so) to ensure a human looks at the unindexed query and builds an index, or changes the query itself. This might sound like a metric, but it's actually an event that damages system performance.

This service could:
1) Implement a health check of its own.
2) Keep a timestamp of the last time an unindexed query was tripped, with some descriptive information.
3) Report a failure if the timestamp is before now.

It could have its own implementation of this ResultRegistry, and this boilerplate will be duplicated many, many times across OAK, Sling, etc.

I find this a clearly generalizable pattern. With the registry, all I need is

@Reference
ResultRegistry health;

and then I am able to easily call

Calendar c = Calendar.getInstance();
c.add(Calendar.HOUR, 1);
health.put(this.getClass().getName() + ":myMethod", new Result(Result.Status.WARN, "Unindexed query {somequery} encountered"), c);

An additional use case: if the event queue for Sling or OAK overflows, we experience data loss, and performance greatly degrades. With the result registry, the use case is a two-liner instead of many lines:

@Reference
ResultRegistry health;

health.put(this.getClass().getName() + ":eventProcessing", new Result(Result.Status.CRITICAL, "Event pool overflowing. 
Please identify the cause and restart this JVM as soon as possible"), null);

Neither of these two examples is a metric; both are failing health checks. Each could be implemented with its own boilerplate, much like the ResultRegistry.

My goal here is to require as little boilerplate as possible, and to lower the bar for engineers who have clear in-flight events that need to fail, so that they can report them quickly.


> Create ResultRegistry to provide health check behavior for executing code that does not want a HealthCheck
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SLING-6855
>                 URL: https://issues.apache.org/jira/browse/SLING-6855
>             Project: Sling
>          Issue Type: New Feature
>          Components: Health Check
>            Reporter: Clinton H Goudie-Nice
>
> I want to provide a Registry service that can be leveraged to provide health
> check results.
> These results can be for a period of time through an expiration, until the
> JVM is restarted, or added and later removed.
> This can be useful when code observes a specific (possibly bad) state, and
> wants to alert through the health check API that this state has taken place.
> Some examples:
> An event pool has filled, and some events will be thrown away.
> This is a failure case that requires a restart of the instance.
> It would be appropriate to trigger a permanent failure.
>
> A quota has been tripped. This quota may immediately recover, but it is
> sensible to alert for 30 minutes that the quota has been tripped.
> If you expect the failure will clear itself within a certain window, setting
> the expiration to that window can be ideal.
> GHPR to follow
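
The three retention modes named in the issue description (expiring after a window, lasting until the JVM restarts, or added and later removed) could be sketched roughly as below. This is only an illustration under stated assumptions, not the proposed implementation: SketchResult and SketchResultRegistry are hypothetical names, the put/remove/activeResults signatures are made up for this example, and the clock is passed in explicitly to keep the sketch easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a health check result; not the Sling HC Result class. */
final class SketchResult {
    enum Status { OK, WARN, CRITICAL }

    final Status status;
    final String message;

    SketchResult(Status status, String message) {
        this.status = status;
        this.message = message;
    }
}

/** Hypothetical in-memory registry; all names and signatures are assumptions. */
final class SketchResultRegistry {

    private static final class Entry {
        final SketchResult result;
        final Long expiresAtMillis; // null: keep until removed or the JVM restarts

        Entry(SketchResult result, Long expiresAtMillis) {
            this.result = result;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /** Registers (or overwrites) a result under the given key. */
    void put(String key, SketchResult result, Long expiresAtMillis) {
        entries.put(key, new Entry(result, expiresAtMillis));
    }

    /** Removes a result that was added earlier, if present. */
    void remove(String key) {
        entries.remove(key);
    }

    /** Drops expired entries, then returns the results still in effect at nowMillis. */
    List<SketchResult> activeResults(long nowMillis) {
        entries.values().removeIf(
                e -> e.expiresAtMillis != null && e.expiresAtMillis <= nowMillis);
        List<SketchResult> active = new ArrayList<>();
        for (Entry e : entries.values()) {
            active.add(e.result);
        }
        return active;
    }
}
```

A health check backed by such a registry would call activeResults with the current time and report the most severe status it finds. Passing null as the expiration models the "permanent failure" case, since nothing clears the entry short of remove or a restart.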