Hi,
The goal is to declare health check results that remain valid for a
specified time or forever.
So I agree that metrics as proposed in comment [1] cannot achieve this
(they are limited to 1, 5 and 15 minute time windows). However, I still
think a purely declarative approach is cleaner and will lead to more
consistency across HCs: we could introduce an HC property
"hc.keepWarnStickyForMin" (and "hc.keepCriticalStickyForMin"). This can
be implemented entirely in the impl package and would not require a new
API. For the "event queue overflowed" example, the property
hc.keepWarnStickyForMin=Integer.MAX_VALUE could be set; the HC executor
could then append a result as follows:
INFO Checking Event Queue...
INFO Event Queue is currently fine.
WARN --- Sticky result from 2017-06-07 11:49 ---
INFO Checking Event Queue...
WARN Event Queue overloaded!
This means the full log of both the current result and a historic sticky
result would be shown (the timeout handling already works similarly: if
an HC times out, the last available HC result is shown). The HC executor
has all the necessary metadata (the time is recorded in the execution
result), so this would be easy to add. The best thing about this is that
you can change the sticky time and the "stickiness" by configuration
only - no redeployment needed :)
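
To make the idea concrete, here is a rough sketch of the decision the
executor would have to take. Everything in it is hypothetical: the class,
the Status enum and the Execution holder are stand-ins I made up for
illustration, not the existing Sling HC types, and only the two property
names come from the proposal above.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch; none of these types are part of the Sling HC API. */
final class StickySketch {

    /** Ordered by severity so that compareTo() can rank two statuses. */
    enum Status { OK, WARN, CRITICAL }

    /** Minimal stand-in for an execution result: a status plus its finish time. */
    static final class Execution {
        final Status status;
        final Instant finishedAt;

        Execution(Status status, Instant finishedAt) {
            this.status = status;
            this.finishedAt = finishedAt;
        }
    }

    /**
     * Returns the status to report: the current one, or the last non-OK one
     * if that is more severe and still inside its sticky window.
     */
    static Status effectiveStatus(Execution current, Execution lastNonOk,
                                  long keepWarnStickyForMin,
                                  long keepCriticalStickyForMin) {
        if (lastNonOk == null || lastNonOk.status == Status.OK) {
            return current.status;
        }
        long stickyMin = lastNonOk.status == Status.CRITICAL
                ? keepCriticalStickyForMin
                : keepWarnStickyForMin;
        // Compare elapsed minutes against the window rather than computing an
        // expiry instant, so a window of Integer.MAX_VALUE cannot overflow.
        long elapsedMin = Duration
                .between(lastNonOk.finishedAt, current.finishedAt)
                .toMinutes();
        boolean stillSticky = elapsedMin < stickyMin;
        boolean moreSevere = lastNonOk.status.compareTo(current.status) > 0;
        return (stillSticky && moreSevere) ? lastNonOk.status : current.status;
    }
}
```

With a 30 minute window, a WARN from 45 minutes ago no longer sticks; with
Integer.MAX_VALUE it effectively sticks forever. Appending the historic
log to the current one is left out of the sketch.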
WDYT?
Best Regards
Georg
[1] https://issues.apache.org/jira/browse/SLING-6855?focusedCommentId=16010189&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16010189
For example, a quota has been tripped - warn for 30 minutes.
Or an event queue has overflowed and the instance is considered damaged -
raise a critical alarm forever.
With SLING-6855 as it currently stands, one can raise such alarms, but
they are all grouped in a single health check. For example, the
following results in that one HC having both the A and B tags and
returning two results:
ResultRegistry reg = sling.getService(ResultRegistry.class);
reg.put("testA", new Result(Result.Status.CRITICAL, "It's critical"), null, "A");
reg.put("testB", new Result(Result.Status.WARN, "B is just a warning"), null, "B");
So if you query for tag B you get both results, although they are
unrelated.
I would prefer to create one HC for each such alarm, and to rename the
service from ResultRegistry to StickyResults.
So the above example (with service interface renamed) would cause two
HCs to be created:
1) StickyResult (testA); status CRITICAL, message "It's critical", tag A
2) StickyResult (testB); status WARN, message "B is just a warning", tag B
The HCs are keyed by the "identifier" parameter, so in the above
example putting another "testB" overwrites the existing one.
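
A minimal sketch of the keying behaviour I have in mind, assuming one
entry per identifier stands for one generated HC. The class and its
methods are hypothetical and simplified (a single tag per entry, no
expiry argument); this is not the SLING-6855 API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/** Hypothetical sketch of the renamed service; not the SLING-6855 API. */
final class StickyResultsSketch {

    enum Status { WARN, CRITICAL }

    /** One entry stands for one generated health check. */
    static final class Entry {
        final String identifier;
        final Status status;
        final String message;
        final String tag;

        Entry(String identifier, Status status, String message, String tag) {
            this.identifier = identifier;
            this.status = status;
            this.message = message;
            this.tag = tag;
        }
    }

    // Keyed by identifier, so a second put() with the same identifier
    // replaces the earlier entry instead of adding another one.
    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    void put(String identifier, Status status, String message, String tag) {
        entries.put(identifier, new Entry(identifier, status, message, tag));
    }

    /** Returns only the entries carrying the given tag. */
    List<Entry> byTag(String tag) {
        return entries.values().stream()
                .filter(e -> e.tag.equals(tag))
                .collect(Collectors.toList());
    }
}
```

Because each entry carries its own tag, querying for tag B returns only
testB, and putting "testB" again replaces it rather than adding a third
result.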
Clint and others, WDYT?
-Bertrand