Thanks a lot for the detailed explanation, Brian. I guess I need to monitor the resolved alerts a bit more closely and then take a call.
On Saturday, April 18, 2020 at 3:16:56 PM UTC+5:30, Brian Candler wrote:
> I can see two possible issues here.
>
> Firstly, the value of the annotation you see in the resolved message is
> always the value at the time *before* the alert resolved, not the value
> which is now below the threshold.
>
> Let me simplify your expression to:
>
> foo > 85
>
> This is a PromQL filter. In general there could be many timeseries for
> metric "foo". If you have ten timeseries, and two of them have values over
> 85, then the result of this expression is those two timeseries, with their
> labels and those two values above 85. But if all the timeseries are below
> 85, then this expression returns no timeseries, and therefore it has no
> values.
>
> So: suppose one "foo" timeseries goes up to 90 for long enough to trigger
> the alert (for: 2m). You will get an alert with annotation:
>
> description: Current value = 90
>
> Maybe then it goes up to 95 for a while. You don't get a new notification
> except in certain circumstances (group_interval etc).
>
> When the value of foo drops below the threshold, say to 70, then the alert
> ceases to exist. Alertmanager sends out a "resolved" message with all the
> labels and annotations of the alert as it was *when it last existed*, i.e.
>
> description: Current value = 95
>
> There's nothing else it can do. The "expr" in the alerting rule returns
> no timeseries, which means no values and no labels. You can't create an
> annotation for an alert that doesn't exist.
>
> It's for this reason that I removed all my alert annotations which had
> $value in them, since the Resolved messages are confusing. However you
> could instead change them to something more verbose, e.g.
>
> description: Most recent triggering value = 95
>
> The second issue is: is it possible the value dipped below the threshold
> for one rule evaluation interval?
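Putting that advice together, a rule along these lines could be used. This is only a sketch: the metric, group name, and alert name are placeholders of mine, not from the thread.

```yaml
groups:
  - name: example        # hypothetical group name
    rules:
      - alert: FooHigh   # hypothetical alert name
        expr: foo > 85
        for: 2m
        annotations:
          # Worded so that a "resolved" notification, which re-sends the
          # last firing value rather than the current one, is not misread:
          description: "Most recent triggering value = {{ $value }}"
```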
> Prometheus does debouncing in one direction (the alert must be constantly
> active "for: 2m" before it goes from Pending into Firing), but not in the
> other direction. A single dip below the threshold and it will resolve
> immediately, and then it could go into Pending then Firing again. You
> would see that as a resolved followed by a new alert.
>
> There is a closed issue for alertmanager debouncing / flap detection here:
> https://github.com/prometheus/alertmanager/issues/204
>
> Personally I think prometheus itself should have a "Resolving" state
> analogous to "Pending", so a brief trip below the threshold doesn't
> instantly resolve - but like I say, that issue is closed.
>
> HTH,
>
> Brian.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f5b9954e-10c0-4021-a95b-a819f1beb200%40googlegroups.com.
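Since that issue is closed, one workaround sometimes used in practice (this is not from the thread; metric and names are placeholders) is to smooth the expression itself with `max_over_time`, so a single scrape below the threshold does not instantly resolve the alert:

```yaml
groups:
  - name: example
    rules:
      - alert: FooHigh
        # max_over_time keeps the expression above 85 as long as ANY
        # sample in the trailing 5m window was above 85, giving a
        # "Resolving"-like grace period before the alert clears.
        expr: max_over_time(foo[5m]) > 85
        for: 2m
```

The trade-off is that resolution is delayed by up to the window length (5m here) even when the metric has genuinely recovered.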

