Thanks a lot for the detailed explanation, Brian. I guess I need to monitor the resolved alerts a bit more closely and then take a call.
On Saturday, April 18, 2020 at 3:16:56 PM UTC+5:30, Brian Candler wrote:
> I can see two possible issues here.
>
> Firstly, the value of the annotation you see in the resolved message is
> always the value at the time *before* the alert resolved, not the value
> which is now below the threshold.
>
> Let me simplify your expression to:
>
> foo > 85
>
> This is a PromQL filter. In general there could be many timeseries for
> metric "foo". If you have ten timeseries, and two of them have values over
> 85, then the result of this expression is those two timeseries, with their
> labels and those two values above 85. But if all the timeseries are below
> 85, then this expression returns no timeseries, and therefore it has no
> values.
>
> So: suppose one "foo" timeseries goes up to 90 for long enough to trigger
> the alert (for: 2m). You will get an alert with annotation:
>
> description: Current value = 90
>
> Maybe then it goes up to 95 for a while. You don't get a new notification
> except in certain circumstances (group_interval etc).
>
> When the value of foo drops below the threshold, say to 70, then the alert
> ceases to exist. Alertmanager sends out a "resolved" message with all the
> labels and annotations of the alert as it was *when it last existed*, i.e.
>
> description: Current value = 95
>
> There's nothing else it can do. The "expr" in the alerting rule returns
> no timeseries, which means no values and no labels. You can't create an
> annotation for an alert that doesn't exist.
>
> It's for this reason that I removed all my alert annotations which had
> $value in them, since the Resolved messages are confusing. However you
> could instead change them to something more verbose, e.g.
>
> description: Most recent triggering value = 95
>
> The second issue is: is it possible the value dipped below the threshold
> for one rule evaluation interval?
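Putting that advice together, a rule along these lines could be used. This is only a sketch: the metric, group name, and alert name are placeholders of mine, not from the thread.

```yaml
groups:
  - name: example        # hypothetical group name
    rules:
      - alert: FooHigh   # hypothetical alert name
        expr: foo > 85
        for: 2m
        annotations:
          # Worded so that a "resolved" notification, which re-sends the
          # last firing value rather than the current one, is not misread:
          description: "Most recent triggering value = {{ $value }}"
```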
> Prometheus does debouncing in one direction (the alert must be constantly
> active "for: 2m" before it goes from Pending into Firing), but not in the
> other direction. A single dip below the threshold and it will resolve
> immediately, and then it could go into Pending then Firing again. You
> would see that as a resolved followed by a new alert.
>
> There is a closed issue for alertmanager debouncing / flap detection here:
> https://github.com/prometheus/alertmanager/issues/204
>
> Personally I think prometheus itself should have a "Resolving" state
> analogous to "Pending", so a brief trip below the threshold doesn't
> instantly resolve - but like I say, that issue is closed.
>
> HTH,
>
> Brian.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f5b9954e-10c0-4021-a95b-a819f1beb200%40googlegroups.com.
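Since that issue is closed, one workaround sometimes used in practice (this is not from the thread; metric and names are placeholders) is to smooth the expression itself with `max_over_time`, so a single scrape below the threshold does not instantly resolve the alert:

```yaml
groups:
  - name: example
    rules:
      - alert: FooHigh
        # max_over_time keeps the expression above 85 as long as ANY
        # sample in the trailing 5m window was above 85, giving a
        # "Resolving"-like grace period before the alert clears.
        expr: max_over_time(foo[5m]) > 85
        for: 2m
```

The trade-off is that resolution is delayed by up to the window length (5m here) even when the metric has genuinely recovered.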

