I can see two possible issues here.
Firstly, the value of the annotation you see in the resolved message is
always the value from *before* the alert resolved, not the value
which is now below the threshold.
Let me simplify your expression to:
foo > 85
This is a PromQL filter. In general there could be many timeseries for
metric "foo". If you have ten timeseries, and two of them have values over
85, then the result of this expression is those two timeseries, with their
labels and those two values above 85. But if all the timeseries are below
85, then this expression returns no timeseries, and therefore it has no
values.
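To make that filtering behaviour concrete, here's a sketch of it in Python
(with made-up series and labels - this is illustrative, not real Prometheus
code):

```python
# Ten hypothetical "foo" timeseries, each with labels and a current value.
series = [
    {"labels": {"instance": f"host{i}"}, "value": v}
    for i, v in enumerate([10, 20, 90, 30, 95, 40, 50, 60, 70, 80])
]

# "foo > 85" keeps only the series whose value exceeds 85,
# carrying their labels and values through unchanged.
result = [s for s in series if s["value"] > 85]
# -> two series: host2 (value 90) and host4 (value 95)

# If every series were at or below the threshold, the expression
# would return no timeseries at all - an empty result, no values,
# no labels.
all_below = [s for s in series if s["value"] > 1000]
# -> []
```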
So: suppose one "foo" timeseries goes up to 90 for long enough to trigger
the alert (for: 2m). You will get an alert with annotation:
description: Current value = 90
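For reference, the kind of rule I'm describing would look something like
this (the alert name and annotation wording here are just placeholders):

```yaml
groups:
  - name: example
    rules:
      - alert: FooTooHigh
        expr: foo > 85
        for: 2m
        annotations:
          description: 'Current value = {{ $value }}'
```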
Maybe then it goes up to 95 for a while. You don't get a new notification
except in certain circumstances (group_interval etc).
When the value of foo drops below the threshold, say to 70, then the alert
ceases to exist. Alertmanager sends out a "resolved" message with all the
labels and annotations of the alert as it was *when it last existed*, i.e.
description: Current value = 95
There's nothing else it can do. The "expr" in the alerting rule returns no
timeseries, which means no values and no labels. You can't create an
annotation for an alert that doesn't exist.
It's for this reason that I removed all my alert annotations which had
$value in them, since the Resolved messages are confusing. However you
could instead change them to something more verbose, e.g.
description: Most recent triggering value = 95
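That is, in the rule's annotations, something like this (assuming the usual
{{ $value }} templating):

```yaml
annotations:
  description: 'Most recent triggering value = {{ $value }}'
```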
The second issue: is it possible the value dipped below the threshold for a
single rule evaluation interval?
Prometheus does debouncing in one direction (the alert must be constantly
active "for: 2m" before it goes from Pending into Firing), but not in the
other direction. A single dip below the threshold and it will resolve
immediately, and then it could go into Pending then Firing again. You
would see that as a resolved followed by a new alert.
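Here's a rough sketch of those state transitions in Python, assuming a 1m
evaluation interval so that "for: 2m" means two extra consecutive breaches
before firing (again, illustrative only - the real Prometheus logic tracks
durations, not counts):

```python
def alert_states(values, threshold=85, for_intervals=2):
    """Approximate Prometheus alert states for one series over time."""
    states, active_for = [], 0
    for v in values:
        if v > threshold:
            active_for += 1
            # Debounce on the way up: must stay active for the whole
            # "for" duration before moving from Pending to Firing.
            states.append("firing" if active_for > for_intervals else "pending")
        else:
            # No debounce on the way down: a single reading below the
            # threshold resolves the alert immediately.
            active_for = 0
            states.append("inactive")
    return states

# One dip to 70 resolves immediately, then the alert has to sit in
# Pending again before re-firing - seen as a resolve plus a new alert.
print(alert_states([90, 90, 90, 70, 90, 90, 90]))
# -> ['pending', 'pending', 'firing', 'inactive', 'pending', 'pending', 'firing']
```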
There is a closed issue for alertmanager debouncing / flap detection here:
https://github.com/prometheus/alertmanager/issues/204
Personally I think prometheus itself should have a "Resolving" state
analogous to "Pending", so a brief trip below the threshold doesn't
instantly resolve - but like I say, that issue is closed.
HTH,
Brian.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/6112779d-7e79-45f9-8fd7-6e73236651fa%40googlegroups.com.