Try putting the whole alerting "expr" into the PromQL query browser and
switching to graph view.
This will show you the alert vector graphically, with a separate line for
each alert instance. If it isn't showing multiple lines, then you won't
receive multiple alerts. You can then break the query down into parts and
try them individually to understand why it isn't behaving as you expect.
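For reference, here is the whole expression from your rule, just reformatted
across lines so the two halves either side of the "or" are easier to see:

    sum(error_counter{service="myservice",other="labels"} unless
        error_counter{service="myservice",other="labels"} offset 1m) > 0
    or
    sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0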
Looking at just part of your expression:

    sum(error_counter{service="myservice",other="labels"} unless
        error_counter{service="myservice",other="labels"} offset 1m) > 0

And taking just the part inside sum():

    error_counter{service="myservice",other="labels"} unless
    error_counter{service="myservice",other="labels"} offset 1m
This expression is odd. It will only produce a value when the error
counter first springs into existence. As soon as the series has existed for
more than 1 minute - even with value zero - the "unless" clause will
suppress the expression completely, i.e. the result will be an empty
instance vector.
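A quick way to see this in the graph view is to plot the two halves of the
"unless" separately (these are just your own selectors, split apart):

    error_counter{service="myservice",other="labels"}

    error_counter{service="myservice",other="labels"} offset 1m

Once both of them return data, the "unless" suppresses everything, the
sum() has nothing to sum, and that half of your alert expression is empty
again.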
I think this is probably not what you want. In any case it's not a good
idea to have timeseries which come and go; it's very awkward to alert on a
timeseries appearing or disappearing, and you may have problems with
staleness, i.e. the timeseries may continue to exist for 5 minutes after
you've stopped generating points in it.
It's much better to have a timeseries which exists continuously. That is,
"error_counter" should spring into existence with value 0, increment when
errors occur, and simply keep its previous value when they don't.
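With a counter like that, alerting on "errors happened recently" is
straightforward. For example (just a sketch, using increase() rather than
the rate() you already have; pick whatever window suits you):

    increase(error_counter{service="myservice",other="labels"}[2m]) > 0

This is non-empty whenever the counter has gone up inside the last 2
minutes, and with a 10s scrape interval there are always enough samples in
a 2m window for increase() to work.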
If your error_counter timeseries *does* exist continuously, then this
'unless' clause is probably not what you want.
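If it does, then the second half of your expression already covers "errors
occurred recently" on its own, so the expr could be reduced to just your
rate() branch (or the increase() form above):

    sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0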
On Saturday, 25 June 2022 at 15:42:08 UTC+1 [email protected] wrote:
> Hello,
>
> I'm trying to set up some alerts that fire on critical errors, so I'm
> aiming for immediate & consistent reporting for as much as possible.
>
> So for that matter, I defined the alert rule without a *for* clause:
>
>     groups:
>       - name: Test alerts
>         rules:
>           - alert: MyService Test Alert
>             expr: 'sum(error_counter{service="myservice",other="labels"} unless
>               error_counter{service="myservice",other="labels"} offset 1m) > 0 or
>               sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
>
> Prometheus is configured to scrape & evaluate at 10 s:
>
>     global:
>       scrape_interval: 10s
>       scrape_timeout: 10s
>       evaluation_interval: 10s
>
> And the alert manager (docker image quay.io/prometheus/alertmanager:v0.23.0)
> is configured with these parameters:
>
>     route:
>       group_by: ['alertname', 'node_name']
>       group_wait: 30s
>       group_interval: 1m  # used to be 5m
>       repeat_interval: 2m  # used to be 3h
>
> Now what happens when testing is this:
> - on the very first metric generated, the alert fires as expected;
> - on subsequent tests it stops firing;
> - *I kept on running a new test each minute for 20 minutes, but no alert
> fired again*;
> - I can see the alert state going into *FIRING* in the alerts view in the
> Prometheus UI;
> - I can see the metric values getting generated when executing the
> expression query in the Prometheus UI;
>
> Redid the same test suite after a 2 hour break & exactly the same thing
> happened, including the fact that *the alert fired on the first test*!
>
> What am I missing here? How can I make the alert manager fire that alert
> on repeated error metric hits? Ok, it doesn't have to be as soon as 2m, but
> let's consider that for testing's sake.
>
> Pretty please, any advice is much appreciated!
>
> Kind regards,
> Ionel
>