Try putting the whole alerting "expr" into the expression browser in the 
Prometheus web UI, and switching to graph view.

This will show you the alert vector graphically, with a separate line for 
each alert instance.  If it doesn't show multiple lines, then you won't 
receive multiple alerts.  You can then break the query down into its parts 
and try them individually to understand why it isn't behaving as you 
expect.

Looking at just part of your expression:

  sum(error_counter{service="myservice",other="labels"} unless
      error_counter{service="myservice",other="labels"} offset 1m) > 0

And taking just the part inside sum():

  error_counter{service="myservice",other="labels"} unless
  error_counter{service="myservice",other="labels"} offset 1m

This expression is weird.  It will only generate a value when the error 
counter first springs into existence.  As soon as it has existed for more 
than 1 minute - even with value zero - the "unless" clause will suppress 
the expression completely, i.e. the result will be an empty instance 
vector.
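
To make that concrete, here is a rough sketch (hypothetical timestamps, 
label matchers dropped for brevity, assuming a 10s scrape interval):

  # suppose the error_counter series first appears at 12:00:00
  error_counter unless error_counter offset 1m
  # evaluated at 12:00:30: "error_counter offset 1m" has no sample yet,
  #   so the result is just error_counter -> the alert can fire
  # evaluated at 12:01:30: "error_counter offset 1m" now returns a sample
  #   for the same labels, so the series is dropped -> empty vector, no alert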

I think this is probably not what you want.  In any case, it's not a good 
idea to have timeseries which come and go: it's very awkward to alert on a 
timeseries appearing or disappearing, and you may have problems with 
staleness, i.e. the timeseries may continue to exist for up to 5 minutes 
after you've stopped generating points in it.

It's much better to have a timeseries which exists continuously.  That is, 
"error_counter" should spring into existence with value 0, increment when 
errors occur, and when errors stop occurring simply keep the value it had 
before.

If your error_counter timeseries *does* exist continuously, then this 
'unless' clause is probably not what you want.
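
Assuming the counter does exist continuously, a rule along these lines is 
usually all you need (a minimal sketch only - I've kept your label matchers 
but renamed the alert, and the 2m range and threshold are just starting 
points to tune):

  groups:
  - name: Test alerts
    rules:
    - alert: MyServiceTestAlert
      expr: 'sum(increase(error_counter{service="myservice",other="labels"}[2m])) > 0'

This is essentially the second half of your existing expr: increase() is 
rate() multiplied by the range, so either works, and the alert stays firing 
for as long as new errors land inside the window and clears by itself once 
they stop.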

On Saturday, 25 June 2022 at 15:42:08 UTC+1 [email protected] wrote:

> Hello,
>
> I'm trying to set up some alerts that fire on critical errors, so I'm 
> aiming for immediate & consistent reporting for as much as possible.
>
> So for that matter, I defined the alert rule without a *for* clause:
>
> groups:
> - name: Test alerts
>   rules:
>   - alert: MyService Test Alert
>     expr: 'sum(error_counter{service="myservice",other="labels"} unless
>       error_counter{service="myservice",other="labels"} offset 1m) > 0
>       or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
>
> Prometheus is configured to scrape & evaluate at 10 s:
>
> global:
>   scrape_interval: 10s
>   scrape_timeout: 10s
>   evaluation_interval: 10s
>
> And the alert manager (docker image quay.io/prometheus/alertmanager:v0.23.0) 
> is configured with these parameters:
>
> route:
>   group_by: ['alertname', 'node_name']
>   group_wait: 30s
>   group_interval: 1m # used to be 5m
>   repeat_interval: 2m # used to be 3h
>
> Now what happens when testing is this:
> - on the very first metric generated, the alert fires as expected;
> - on subsequent tests it stops firing;
> - I kept running a new test each minute for 20 minutes, but no alert 
> fired again;
> - I can see the alert state going into *FIRING* in the alerts view in the 
> Prometheus UI;
> - I can see the metric values getting generated when executing the 
> expression query in the Prometheus UI.
>
> I redid the same test suite after a 2-hour break & exactly the same thing 
> happened, including the fact that the alert fired on the first test!
>
> What am I missing here? How can I make the alert manager fire that alert 
> on repeated error metric hits? OK, it doesn't have to be as quickly as 
> every 2 minutes, but let's consider that for testing's sake.
>
> Pretty please, any advice is much appreciated!
>
> Kind regards,
> Ionel
>
