Hello,
I'm trying to set up some alerts that fire on critical errors, so I'm
aiming for reporting that is as immediate & consistent as possible.
To that end, I defined the alert rule without a *for* clause:
groups:
  - name: Test alerts
    rules:
      - alert: MyService Test Alert
        expr: 'sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0 or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
Prometheus is configured to scrape & evaluate every 10s:
global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 10s
And Alertmanager (Docker image *quay.io/prometheus/alertmanager:v0.23.0*) is
configured with these parameters:
route:
  group_by: ['alertname', 'node_name']
  group_wait: 30s
  group_interval: 1m   # used to be 5m
  repeat_interval: 2m  # used to be 3h
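I left the receiver section out of the snippet above; just to show where the
route sits, the overall file is shaped roughly like this (the receiver name &
webhook URL below are placeholders, not my real values):

route:
  receiver: 'my-receiver'              # placeholder name
  group_by: ['alertname', 'node_name']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m

receivers:
  - name: 'my-receiver'
    webhook_configs:
      - url: 'http://example.internal/alert-hook'   # placeholder endpoint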
Now, here is what happens when testing:
- on the very first metric generated, the alert fires as expected;
- on subsequent tests it stops firing;
- *I kept on running a new test each minute for 20 minutes, but no alert
fired again*;
- I can see the alert state going into *FIRING* in the alerts view in the
Prometheus UI;
- I can see the metric values getting generated when executing the
expression query in the Prometheus UI;
I redid the same test suite after a 2 hour break & exactly the same thing
happened, including the fact that *the alert fired on the first test!*
What am I missing here? How can I make Alertmanager fire that alert on
repeated error metric hits? OK, it doesn't have to be as often as every 2m,
but let's keep that value for testing's sake.
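In case it helps with the diagnosis, I understand the firing history can also
be graphed from Prometheus's built-in *ALERTS* series in the expression
browser; I can attach that output if it's useful:

# firing history of the alert as recorded by Prometheus itself
ALERTS{alertname="MyService Test Alert", alertstate="firing"}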
Pretty please, any advice is much appreciated!
Kind regards,
Ionel