It's correct for prometheus to send alerts to both alertmanagers, but I suspect you haven't got the alertmanagers clustered together correctly. See: https://prometheus.io/docs/alerting/latest/alertmanager/#high-availability

Make sure you've configured the cluster flags, and check your alertmanager container logs for messages relating to clustering or "gossip".
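In a Kubernetes-style deployment like yours (prometheus-alertmanager-0 / -1), that usually means giving each replica a --cluster.listen-address plus a --cluster.peer flag for its peer(s). A minimal sketch, assuming a headless service - the DNS names below are a guess based on your pod names, so adjust them to whatever your chart actually creates:

  # sketch only - the peer addresses are assumptions, not your real service names
  alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --cluster.listen-address=0.0.0.0:9094 \
    --cluster.peer=prometheus-alertmanager-0.prometheus-alertmanager-headless:9094 \
    --cluster.peer=prometheus-alertmanager-1.prometheus-alertmanager-headless:9094

Once the replicas are gossiping they deduplicate between themselves, so only one of them should actually deliver each notification. You can sanity-check this from the meta-metrics, e.g. curl localhost:9093/metrics | grep alertmanager_cluster - alertmanager_cluster_members should report 2 on both pods; if each pod only sees itself, you'll keep getting every notification twice.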
On Tuesday, 28 June 2022 at 16:12:53 UTC+1 [email protected] wrote:

> Hi Brian,
>
> So my previous assumption proved to be correct - it was in fact the alertmanager settings that weren't getting properly applied on the fly. Today I ensured they were applied in a guaranteed way & I can see the alerts firing every 6 minutes now, for these settings:
>
>   group_wait: 30s
>   group_interval: 2m
>   repeat_interval: 5m
>
> Now I'm trying to sort out the fact that the alerts fire twice each time. We have some form of HA in place, where we spawn 2 pods for the alertmanager & looking at their logs, I can see that each container fires the alert, which explains why I see 2 of them:
>
>   prometheus-alertmanager-0 level=debug ts=2022-06-28T14:27:40.121Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1
>   prometheus-alertmanager-1 level=debug ts=2022-06-28T14:27:40.418Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1
>
> Any idea why that is?
>
> Thank you!
>
> On Monday, 27 June 2022 at 17:20:29 UTC+1 Brian Candler wrote:
>
>> Look at container logs then.
>>
>> Metrics include things like the number of notifications attempted, succeeded and failed. Those would be the obvious first place to look. (For example: is it actually trying to send a mail? If so, is it succeeding or failing?)
>>
>> Aside: vector(0) and vector(1) are the same for generating alerts. It's only the presence of a value that triggers an alert; the actual value itself can be anything.
>>
>> On Monday, 27 June 2022 at 16:28:46 UTC+1 [email protected] wrote:
>>
>>> Ok, added a rule with an expression of vector(1), went live at 12:31, when it fired 2 alerts (?!), but then went completely silent until 15:36, when it fired again 2x (so more than 3 h in). The alert has been stuck in the FIRING state the whole time, as expected.
>>>
>>> Unfortunately the logs don't shed any light - there's nothing logged aside from the bootstrap logs. It isn't a systemd process - it's run in a container & there seems to be just a big executable in there.
>>>
>>> The meta-metrics contain quite a lot of data - any particulars I should be looking for?
>>>
>>> Either way, I'm now inclined to believe that this is definitely an alertmanager settings matter. As I was mentioning in my initial email, I've already tweaked group_wait, group_interval & repeat_interval, but they probably didn't take effect as I thought they would. So maybe that's something I need to sort out. And better logging should help understand all of that, which I still need to figure out how to do.
>>>
>>> Thank you very much for your help Brian!
>>>
>>> On Monday, 27 June 2022 at 09:59:59 UTC+1 Brian Candler wrote:
>>>
>>>> I suspect the easiest way to debug this is to focus on "repeat_interval: 2m". Even if a single alert is statically firing, you should get the same notification resent every 2 minutes. So don't worry about catching second instances of the same expr; just set a simple alerting expression which fires continuously, say just "expr: vector(0)", to find out why it's not resending.
>>>>
>>>> You can then look at logs from alertmanager (e.g. "journalctl -eu alertmanager" if running under systemd).
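For reference, a minimal always-firing test rule along those lines could look like this (the group and alert names are just placeholders):

  # sketch only - group/alert names are placeholders
  groups:
  - name: test
    rules:
    - alert: AlwaysFiring
      expr: vector(0)
      annotations:
        summary: Continuously-firing test alert

With group_wait: 30s and repeat_interval: 2m you'd then expect the notification to be re-sent roughly every 2 minutes for as long as the rule is loaded.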
>>>> You can also look at the metrics alertmanager itself generates:
>>>>
>>>>   curl localhost:9093/metrics | grep alertmanager
>>>>
>>>> Hopefully, one of these may give you a clue as to what's happening (e.g. maybe your mail system or other notification endpoint has some sort of rate limiting??).
>>>>
>>>> However, if the vector(0) expression does send repeated alerts successfully, then your problem is most likely something to do with your actual alerting expr, and you'll need to break it down into simpler pieces to debug it.
>>>>
>>>> Apart from that, all I can say is "it works for me™": if an alerting expression subsequently generates a second alert in its result vector, then I get another alert after group_interval.
>>>>
>>>> On Monday, 27 June 2022 at 09:39:45 UTC+1 [email protected] wrote:
>>>>
>>>>> Hi Brian,
>>>>>
>>>>> Thanks for your reply! To be honest, you can pretty much ignore that first part of the expression - it doesn't change anything in the "repeat" behaviour. In fact, we don't even have that bit at the moment; it's just something I've been playing with in order to capture the very first springing into existence of the metric, which isn't covered by the current expression, sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0.
>>>>>
>>>>> Also, I've already done the PromQL graphing that you suggested. I could see those multiple lines you were talking about, but then there was no alert firing... 🤷♂️
>>>>>
>>>>> Any other pointers?
>>>>>
>>>>> Thanks,
>>>>> Ionel
>>>>>
>>>>> On Saturday, 25 June 2022 at 16:52:17 UTC+1 Brian Candler wrote:
>>>>>
>>>>>> Try putting the whole alerting "expr" into the PromQL query browser, and switching to graph view.
>>>>>>
>>>>>> This will show you the alert vector graphically, with a separate line for each alert instance. If this isn't showing multiple lines, then you won't receive multiple alerts. Then you can break your query down into parts and try them individually, to try to understand why it's not working as you expect.
>>>>>>
>>>>>> Looking at just part of your expression:
>>>>>>
>>>>>>   sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0
>>>>>>
>>>>>> And taking just the part inside sum():
>>>>>>
>>>>>>   error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m
>>>>>>
>>>>>> This expression is weird. It will only generate a value when the error counter first springs into existence. As soon as it has existed for more than 1 minute - even with value zero - the "unless" clause will suppress the expression completely, i.e. it will be an empty instance vector.
>>>>>>
>>>>>> I think this is probably not what you want. In any case it's not a good idea to have timeseries which come and go; it's very awkward to alert on a timeseries appearing or disappearing, and you may have problems with staleness, i.e. the timeseries may continue to exist for 5 minutes after you've stopped generating points in it.
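To make that breakdown concrete, the two halves of the original expr behave quite differently if you graph them separately in the query browser, as suggested above:

  # only returns data during the first minute after error_counter first appears;
  # after that the "unless ... offset 1m" part suppresses it entirely
  error_counter{service="myservice",other="labels"}
    unless error_counter{service="myservice",other="labels"} offset 1m

  # returns data whenever the counter has increased over the last minute
  sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0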
>>>>>> It's much better to have a timeseries which continues to exist. That is, "error_counter" should spring into existence with value 0, increment when errors occur, and stop incrementing when errors don't occur - but continue to keep the value it had before.
>>>>>>
>>>>>> If your error_counter timeseries does exist continuously, then this 'unless' clause is probably not what you want.
>>>>>>
>>>>>> On Saturday, 25 June 2022 at 15:42:08 UTC+1 [email protected] wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm trying to set up some alerts that fire on critical errors, so I'm aiming for immediate & consistent reporting as much as possible.
>>>>>>>
>>>>>>> For that matter, I defined the alert rule without a "for" clause:
>>>>>>>
>>>>>>>   groups:
>>>>>>>   - name: Test alerts
>>>>>>>     rules:
>>>>>>>     - alert: MyService Test Alert
>>>>>>>       expr: 'sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0 or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
>>>>>>>
>>>>>>> Prometheus is configured to scrape & evaluate at 10 s:
>>>>>>>
>>>>>>>   global:
>>>>>>>     scrape_interval: 10s
>>>>>>>     scrape_timeout: 10s
>>>>>>>     evaluation_interval: 10s
>>>>>>>
>>>>>>> And the alertmanager (docker image quay.io/prometheus/alertmanager:v0.23.0) is configured with these parameters:
>>>>>>>
>>>>>>>   route:
>>>>>>>     group_by: ['alertname', 'node_name']
>>>>>>>     group_wait: 30s
>>>>>>>     group_interval: 1m  # used to be 5m
>>>>>>>     repeat_interval: 2m # used to be 3h
>>>>>>>
>>>>>>> Now what happens when testing is this:
>>>>>>> - on the very first metric generated, the alert fires as expected;
>>>>>>> - on subsequent tests it stops firing;
>>>>>>> - I kept on running a new test each minute for 20 minutes, but no alert fired again;
>>>>>>> - I can see the alert state going into FIRING in the alerts view in the Prometheus UI;
>>>>>>> - I can see the metric values getting generated when executing the expression query in the Prometheus UI.
>>>>>>>
>>>>>>> I redid the same test suite after a 2 hour break & exactly the same thing happened, including the fact that the alert fired on the first test!
>>>>>>>
>>>>>>> What am I missing here? How can I make the alertmanager fire that alert on repeated error metric hits? Ok, it doesn't have to be as soon as 2m, but let's consider that for testing's sake.
>>>>>>>
>>>>>>> Pretty please, any advice is much appreciated!
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Ionel

