Hi Brian,

Indeed, that was the issue this time - we didn't have HA properly configured. All seems to work fine after adjusting accordingly. Thank you very much!
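For anyone hitting the same duplicate-notification symptom, a minimal sketch of what a properly clustered pair of alertmanagers looks like - the two replicas gossip with each other via the --cluster.* flags, while Prometheus keeps sending alerts to both. The flag values and hostnames below are illustrative, not taken from the thread:

  # Each alertmanager replica (hostnames are examples only):
  alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --cluster.listen-address=0.0.0.0:9094 \
    --cluster.peer=prometheus-alertmanager-0.alertmanager-headless:9094 \
    --cluster.peer=prometheus-alertmanager-1.alertmanager-headless:9094

  # prometheus.yml - Prometheus should still target *both* replicas:
  alerting:
    alertmanagers:
      - static_configs:
          - targets:
              - prometheus-alertmanager-0.alertmanager-headless:9093
              - prometheus-alertmanager-1.alertmanager-headless:9093

Once the replicas can see each other, they gossip about notifications that have already been sent, so normally only one of them actually delivers the Slack message.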
On Tuesday, 28 June 2022 at 19:02:26 UTC+1 Brian Candler wrote:

> It's correct for prometheus to send alerts to both alertmanagers, but I suspect you haven't got the alertmanagers clustered together correctly. See: https://prometheus.io/docs/alerting/latest/alertmanager/#high-availability
>
> Make sure you've configured the cluster flags, and check your alertmanager container logs for messages relating to clustering or "gossip".
>
> On Tuesday, 28 June 2022 at 16:12:53 UTC+1 [email protected] wrote:
>
>> Hi Brian,
>>
>> So my previous assumption proved to be correct - it was in fact the alertmanager settings that weren't getting properly applied on the fly. Today I ensured they were applied in a guaranteed way & I can see the alerts firing every 6 minutes now, for these settings:
>>
>>   group_wait: 30s
>>   group_interval: 2m
>>   repeat_interval: 5m
>>
>> Now I'm trying to sort out the fact that the alerts fire twice each time. We have some form of HA in place, where we spawn 2 pods for the alertmanager & looking at their logs, I can see that each container fires the alert, which explains why I see 2 of them:
>>
>>   prometheus-alertmanager-0 level=debug ts=2022-06-28T14:27:40.121Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1
>>   prometheus-alertmanager-1 level=debug ts=2022-06-28T14:27:40.418Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1
>>
>> Any idea why that is?
>>
>> Thank you!
>>
>> On Monday, 27 June 2022 at 17:20:29 UTC+1 Brian Candler wrote:
>>
>>> Look at container logs then.
>>>
>>> Metrics include things like the number of notifications attempted, succeeded and failed. Those would be the obvious first place to look. (For example: is it actually trying to send a mail? If so, is it succeeding or failing?)
>>>
>>> Aside: vector(0) and vector(1) are the same for generating alerts. It's only the presence of a value that triggers an alert; the actual value itself can be anything.
>>>
>>> On Monday, 27 June 2022 at 16:28:46 UTC+1 [email protected] wrote:
>>>
>>>> Ok, added a rule with an expression of *vector(1)*, went live at 12:31, when it fired 2 alerts (?!), but then went completely silent until 15:36, when it fired again 2x (so more than 3 h in). The alert has been stuck in the *FIRING* state the whole time, as expected.
>>>>
>>>> Unfortunately the logs don't shed any light - there's nothing logged aside from the bootstrap logs. It isn't a systemd process - it's run in a container & there seems to be just a big executable in there.
>>>>
>>>> The meta-metrics contain quite a lot of data - any particulars I should be looking for?
>>>>
>>>> Either way, I'm now inclined to believe that this is definitely an *alertmanager* setting matter. As I was mentioning in my initial email, I've already tweaked *group_wait*, *group_interval* & *repeat_interval*, but they probably didn't take effect as I thought they would. So maybe that's something I need to sort out. And better logging should help me understand all of that - I still need to figure out how to set that up.
>>>>
>>>> Thank you very much for your help Brian!
>>>>
>>>> On Monday, 27 June 2022 at 09:59:59 UTC+1 Brian Candler wrote:
>>>>
>>>>> I suspect the easiest way to debug this is to focus on "*repeat_interval: 2m*". Even if a single alert is statically firing, you should get the same notification resent every 2 minutes. So don't worry about catching second instances of the same expr; just set a simple alerting expression which fires continuously, say just "expr: vector(0)", to find out why it's not resending.
>>>>>
>>>>> You can then look at logs from alertmanager (e.g. "journalctl -eu alertmanager" if running under systemd). You can also look at the metrics alertmanager itself generates:
>>>>>
>>>>>   curl localhost:9093/metrics | grep alertmanager
>>>>>
>>>>> Hopefully, one of these may give you a clue as to what's happening (e.g. maybe your mail system or other notification endpoint has some sort of rate limiting??).
>>>>>
>>>>> However, if the vector(0) expression *does* send repeated alerts successfully, then your problem is most likely something to do with your actual alerting expr, and you'll need to break it down into simpler pieces to debug it.
>>>>>
>>>>> Apart from that, all I can say is "it works for me™": if an alerting expression subsequently generates a second alert in its result vector, then I get another alert after group_interval.
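A minimal sketch of that debugging setup, for readers following along - the rule below is the kind of always-firing test rule described above, and the grep narrows the alertmanager self-metrics down to the notification counters (the group and alert names are made up for the example):

  # test-alerts.yml, loaded via rule_files: in prometheus.yml
  groups:
    - name: debug
      rules:
        - alert: AlwaysFiring
          expr: vector(0)

  # On the alertmanager host/pod, watch the notification counters:
  curl -s localhost:9093/metrics | grep -E 'alertmanager_notifications(_failed)?_total'

If the AlwaysFiring notification repeats on schedule but the real alert doesn't, the problem lies in the alerting expression rather than in alertmanager. One timing detail worth knowing: repeats are only sent when a group is flushed, i.e. on a group_interval tick, so the effective repeat period is roughly repeat_interval rounded up to the next flush - which is why the 30s/2m/5m settings above show up as a notification about every 6 minutes.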
>>>>>
>>>>> On Monday, 27 June 2022 at 09:39:45 UTC+1 [email protected] wrote:
>>>>>
>>>>>> Hi Brian,
>>>>>>
>>>>>> Thanks for your reply! To be honest, you can pretty much ignore that first part of the expression; it doesn't change anything in the "repeat" behaviour. In fact, we don't even have that bit at the moment - it's just something I've been playing with in order to capture that very first springing into existence of the metric, which isn't covered by the current expression, sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0.
>>>>>>
>>>>>> Also, I've already done the PromQL graphing that you suggested; I could see those multiple lines that you were talking about, but then there was no alert firing... 🤷♂️
>>>>>>
>>>>>> Any other pointers?
>>>>>>
>>>>>> Thanks,
>>>>>> Ionel
>>>>>>
>>>>>> On Saturday, 25 June 2022 at 16:52:17 UTC+1 Brian Candler wrote:
>>>>>>
>>>>>>> Try putting the whole alerting "expr" into the PromQL query browser, and switching to graph view.
>>>>>>>
>>>>>>> This will show you the alert vector graphically, with a separate line for each alert instance. If this isn't showing multiple lines, then you won't receive multiple alerts. Then you can break down your query into parts, try them individually, to try to understand why it's not working as you expect.
>>>>>>>
>>>>>>> Looking at just part of your expression:
>>>>>>>
>>>>>>>   sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0
>>>>>>>
>>>>>>> And taking just the part inside sum():
>>>>>>>
>>>>>>>   error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m
>>>>>>>
>>>>>>> This expression is weird. It will only generate a value when the error counter first springs into existence. As soon as it has existed for more than 1 minute - even with value zero - the "unless" clause will suppress the expression completely, i.e. it will be an empty instance vector.
>>>>>>>
>>>>>>> I think this is probably not what you want. In any case, it's not a good idea to have timeseries which come and go; it's very awkward to alert on a timeseries appearing or disappearing, and you may have problems with staleness, i.e. the timeseries may continue to exist for 5 minutes after you've stopped generating points in it.
>>>>>>>
>>>>>>> It's much better to have a timeseries which continues to exist. That is, "error_counter" should spring into existence with value 0, increment when errors occur, and stop incrementing when errors don't occur - but continue to keep the value it had before.
>>>>>>>
>>>>>>> If your error_counter timeseries *does* exist continuously, then this 'unless' clause is probably not what you want.
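As a reference for later readers, this is roughly what the rule reduces to once the counter is exported continuously (initialised at 0 and only ever incremented): just the rate clause, with no 'unless ... offset' part. The group and alert names here are illustrative placeholders, not the ones actually used in the thread:

  groups:
    - name: myservice-errors
      rules:
        - alert: MyServiceErrors
          expr: sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0

With a counter that always exists, a burst of new errors shows up as a non-zero rate for the length of the range window, and the first appearance of the series no longer needs special handling.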
>>>>>>>
>>>>>>> On Saturday, 25 June 2022 at 15:42:08 UTC+1 [email protected] wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm trying to set up some alerts that fire on critical errors, so I'm aiming for reporting that is as immediate & consistent as possible.
>>>>>>>>
>>>>>>>> So for that matter, I defined the alert rule without a *for* clause:
>>>>>>>>
>>>>>>>>   groups:
>>>>>>>>     - name: Test alerts
>>>>>>>>       rules:
>>>>>>>>         - alert: MyService Test Alert
>>>>>>>>           expr: 'sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0 or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
>>>>>>>>
>>>>>>>> Prometheus is configured to scrape & evaluate at 10 s:
>>>>>>>>
>>>>>>>>   global:
>>>>>>>>     scrape_interval: 10s
>>>>>>>>     scrape_timeout: 10s
>>>>>>>>     evaluation_interval: 10s
>>>>>>>>
>>>>>>>> And the alert manager (docker image quay.io/prometheus/alertmanager:v0.23.0) is configured with these parameters:
>>>>>>>>
>>>>>>>>   route:
>>>>>>>>     group_by: ['alertname', 'node_name']
>>>>>>>>     group_wait: 30s
>>>>>>>>     group_interval: 1m   # used to be 5m
>>>>>>>>     repeat_interval: 2m  # used to be 3h
>>>>>>>>
>>>>>>>> Now what happens when testing is this:
>>>>>>>> - on the very first metric generated, the alert fires as expected;
>>>>>>>> - on subsequent tests it stops firing;
>>>>>>>> - *I kept on running a new test each minute for 20 minutes, but no alert fired again*;
>>>>>>>> - I can see the alert state going into *FIRING* in the alerts view in the Prometheus UI;
>>>>>>>> - I can see the metric values getting generated when executing the expression query in the Prometheus UI.
>>>>>>>>
>>>>>>>> Redid the same test suite after a 2 hour break & exactly the same thing happened, including the fact that *the alert fired on the first test!*
>>>>>>>>
>>>>>>>> What am I missing here? How can I make the alert manager fire that alert on repeated error metric hits? Ok, it doesn't have to be as soon as 2m, but let's consider that for testing's sake.
>>>>>>>>
>>>>>>>> Pretty please, any advice is much appreciated!
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Ionel
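For completeness, a sketch of how a route block like the one above fits into a full alertmanager.yml, using the timings Ionel ended up with later in the thread. The receiver name "pager" and the Slack integration are taken from the log lines quoted above, but the channel and webhook URL are placeholders:

  route:
    receiver: pager
    group_by: ['alertname', 'node_name']
    group_wait: 30s
    group_interval: 2m
    repeat_interval: 5m

  receivers:
    - name: pager
      slack_configs:
        - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
          channel: '#alerts'
          send_resolved: true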

