It's correct for Prometheus to send alerts to both alertmanagers, but I 
suspect you haven't got the alertmanagers clustered together correctly.
See: 
https://prometheus.io/docs/alerting/latest/alertmanager/#high-availability

Make sure you've configured the cluster flags, and check your alertmanager 
container logs for messages relating to clustering or "gossip".
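
As a rough sketch (the peer addresses below are only an illustration based on 
your pod names, not something I can verify), each alertmanager instance needs 
a --cluster.listen-address and one --cluster.peer flag per other instance:

    # peer addresses are illustrative; substitute whatever DNS names your pods
    # can actually resolve each other by (e.g. via a headless service)
    alertmanager --config.file=/etc/alertmanager/alertmanager.yml \
      --cluster.listen-address=0.0.0.0:9094 \
      --cluster.peer=prometheus-alertmanager-0.alertmanager-headless:9094 \
      --cluster.peer=prometheus-alertmanager-1.alertmanager-headless:9094

Once the gossip cluster is formed, both instances still receive every alert 
from Prometheus, but they deduplicate between themselves so only one 
notification goes out.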

On Tuesday, 28 June 2022 at 16:12:53 UTC+1 [email protected] wrote:

> Hi Brian,
>
> So my previous assumption proved to be correct - it was in fact the 
> alertmanager settings that weren't getting properly applied on the fly. 
> Today I ensured they were applied in a guaranteed way & I can see the 
> alerts firing every 6 minutes now, for these settings:
>     group_wait: 30s
>     group_interval: 2m
>     repeat_interval: 5m
>
> Now I'm trying to sort out the fact that the alerts fire twice each time. 
> We have some form of HA in place where we spawn 2 pods for the 
> alertmanager, and looking at their logs I can see that each container fires 
> the alert, which explains why I see 2 of them:
>
>
> prometheus-alertmanager-0 level=debug ts=2022-06-28T14:27:40.121Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1
> prometheus-alertmanager-1 level=debug ts=2022-06-28T14:27:40.418Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1
>
> Any idea why that is?
>
> Thank you!
> On Monday, 27 June 2022 at 17:20:29 UTC+1 Brian Candler wrote:
>
>> Look at container logs then.
>>
>> Metrics include things like the number of notifications attempted, 
>> succeeded and failed.  Those would be the obvious first place to look.  
>> (For example: is it actually trying to send a mail? if so, is it succeeding 
>> or failing?)
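>>
>> For instance, these two counters from alertmanager's own /metrics endpoint 
>> (names as exposed by recent versions; double-check against yours) break 
>> attempts and failures down per integration:
>>
>>     curl -s localhost:9093/metrics | grep alertmanager_notifications
>>
>> That should show alertmanager_notifications_total and 
>> alertmanager_notifications_failed_total for each configured integration.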
>>
>> Aside: vector(0) and vector(1) are the same for generating alerts. It's 
>> only the presence of a value that triggers an alert; the actual value 
>> itself can be anything.
>>
>> On Monday, 27 June 2022 at 16:28:46 UTC+1 [email protected] wrote:
>>
>>> Ok, added a rule with an expression of *vector(1)*, went live at 12:31, 
>>> when it fired 2 alerts  (?!), but then went completely silent until 15:36, 
>>> when it fired again 2x (so more than 3 h in). The alert has been stuck in 
>>> the *FIRING* state the whole time, as expected.
>>> Unfortunately the logs don't shed any light - there's nothing logged 
>>> aside from the bootstrap logs. It isn't a systemd process - it's run in a 
>>> container & there seems to be just a big executable in there.
>>> The meta-metrics contain quite a lot of data - anything in particular I 
>>> should be looking for?
>>>
>>> Either way, I'm now inclined to believe that this is an *alertmanager* 
>>> settings matter. As I mentioned in my initial email, I've already tweaked 
>>> *group_wait*, *group_interval* & *repeat_interval*, but they probably 
>>> didn't take effect as I thought they would. So maybe that's something I 
>>> need to sort out. And better logging should help me understand all of 
>>> that, which I still need to figure out how to do.
>>>
>>> Thank you very much for your help Brian!
>>>
>>> On Monday, 27 June 2022 at 09:59:59 UTC+1 Brian Candler wrote:
>>>
>>>> I suspect the easiest way to debug this is to focus on "*repeat_interval: 
>>>> 2m*".  Even if a single alert is statically firing, you should get the 
>>>> same notification resent every 2 minutes.  So don't worry about catching 
>>>> second instances of the same expr; just set a simple alerting expression 
>>>> which fires continuously, say just "expr: vector(0)", to find out why it's 
>>>> not resending.
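>>>>
>>>> For example, a throwaway rule along these lines (group and alert names 
>>>> here are just placeholders, not taken from your setup):
>>>>
>>>>     groups:
>>>>       - name: debug
>>>>         rules:
>>>>           - alert: AlwaysFiring   # fires continuously; the value is irrelevant
>>>>             expr: vector(0)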
>>>>
>>>> You can then look at logs from alertmanager (e.g. "journalctl -eu 
>>>> alertmanager" if running under systemd). You can also look at the metrics 
>>>> alertmanager itself generates:
>>>>
>>>>     curl localhost:9093/metrics | grep alertmanager
>>>>
>>>> Hopefully, one of these may give you a clue as to what's happening 
>>>> (e.g. maybe your mail system or other notification endpoint has some sort 
>>>> of rate limiting??).
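>>>>
>>>> If the default log output is too quiet, you can also raise the verbosity 
>>>> (flag as documented for recent alertmanager releases; check --help on 
>>>> your version):
>>>>
>>>>     alertmanager --config.file=alertmanager.yml --log.level=debug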
>>>>
>>>> However, if the vector(0) expression *does* send repeated alerts 
>>>> successfully, then your problem is most likely something to do with your 
>>>> actual alerting expr, and you'll need to break it down into simpler pieces 
>>>> to debug it.
>>>>
>>>> Apart from that, all I can say is "it works for me™": if an alerting 
>>>> expression subsequently generates a second alert in its result vector, 
>>>> then 
>>>> I get another alert after group_interval.
>>>>
>>>> On Monday, 27 June 2022 at 09:39:45 UTC+1 [email protected] wrote:
>>>>
>>>>> Hi Brian,
>>>>>
>>>>> Thanks for your reply! To be honest, you can pretty much ignore the 
>>>>> first part of the expression; it doesn't change anything about the 
>>>>> "repeat" behaviour. In fact, we don't even have that bit at the moment - 
>>>>> it's just something I've been playing with in order to capture the 
>>>>> metric's very first springing into existence, which isn't covered by the 
>>>>> current expression, 
>>>>> *sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0*.
>>>>> Also, I've already done the PromQL graphing that you suggested; I could 
>>>>> see those multiple lines you were talking about, but then there was no 
>>>>> alert firing... 🤷‍♂️
>>>>>
>>>>> Any other pointers?
>>>>>
>>>>> Thanks,
>>>>> Ionel
>>>>>
>>>>> On Saturday, 25 June 2022 at 16:52:17 UTC+1 Brian Candler wrote:
>>>>>
>>>>>> Try putting the whole alerting "expr" into the PromQL query browser, 
>>>>>> and switching to graph view.
>>>>>>
>>>>>> This will show you the alert vector graphically, with a separate line 
>>>>>> for each alert instance.  If this isn't showing multiple lines, then you 
>>>>>> won't receive multiple alerts.  From there you can break the query down 
>>>>>> into parts and try them individually, to understand why it's not working 
>>>>>> as you expect.
>>>>>>
>>>>>> Looking at just part of your expression:
>>>>>>
>>>>>> *sum(error_counter{service="myservice",other="labels"} unless 
>>>>>> error_counter{service="myservice",other="labels"} offset 1m) > 0*
>>>>>>
>>>>>> And taking just the part inside sum():
>>>>>>
>>>>>> *error_counter{service="myservice",other="labels"} unless 
>>>>>> error_counter{service="myservice",other="labels"} offset 1m*
>>>>>>
>>>>>> This expression is weird. It will only generate a value when the 
>>>>>> error counter first springs into existence.  As soon as it has existed 
>>>>>> for 
>>>>>> more than 1 minute - even with value zero - then the "unless" clause will 
>>>>>> suppress the expression completely, i.e. it will be an empty instance 
>>>>>> vector.
>>>>>>
>>>>>> I think this is probably not what you want.  In any case it's not a 
>>>>>> good idea to have timeseries which come and go; it's very awkward to 
>>>>>> alert 
>>>>>> on a timeseries appearing or disappearing, and you may have problems 
>>>>>> with 
>>>>>> staleness, i.e. the timeseries may continue to exist for 5 minutes after 
>>>>>> you've stopped generating points in it.
>>>>>>
>>>>>> It's much better to have a timeseries which continues to exist.  That 
>>>>>> is, "error_counter" should spring into existence with value 0, and 
>>>>>> increment when errors occur, and stop incrementing when errors don't 
>>>>>> occur 
>>>>>> - but continue to keep the value it had before.
>>>>>>
>>>>>> If your error_counter timeseries *does* exist continuously, then this 
>>>>>> 'unless' clause is probably not what you want.
>>>>>>
>>>>>> On Saturday, 25 June 2022 at 15:42:08 UTC+1 [email protected] 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm trying to set up some alerts that fire on critical errors, so 
>>>>>>> I'm aiming for reporting that is as immediate & consistent as possible.
>>>>>>>
>>>>>>> To that end, I defined the alert rule without a *for* clause:
>>>>>>>
>>>>>>>     groups:
>>>>>>>     - name: Test alerts
>>>>>>>       rules:
>>>>>>>       - alert: MyService Test Alert
>>>>>>>         expr: 'sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0 or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
>>>>>>>
>>>>>>> Prometheus is configured to scrape & evaluate every 10s:
>>>>>>>
>>>>>>>     global:
>>>>>>>       scrape_interval: 10s
>>>>>>>       scrape_timeout: 10s
>>>>>>>       evaluation_interval: 10s
>>>>>>>
>>>>>>> And the alertmanager (docker image 
>>>>>>> *quay.io/prometheus/alertmanager:v0.23.0*) is configured with these 
>>>>>>> parameters:
>>>>>>>
>>>>>>>     route:
>>>>>>>       group_by: ['alertname', 'node_name']
>>>>>>>       group_wait: 30s
>>>>>>>       group_interval: 1m   # used to be 5m
>>>>>>>       repeat_interval: 2m  # used to be 3h
>>>>>>>
>>>>>>> Now what happens when testing is this:
>>>>>>> - on the very first metric generated, the alert fires as expected;
>>>>>>> - on subsequent tests it stops firing;
>>>>>>> - *I kept on running a new test each minute for 20 minutes, but no 
>>>>>>> alert fired again*;
>>>>>>> - I can see the alert state going into *FIRING* in the alerts view 
>>>>>>> in the Prometheus UI;
>>>>>>> - I can see the metric values getting generated when executing the 
>>>>>>> expression query in the Prometheus UI;
>>>>>>>
>>>>>>> Redid the same test suite after a 2-hour break & exactly the same 
>>>>>>> thing happened, including the fact that *the alert fired on the 
>>>>>>> first test!*
>>>>>>>
>>>>>>> What am I missing here? How can I make the alertmanager fire that 
>>>>>>> alert on repeated error metric hits? It doesn't have to be as often 
>>>>>>> as every 2 minutes, but let's go with that for testing's sake.
>>>>>>>
>>>>>>> Pretty please, any advice is much appreciated!
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Ionel
>>>>>>>
>>>>>>
