Thank you for the further clarification. I think the crux of my issue was 
(wrongly) assuming that the documentation was telling me not to use a 
load balancer purely for HA/network-partitioning reasons, and not because 
full Alertmanager cluster state isn't being gossiped. I may put a PR up 
on Monday to clarify this in the docs; it would have saved us a bit of 
time debugging.

On Saturday, December 4, 2021 at 7:52:21 PM UTC-5 [email protected] 
wrote:

> The technical reason for this admonition is in how the 
> Prometheus-Alertmanager complex implements high availability notifications.
>
> The design goal is to send a notification in all possible circumstances, 
> and *if possible* only send one.
>
> By spraying alerts to the list of all Alertmanager instances, each of 
> these *can* send the notification even if Alertmanager clustering is 
> completely broken, for example due to network partitions, misconfiguration, 
> or some Alertmanager instances being unable to send out the notification.
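
For reference, the fan-out described above is just a static list in Prometheus's `alerting` block (the hostnames here are placeholders); every Prometheus instance sends each alert event directly to every target, with no load balancer in between:

```yaml
# prometheus.yml (fragment) -- hostnames are hypothetical.
# Each Prometheus sprays every alert event to *all* Alertmanager
# instances, rather than to a single VIP in front of them.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0.example.internal:9093
            - alertmanager-1.example.internal:9093
            - alertmanager-2.example.internal:9093
```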
>
> Worst case, you get multiple notifications, one from each Alertmanager. 
> Some downstream services, like PagerDuty, do their own deduplication, 
> so you may not even notice. In other cases, like Slack or email, you get 
> duplicate notifications, but that's much better than none!
>
> Every time Prometheus evaluates an alert rule and finds it firing, it 
> sends an event to every Alertmanager it knows about, with an endsAt time 
> a few minutes in the future. As long as the alert keeps firing, each new 
> event pushes endsAt a few minutes further out.
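
A minimal sketch of that refresh behaviour, with illustrative timings (the actual resolve window is configurable; "a few minutes" per the explanation above):

```python
from datetime import datetime, timedelta, timezone

RESOLVE_WINDOW = timedelta(minutes=4)  # illustrative "a few minutes"

def alert_event(name: str, now: datetime) -> dict:
    """Model the event Prometheus re-sends on every evaluation cycle.

    Each re-send pushes endsAt a few minutes further into the future,
    so the alert only expires if the refreshes stop arriving.
    """
    return {
        "labels": {"alertname": name},
        "endsAt": (now + RESOLVE_WINDOW).isoformat(),
    }

def looks_resolved(event: dict, now: datetime) -> bool:
    # An Alertmanager that stops receiving refreshes eventually sees
    # endsAt in the past and treats the alert as resolved.
    return datetime.fromisoformat(event["endsAt"]) <= now
```

As long as a fresh event arrives at least once per window, `looks_resolved` stays false; a gap in refreshes longer than the window flips it, whether or not the alert is still firing.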
>
> Each Alertmanager individually determines which notifications (firing or 
> resolved) should be sent. When clustering works, Alertmanagers 
> communicate which notifications have already been sent, so you only get 
> one of each in the happy case.
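
One way to picture the happy case is a deliberately simplified model of the gossiped notification log (this is a toy, not Alertmanager's actual implementation):

```python
# Toy model: each Alertmanager consults a shared (gossiped) notification
# log before sending. With gossip working, only the first sender actually
# notifies; the rest see the entry and stay quiet.

class Alertmanager:
    def __init__(self, name: str, shared_log: set):
        self.name = name
        self.log = shared_log  # stands in for the gossiped notification log

    def maybe_notify(self, alert_key: str):
        if alert_key in self.log:   # someone already sent this one
            return None
        self.log.add(alert_key)     # "gossip" that we sent it
        return f"{self.name} notified for {alert_key}"

gossip = set()
cluster = [Alertmanager(f"am-{i}", gossip) for i in range(3)]
sent = [am.maybe_notify("HighErrorRate") for am in cluster]
# With a shared log: exactly one notification. With a partitioned
# cluster (each instance keeping its own log), each one would send.
```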
>
> If you add a load balancer, only one Alertmanager will know that this 
> alert even happened, and if for some reason it can't reach you, you may 
> never know there was a problem.
>
> This is somewhat mitigated in your case because Prometheus sends a new 
> event on every rule evaluation cycle. Eventually, these events will 
> randomly reach every Alertmanager instance, but not necessarily in time 
> to prevent the last event from timing out. These differing timeouts are 
> what you have observed as different endsAt times.
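
That spurious "resolved" effect can be reproduced with a toy timeline: a random load balancer hands each refresh to one instance, so any given instance can go several cycles without a refresh and watch its last endsAt expire (all timings and the instance count are illustrative):

```python
import random

EVAL_INTERVAL = 1    # minutes between rule evaluations (illustrative)
RESOLVE_WINDOW = 4   # minutes each event's endsAt lies in the future
N_INSTANCES = 3

def spurious_resolves(load_balanced: bool, cycles: int = 200,
                      seed: int = 42) -> int:
    """Count how often some instance sees the alert expire mid-incident."""
    rng = random.Random(seed)
    ends_at = [None] * N_INSTANCES   # last endsAt seen by each instance
    spurious = 0
    for cycle in range(cycles):
        now = cycle * EVAL_INTERVAL
        # Deliver this cycle's refresh: to one random instance behind a
        # load balancer, or to all instances with direct fan-out.
        targets = ([rng.randrange(N_INSTANCES)] if load_balanced
                   else range(N_INSTANCES))
        for i in targets:
            ends_at[i] = now + RESOLVE_WINDOW
        # Each instance independently checks for expiry.
        for i in range(N_INSTANCES):
            if ends_at[i] is not None and ends_at[i] <= now:
                spurious += 1        # would send "resolved" while firing
                ends_at[i] = None
    return spurious
```

In this toy model, fan-out never lets endsAt lapse on any instance, while the random balancer routinely starves an instance of refreshes for longer than the window, producing exactly the kind of spurious resolutions and divergent endsAt times described above.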
>
> So the underlying reason is as you say – high availability and network 
> partitioning. The architecture to achieve that, with Prometheus repeatedly 
> sending short-term events, means that randomly load balancing these to only 
> one of the Alertmanager instances will lead to weird effects including 
> spurious "resolved" notifications.
>
> /MR
>
>
> On Sat, Dec 4, 2021, 19:17 Brian Candler <[email protected]> wrote:
>
>> Just to note what it says here 
>> <https://prometheus.io/docs/alerting/latest/alertmanager/#high-availa>:
>>
>> *It's important not to load balance traffic between Prometheus and its 
>> Alertmanagers, but instead, point Prometheus to a list of all 
>> Alertmanagers.*
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/22ae33ed-429c-4783-8aaa-44c749bf26abn%40googlegroups.com.
