@Brian Can you confirm that one of the main reasons each Alertmanager in a cluster needs to handle the same set of alerts as the others (and hence be plugged into all the same Prometheus servers) is the way deduplication works? It's because of the 5s delay multiplied by the position of the Alertmanager in the gossip cluster, right?

- It would be cool to be able to tell each Alertmanager: this is my "alert-family". The deduplication mechanism would then only occur among the members sharing the same "alert-family" value. That way, we would no longer be forced to connect 20 Prometheus servers to every Alertmanager, and all of them could still gossip the valuable silences.
- And/or being able to give amtool a list of clusters, so that it can also handle several clusters.
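To make the question concrete, here is a minimal sketch of how I understand the delay, plus the proposed "alert-family" filter. It assumes the delay is `position * peer_timeout`, with the position being the index of the node in the sorted list of peer names it knows about, and it uses the 5s value from my question. The `alert-family` label and the `dedup_peers` helper are hypothetical; nothing like them exists in Alertmanager today.

```python
from datetime import timedelta


def notify_delay(peer_names, me, peer_timeout=timedelta(seconds=5)):
    """Delay before `me` notifies, assuming delay = position * peer_timeout.

    The position is taken as the index of `me` among the sorted peer names.
    """
    position = sorted(peer_names).index(me)
    return position * peer_timeout


def dedup_peers(peers, me):
    """Hypothetical: keep only the peers sharing the alert-family of `me`.

    `peers` maps a peer name to its (made-up) alert-family value.
    """
    family = peers[me]
    return sorted(name for name, fam in peers.items() if fam == family)


# Three Alertmanagers gossiping together, two alert families.
peers = {"am-eu-1": "eu", "am-eu-2": "eu", "am-us-1": "us"}

# Today: every peer counts, so am-us-1 waits behind both EU nodes.
print(notify_delay(peers, "am-us-1"))

# With the proposal: am-us-1 only deduplicates within its own family.
print(notify_delay(dedup_peers(peers, "am-us-1"), "am-us-1"))
```

With the family filter, silences could still be gossiped across all three nodes, while the wait would only be computed among the nodes that actually receive the same alerts.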
@Matthias Yes, the main benefits I see with your proposal are:

- *Silences* are automatically propagated to all nodes
- The *group_by* becomes a "real" global group_by (instead of having 10 sub-optimal group_by that are only capable of grouping by region)

So this is very appealing. However, I'm worried that this solution exposes me more to network partitions. With Alertmanagers that are far away from each other, it is more likely that one node becomes totally isolated, and hence stops deduplicating and sends more alerts. With 2 Alertmanagers in the same region, such partitions are still possible (misconfiguration, firewall, local issue), but much less likely.

As I understand it, the memberlist library handles network partitions very well? But I guess only when a node is partially isolated. Does someone have more information to share about the behaviour of that gossip protocol during network partitions? There is low-level documentation in the source code of the library, but I'm struggling to find higher-level documentation for it. For now, I know that there are awareness/suspicion mechanisms, but I'm not sure exactly how they work. If Alertmanager is (through the memberlist library) highly aware that it is not working properly, will it decide to stop sending alerts?
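The partition scenario I'm worried about can be sketched with a toy model. It assumes each node computes its position only among the peers it can currently see, that the node at position 0 notifies immediately, and that the other nodes in the same partition see that notification and stay quiet. The `senders` function and the node names are made up for illustration; this does not model memberlist's suspicion mechanism at all.

```python
def senders(partitions):
    """Return the nodes that end up notifying for one alert group.

    `partitions` is a list of groups of node names; a node only sees
    the peers inside its own group. In each group, the node sorted
    first has position 0 and sends without waiting for anyone.
    """
    return sorted(min(group) for group in partitions)


# Healthy cluster: one notification.
print(senders([["am-eu", "am-us", "am-asia"]]))

# am-us is totally isolated: it sees itself at position 0 and also sends.
print(senders([["am-eu", "am-asia"], ["am-us"]]))
```

In this model every isolated node produces one extra notification, which is why I would like to know whether a node that suspects it is isolated holds back or sends anyway.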

