Hello all,

We've recently configured our Alertmanagers to be HA as per the specs:

- 3 instances, using a Kubernetes StatefulSet;
- both TCP & UDP opened for the HA cluster port:
    ports:
    - containerPort: 8001
      name: service
      protocol: TCP
    - containerPort: 8002
      name: ha-tcp
      protocol: TCP
    - containerPort: 8002
      name: ha-udp
      protocol: UDP

- all 3 instances point to instance 0 for clustering (I assumed there wouldn't be a problem with instance 0 pointing to itself):

    spec:
      containers:
      - args:
        # ...
        - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
        image: quay.io/prometheus/alertmanager:v0.23.0

- Prometheus points to the 3 Alertmanager instances:

    alertmanagers:
    - static_configs:
      - targets:
        - testprom-am-0.testprom-am.default.svc.cluster.local:8001
        - testprom-am-1.testprom-am.default.svc.cluster.local:8001
        - testprom-am-2.testprom-am.default.svc.cluster.local:8001

However, despite all that, we keep getting errors like this rather often (e.g. 124 within 30 minutes):

    level=debug ts=2022-08-04T12:03:19.284Z caller=cluster.go:329 component=cluster memberlist="2022/08/04 12:03:19 [DEBUG] memberlist: Failed ping: 01G9M3WYRFHA0DCCWRVERYJX2A (timeout reached)\n"

Is that something to worry about? Is there anything more that needs to be configured with regard to HA (for instance, should each instance list all of the peers, as in the P.S. below)?

With one exception, alerts seem to work just fine. The exception is that when we do a rolling upgrade of the Kubernetes cluster, previous alerts suddenly fire again. Any idea what could be causing that?

Many thanks,
Ionel
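P.S. For reference, this is roughly the variant I've been considering but haven't tested: have every instance list all three peers instead of only instance 0, so a pod that restarts while testprom-am-0 is being rolled can still rejoin the cluster through the other members. The --cluster.peer flag is repeatable; I'm assuming --cluster.listen-address is already set to 0.0.0.0:8002 somewhere behind the "..." in our current args, so I've spelled it out here.

    # Sketch only (untested) - same StatefulSet container, with all peers listed.
    spec:
      containers:
      - args:
        # ...
        - --cluster.listen-address=0.0.0.0:8002
        - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
        - --cluster.peer=testprom-am-1.testprom-am.default.svc.cluster.local:8002
        - --cluster.peer=testprom-am-2.testprom-am.default.svc.cluster.local:8002
        image: quay.io/prometheus/alertmanager:v0.23.0

As far as I understand, memberlist gossips over both TCP and UDP on that cluster port, which is why we opened 8002 for both protocols in the ports section above.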

