Hello all,

We've recently configured our Alertmanagers to be HA as per the docs:
- 3 instances, using a Kubernetes StatefulSet;
- both TCP & UDP opened for the HA cluster port:

    ports:
    - containerPort: 8001
      name: service
      protocol: TCP
    - containerPort: 8002
      name: ha-tcp
      protocol: TCP
    - containerPort: 8002
      name: ha-udp
      protocol: UDP
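For completeness, there's a headless service governing the StatefulSet, roughly like the sketch below (the service name and namespace are taken from the DNS entries further down; the selector and port list are simplified, not our exact manifest). It's what makes the per-pod names like testprom-am-0.testprom-am.default.svc.cluster.local resolvable:

apiVersion: v1
kind: Service
metadata:
  name: testprom-am
  namespace: default
spec:
  clusterIP: None            # headless: gives each pod a stable DNS name
  selector:
    app: testprom-am         # assumed label, simplified from our manifest
  ports:
  - name: service
    port: 8001
    protocol: TCP
  - name: ha-tcp
    port: 8002
    protocol: TCP
  - name: ha-udp
    port: 8002
    protocol: UDP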

- all 3 instances point to instance 0 for clustering (I assumed there 
wouldn't be a problem with instance 0 pointing to itself):

spec:
  containers:
  - args:
    # ...
    - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
    image: quay.io/prometheus/alertmanager:v0.23.0
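Would it change anything if every replica listed all peers explicitly instead of only instance 0? A sketch of what I mean, using the same hostnames (not something we run today):

spec:
  containers:
  - args:
    # ...
    - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
    - --cluster.peer=testprom-am-1.testprom-am.default.svc.cluster.local:8002
    - --cluster.peer=testprom-am-2.testprom-am.default.svc.cluster.local:8002
    image: quay.io/prometheus/alertmanager:v0.23.0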

- Prometheus points to the 3 Alertmanager instances:

alertmanagers:
- static_configs:
  - targets:
    - testprom-am-0.testprom-am.default.svc.cluster.local:8001
    - testprom-am-1.testprom-am.default.svc.cluster.local:8001
    - testprom-am-2.testprom-am.default.svc.cluster.local:8001

However, despite all that, we keep getting failed-ping messages like this rather
often (e.g. 124 within 30 minutes):

level=debug ts=2022-08-04T12:03:19.284Z caller=cluster.go:329 component=cluster memberlist="2022/08/04 12:03:19 [DEBUG] memberlist: Failed ping: 01G9M3WYRFHA0DCCWRVERYJX2A (timeout reached)\n"

Is that something to worry about? Is there anything more that needs to be
configured with regard to HA?
Apart from one particular case, alerts seem to work just fine: when we do a
rolling upgrade of the Kubernetes cluster, previously fired alerts suddenly
fire again. Any idea what could be causing that?

Many thanks,
Ionel
