[prometheus-users] Re: Failed ping in HA alertmanagers

'Ionel Sirbu' via Prometheus Users Tue, 23 Aug 2022 01:46:53 -0700

Any thoughts on this, anyone?

On Friday, 5 August 2022 at 11:38:06 UTC+1 Ionel Sirbu wrote:


> Hello all,
>
> We've recently configured our alertmanagers to be HA as per the specs:
> - 3 instances, using a kubernetes statefulset;
> - both TCP & UDP opened for the HA cluster port:
>
>     ports:
>     - containerPort: 8001
>       name: service
>       protocol: TCP
>     - containerPort: 8002
>       name: ha-tcp
>       protocol: TCP
>     - containerPort: 8002
>       name: ha-udp
>       protocol: UDP
>
> - all 3 instances point to instance 0 for clustering (I assumed there 
> wouldn't be a problem with instance 0 pointing to itself):
>
>     spec:
>       containers:
>       - args:
>         // ...
>         - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
>         image: quay.io/prometheus/alertmanager:v0.23.0
>
> - prometheus points to the 3 alertmanager instances:
>
>     alertmanagers:
>     - static_configs:
>       - targets:
>         - testprom-am-0.testprom-am.default.svc.cluster.local:8001
>         - testprom-am-1.testprom-am.default.svc.cluster.local:8001
>         - testprom-am-2.testprom-am.default.svc.cluster.local:8001
>
> However, despite all that, we keep getting errors like this rather often 
> (e.g. 124 within 30 minutes):
>
>     level=debug ts=2022-08-04T12:03:19.284Z caller=cluster.go:329 component=cluster memberlist="2022/08/04 12:03:19 [DEBUG] memberlist: Failed ping: 01G9M3WYRFHA0DCCWRVERYJX2A (timeout reached)\n"
>
> Is that something to worry about? Is there anything more that needs to be 
> configured with regard to HA?
> With one exception, alerts seem to work just fine: when we do a rolling 
> upgrade of the Kubernetes cluster, previous alerts suddenly fire again. 
> Any idea what could be causing that?
>
> Many thanks,
> Ionel
>
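
For comparison with the setup quoted above, here is a minimal sketch of an alternative peer configuration in which every replica lists all three peers instead of only instance 0. This is an assumption for illustration, not something confirmed in this thread to stop the failed pings or the re-fired alerts. `--cluster.peer` may be given more than once; the `--cluster.listen-address` value is assumed from the port 8002 used above, and the elided args stay elided.

```yaml
# Sketch only: container args for each Alertmanager replica in the statefulset.
# Assumption: each replica lists all three peers, so joining the cluster
# does not depend on instance 0 being reachable at startup.
spec:
  containers:
  - args:
    # ... (other args unchanged)
    # Assumed listen address, matching the HA port 8002 opened above.
    - --cluster.listen-address=0.0.0.0:8002
    - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
    - --cluster.peer=testprom-am-1.testprom-am.default.svc.cluster.local:8002
    - --cluster.peer=testprom-am-2.testprom-am.default.svc.cluster.local:8002
    image: quay.io/prometheus/alertmanager:v0.23.0
```

With this layout a replica restarted during a rolling upgrade can rejoin through any peer that is still up, which may be worth checking against the re-firing behaviour described above.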

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/cd51ce55-20f9-4e1c-8045-23a59584c611n%40googlegroups.com.
