Hello.

First of all, English is not my native language, so please excuse me if I 
cannot explain myself well enough.

I am facing the following situation:

- Several Prometheus servers deployed in clusters dedicated to each 
environment (dev, itg, pre, pro), federated into a central one (utils). 
All of them run in HA configuration.
- Each Prometheus has external labels configured (dev, itg, pre, pro, 
utils), for example:
  externalLabels:
    cluster: dev-gke-cluster
    environment: dev
- Only one Alertmanager, deployed alongside the central Prometheus, also 
in HA configuration.
- honor_labels is enabled for the federated targets.

Prometheus was deployed with the prometheus-operator Helm chart.
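For context, the federation job on the central Prometheus looks roughly 
like this (job name, match[] selectors, and target hostnames here are 
illustrative, not our exact config):

```yaml
# Sketch of the central (utils) Prometheus federation scrape job.
# Job name, match[] selector, and targets are illustrative placeholders.
scrape_configs:
  - job_name: 'federate'
    honor_labels: true        # keep labels as exposed by the source Prometheus,
                              # including its external labels (cluster, environment)
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'       # pull everything; a real setup would be narrower
    static_configs:
      - targets:
          - 'prometheus.dev.example:9090'
          - 'prometheus.itg.example:9090'
```

With honor_labels set to true, the cluster and environment external labels 
from each source Prometheus do survive on the federated series themselves.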

The problem is that, with some of the prometheus-operator default alert 
rules, I am not able to tell which cluster an alert comes from, because 
the external labels get overwritten.
For example, with the KubePodNotReady alert:

In Prometheus alerts tab:

Annotations

message
Pod demo-apps-devops-back/fwk-springboot-service-example-969897fd4-6c6gd 
has been in a non-ready state for longer than 15 minutes.
runbook_url
https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready
alertname=KubePodNotReady namespace=elastic pod=filebeat-filebeat-m8rgl
severity=critical FIRING 2020-06-12T06:03:39.737453697Z 1e+00

In Alertmanager:

06:18:09, 2020-06-12 (UTC)  Info  Source
<http://prometheus.gcp.mercadona.com/graph?g0.expr=sum+by%28namespace%2C+pod%29+%28max+by%28namespace%2C+pod%29+%28kube_pod_status_phase%7Bjob%3D%22kube-state-metrics%22%2Cnamespace%3D~%22.%2A%22%2Cphase%3D~%22Pending%7CUnknown%22%7D%29+%2A+on%28namespace%2C+pod%29+group_left%28owner_kind%29+max+by%28namespace%2C+pod%2C+owner_kind%29+%28kube_pod_owner%7Bowner_kind%21%3D%22Job%22%7D%29%29+%3E+0&g0.tab=1>
Silence 
<https://alertmanager.gcp.mercadona.com/#/silences/new?filter=%7Balertname%3D%22KubePodNotReady%22%2C%20cloud%3D%22gcp%22%2C%20cluster%3D%22mdona-cloud-utils-gke-cluster%22%2C%20environment%3D%22utils%22%2C%20namespace%3D%22demo-apps-devops-back%22%2C%20pod%3D%22fwk-springboot-service-example-969897fd4-6c6gd%22%2C%20prometheus%3D%22prometheus%2Fprometheus-prometheus-oper-prometheus%22%2C%20region%3D%22europe-west%22%2C%20severity%3D%22critical%22%7D>
cloud="gcp"
cluster="utils-gke-cluster"
environment="utils"
namespace="demo-apps-devops-back"
pod="fwk-springboot-service-example-969897fd4-6c6gd"
prometheus="prometheus/prometheus-prometheus-oper-prometheus"
region="europe-west"
severity="critical"

This alert refers to a pod and namespace that exist not in the "utils" 
environment but in the "dev" one, even though we defined the environment 
external label. All the labels here belong to the "utils" Prometheus, 
where all the metrics are gathered and where the alerts are evaluated.

We have found that this happens whenever the alert rule expression 
contains any kind of aggregation, such as the one in this example:

sum by(namespace, pod) (
  max by(namespace, pod) (
    kube_pod_status_phase{job="kube-state-metrics",namespace=~".*",phase=~"Pending|Unknown"}
  )
  * on(namespace, pod) group_left(owner_kind)
  max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})
) > 0

This is a problem because namespaces with the same name exist in 
different clusters, and in other cases there is no way to be sure where 
the pod is located except by searching for it manually.
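The only workaround I can think of so far is to maintain modified copies 
of the upstream rules that carry the external labels through the 
aggregation. An untested sketch of such a PrometheusRule (label names 
match our external labels; adjust to your own setup):

```yaml
# Sketch: same logic as the upstream KubePodNotReady rule, but cluster
# and environment are added to every by()/on() clause so the aggregation
# does not drop them. Untested; group and rule names are placeholders.
groups:
  - name: kubernetes-apps-custom
    rules:
      - alert: KubePodNotReady
        expr: |
          sum by(cluster, environment, namespace, pod) (
            max by(cluster, environment, namespace, pod) (
              kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Unknown"}
            )
            * on(cluster, environment, namespace, pod) group_left(owner_kind)
            max by(cluster, environment, namespace, pod, owner_kind) (
              kube_pod_owner{owner_kind!="Job"}
            )
          ) > 0
        for: 15m
        labels:
          severity: critical
```

But maintaining forks of every default rule is painful, so I would prefer 
a cleaner solution if one exists.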

Is there any way to keep the original external labels in the final alert, 
or are they lost during expression evaluation and then replaced by the 
labels of the Prometheus that evaluates and sends the alert?

Thanks in advance for your assistance.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d98ca7a5-114d-41b7-b14d-482654322552o%40googlegroups.com.
