> Is there any way to keep the original external labels in the final alert,
> or are they lost in the expression evaluation, and then replaced by the
> labels from the Prometheus that evaluates and sends the alert?

I guess that is probably what is happening. Your best bet is to either
change the names of the external labels of the central (utils) Prometheus,
or to rewrite the labels of the child clusters at scrape time using
metric_relabel_configs on the federation job. Something like this:

    metric_relabel_configs:
    - {source_labels: [cluster], target_label: child_cluster, action: replace}

On Fri, Jun 12, 2020 at 8:34 PM dgarciad <[email protected]> wrote:

> Hello.
>
> First of all, English is not my native language, so excuse me if I cannot
> explain myself well enough.
>
> I am facing the following situation:
>
> - Several Prometheus servers deployed in clusters dedicated to development
>   environments (dev, itg, pre, pro), federated against a central one
>   (utils). All of them in HA configuration.
> - Each of the Prometheus servers has external labels configured (dev, itg,
>   pre, pro, utils), for example:
>
>       externalLabels:
>         cluster: dev-gke-cluster
>         environment: dev
>
> - Only 1 Alertmanager deployed alongside the central Prometheus, in HA
>   configuration.
> - honor_labels is enabled for the federated targets.
>
> Prometheus was deployed with the prometheus-operator Helm chart.
>
> The problem is that, with some of the prometheus-operator default alert
> rules, I am not able to tell where an alert comes from, because the
> external labels get overwritten.
> For example, with the KubePodNotReady alert:
>
> In the Prometheus alerts tab:
>
> Annotations
>
> message
> Pod demo-apps-devops-back/fwk-springboot-service-example-969897fd4-6c6gd
> has been in a non-ready state for longer than 15 minutes.
> runbook_url
> https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready
>
> alertname=KubePodNotReady namespace=elastic pod=filebeat-filebeat-m8rgl
> severity=critical FIRING 2020-06-12T06:03:39.737453697Z 1e+00
>
> In Alertmanager:
>
> 06:18:09, 2020-06-12 (UTC) Info
> Source
> <http://prometheus.gcp.mercadona.com/graph?g0.expr=sum+by%28namespace%2C+pod%29+%28max+by%28namespace%2C+pod%29+%28kube_pod_status_phase%7Bjob%3D%22kube-state-metrics%22%2Cnamespace%3D~%22.%2A%22%2Cphase%3D~%22Pending%7CUnknown%22%7D%29+%2A+on%28namespace%2C+pod%29+group_left%28owner_kind%29+max+by%28namespace%2C+pod%2C+owner_kind%29+%28kube_pod_owner%7Bowner_kind%21%3D%22Job%22%7D%29%29+%3E+0&g0.tab=1>
> Silence
> <https://alertmanager.gcp.mercadona.com/#/silences/new?filter=%7Balertname%3D%22KubePodNotReady%22%2C%20cloud%3D%22gcp%22%2C%20cluster%3D%22mdona-cloud-utils-gke-cluster%22%2C%20environment%3D%22utils%22%2C%20namespace%3D%22demo-apps-devops-back%22%2C%20pod%3D%22fwk-springboot-service-example-969897fd4-6c6gd%22%2C%20prometheus%3D%22prometheus%2Fprometheus-prometheus-oper-prometheus%22%2C%20region%3D%22europe-west%22%2C%20severity%3D%22critical%22%7D>
>
> cloud="gcp"
> cluster="utils-gke-cluster"
> environment="utils"
> namespace="demo-apps-devops-back"
> pod="fwk-springboot-service-example-969897fd4-6c6gd"
> prometheus="prometheus/prometheus-prometheus-oper-prometheus"
> region="europe-west"
> severity="critical"
>
> This alert refers to a pod and namespace that do not exist in the "utils"
> environment, but in the "dev" one, even though we defined the environment
> external label. All the labels here belong to the "utils" Prometheus,
> where all the metrics are gathered and from where the alerts are
> generated.
> We have found that this happens whenever the alert rule expression has
> any type of aggregation, as in the example:
>
>     sum by(namespace, pod) (
>       max by(namespace, pod) (
>         kube_pod_status_phase{job="kube-state-metrics",namespace=~".*",phase=~"Pending|Unknown"}
>       )
>       * on(namespace, pod) group_left(owner_kind)
>       max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})
>     ) > 0
>
> <https://prometheus.gcp.mercadona.com/new/graph?g0.expr=sum%20by(namespace%2C%20pod)%20(max%20by(namespace%2C%20pod)%20(kube_pod_status_phase%7Bjob%3D%22kube-state-metrics%22%2Cnamespace%3D~%22.*%22%2Cphase%3D~%22Pending%7CUnknown%22%7D)%20*%20on(namespace%2C%20pod)%20group_left(owner_kind)%20max%20by(namespace%2C%20pod%2C%20owner_kind)%20(kube_pod_owner%7Bowner_kind!%3D%22Job%22%7D))%20%3E%200&g0.tab=1&g0.stacked=0&g0.range_input=1h>
>
> This is a problem because there are namespaces with the same name in
> different clusters; and in other cases there is no way to be sure of the
> pod's location except by looking for it manually.
>
> Is there any way to keep the original external labels in the final alert,
> or are they lost in the expression evaluation and then replaced by the
> labels from the Prometheus that evaluates and sends the alert?
>
> Thanks in advance for your assistance.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/d98ca7a5-114d-41b7-b14d-482654322552o%40googlegroups.com

--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAHo%3DpzBW0zPxPHyPWcPuyv6pM2EfHTxPU1%3DQtpvh6db0UKuTKQ%40mail.gmail.com.
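Spelled out in full, the relabel approach suggested in the reply might look like the following scrape job on the central (utils) Prometheus. This is only a sketch: the job name, target address, match[] selector, and the extra labeldrop rule are assumptions for illustration, not taken from the thread.

```yaml
scrape_configs:
  # Hypothetical federation job on the central (utils) Prometheus;
  # names and addresses are illustrative.
  - job_name: 'federate-dev'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]': ['{job=~".+"}']
    static_configs:
      - targets: ['dev-prometheus:9090']
    metric_relabel_configs:
      # Copy the child's "cluster" label into "child_cluster" so it no
      # longer collides with the central Prometheus's external label.
      - source_labels: [cluster]
        target_label: child_cluster
        action: replace
      # Optionally drop the original label afterwards (an assumption;
      # the reply only mentions renaming).
      - regex: cluster
        action: labeldrop
```

The same pair of rules would apply to the environment label, and to each child-cluster federation job.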
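The quoted message observes that the child labels are lost whenever the rule expression aggregates. As a purely illustrative sketch (this workaround is not proposed in the thread, and the "* on(namespace, pod) ..." join half of the original rule is omitted for brevity), keeping the federated cluster and environment labels in the grouping clause would preserve them, since Prometheus only attaches an external label to an outgoing alert when a label of that name is not already present:

```promql
# Before: only namespace and pod survive the aggregation, so the alert
# later receives the evaluating (utils) Prometheus's external labels.
sum by(namespace, pod) (
  kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Unknown"}
) > 0

# After: the child's cluster/environment labels (kept on the federated
# series by honor_labels) survive evaluation and are not overridden.
sum by(cluster, environment, namespace, pod) (
  kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Unknown"}
) > 0
```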

