Yes, you've got it.  It's easy to test your hypothesis: simply paste the 
alert rule expression

    100 - (avg by(instance,cluster)
          (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95

into the PromQL query browser in the Prometheus web interface, and you'll
see all the results - including their labels.
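
If you'd rather check from the command line, here's a rough sketch using
the HTTP API (the hostname and port are assumptions for your setup):

    curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95'

That returns the same result set as JSON, labels included.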

I believe you'll get results like

    {instance="foo",cluster="bar"}  98.4

There won't be any "env" label there because you've aggregated it away.

Try using: *avg by(instance,cluster,env)* instead.
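
i.e. something along these lines (a sketch, keeping your 95% threshold):

    100 - (avg by(instance,cluster,env)
          (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95

The "env" label then survives the aggregation and is available for routing.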

Or you could have separate alerting rules per environment, and re-apply the 
label in your rule:

    expr: 100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{env="dev",mode="idle"}[2m])) * 100) > 98
    labels:
      env: dev
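
As a rough sketch of a complete rule group (the rule names and the prod
threshold are my assumptions, not something you have to copy):

    groups:
    - name: cpu
      rules:
      - alert: High_Cpu_Load_Dev    # hypothetical name
        expr: 100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{env="dev",mode="idle"}[2m])) * 100) > 98
        labels:
          severity: warning
          env: dev
      - alert: High_Cpu_Load_Prod   # hypothetical name
        expr: 100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{env="prod",mode="idle"}[2m])) * 100) > 95
        labels:
          severity: warning
          env: prod

Since "env" is set statically on each rule, the routing in alertmanager no
longer depends on what labels the query returns.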

On Monday, 22 August 2022 at 21:21:51 UTC+1 rs wrote:

> Thanks Brian, I am in the midst of setting up a slack receiver (to weed 
> out the alerts going to the wrong channel). One thing I have noticed is, 
> the alerts being routed incorrectly may actually have to do with a rule:
>
> - alert: High_Cpu_Load
>   expr: 100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95
>   for: 0m
>   labels:
>     severity: warning
>   annotations:
>     summary: Host high CPU load (instance {{ $labels.instance }})
>     description: "CPU load is > 95%\n INSTANCE = {{ $labels.instance }}\n VALUE = %{{ $value | humanize }}\n LABELS = {{ $labels }}"
>
> I believe the issue may be that I'm not passing in 'env' into the 
> expression and that is causing an issue with the alerts. Just a hunch, but 
> I appreciate you pointing me in the right direction!
>
> On Monday, August 22, 2022 at 3:06:47 PM UTC-4 Brian Candler wrote:
>
>> "Looks correct but still doesn't work how I expect"
>>
>> What you've shown is a target configuration, not an alert arriving at 
>> alertmanager.
>>
>> Therefore, I'm suggesting you take a divide-and-conquer approach.  First, 
>> work out which of your receiver routing rules is being triggered (is it the 
>> 'production' receiver, or is it the 'slack' receiver?) by making them 
>> different.  This will point to which routing rule is or isn't being 
>> triggered.  And then you can work out why.
>>
>> There are all sorts of reasons it might not work, other than the config 
>> you've shown.  For example, if you have any target rewriting or metric 
>> rewriting rules which set the env; if the exporter itself sets "env" and 
>> you have honor_labels set; and so on.
>>
>> Hence the first part is to find out from real alert events: is the alert 
>> being generated without the "dev" label? In that case alert routing is just 
>> fine, and you need to work out why that label is wrong (and you're looking 
>> at the prometheus side). Or is the alert actually arriving at alertmanager 
>> with the "dev" label, in which case you're looking at the alertmanager side 
>> to find out why it's not being routed as expected.
>>
>> On Monday, 22 August 2022 at 18:45:25 UTC+1 rs wrote:
>>
>>> I checked the json file and the tagging was correct. Here's an example:
>>>
>>>    {
>>>        "labels": {
>>>            "cluster": "X Stage Servers",
>>>            "env": "dev"
>>>        },
>>>        "targets": [
>>>            "x:9100",
>>>            "y:9100",
>>>            "z:9100"
>>>        ]
>>>    },
>>> This is being sent to the production/default channel.
>>>
>>> On Friday, August 12, 2022 at 11:29:34 AM UTC-4 Brian Candler wrote:
>>>
>>>> Firstly, I'd drop the "continue: true" lines. They are not required, 
>>>> and are just going to cause confusion.
>>>>
>>>> The 'slack' and 'production' receivers are both sending to 
>>>> #prod-channel.  So you'll hit this if the env is not exactly "dev".  I 
>>>> suggest you look in detail at the alerts themselves: maybe they're tagging 
>>>> with "Dev" or "dev " (with a hidden space).
>>>>
>>>> If you change the default 'slack' receiver to go to a different 
>>>> channel, or use a different title/text template, it will be easier to see 
>>>> if this is the problem or not.
>>>>
>>>>
>>>> On Friday, 12 August 2022 at 09:36:22 UTC+1 rs wrote:
>>>>
>>>>> Hi everyone! I am configuring alertmanager to send outputs to a prod 
>>>>> slack channel and dev slack channel. I have checked with the routing tree 
>>>>> editor and everything should be working correctly. 
>>>>> However, I am seeing some (not all) alerts that are tagged with 'env: 
>>>>> dev' being sent to the prod slack channel. Is there some sort of old 
>>>>> configuration caching happening? Is there a way to flush this out?
>>>>>
>>>>> --- Alertmanager.yml ---
>>>>> global:
>>>>>   http_config:
>>>>>     proxy_url: 'xyz'
>>>>> templates:
>>>>>   - templates/*.tmpl
>>>>> route:
>>>>>   group_by: [cluster,alertname]
>>>>>   group_wait: 10s
>>>>>   group_interval: 30m
>>>>>   repeat_interval: 24h
>>>>>   receiver: 'slack'
>>>>>   routes:
>>>>>   - receiver: 'production'
>>>>>     match:
>>>>>       env: 'prod'
>>>>>     continue: true
>>>>>   - receiver: 'staging'
>>>>>     match:
>>>>>       env: 'dev'
>>>>>     continue: true
>>>>> receivers:
>>>>> #Fallback option - Default set to production server
>>>>> - name: 'slack'
>>>>>   slack_configs:
>>>>>   - api_url: 'api url'
>>>>>     channel: '#prod-channel'
>>>>>     send_resolved: true
>>>>>     color: '{{ template "slack.color" . }}'
>>>>>     title: '{{ template "slack.title" . }}'
>>>>>     text: '{{ template "slack.text" . }}'
>>>>>     actions:
>>>>>       - type: button
>>>>>         text: 'Query :mag:'
>>>>>         url: '{{ (index .Alerts 0).GeneratorURL }}'
>>>>>       - type: button
>>>>>         text: 'Silence :no_bell:'
>>>>>         url: '{{ template "__alert_silence_link" . }}'
>>>>>       - type: button
>>>>>         text: 'Dashboard :grafana:'
>>>>>         url: '{{ (index .Alerts 0).Annotations.dashboard }}'
>>>>> - name: 'staging'
>>>>>   slack_configs:
>>>>>   - api_url: 'api url'
>>>>>     channel: '#staging-channel'
>>>>>     send_resolved: true
>>>>>     color: '{{ template "slack.color" . }}'
>>>>>     title: '{{ template "slack.title" . }}'
>>>>>     text: '{{ template "slack.text" . }}'
>>>>>     actions:
>>>>>       - type: button
>>>>>         text: 'Query :mag:'
>>>>>         url: '{{ (index .Alerts 0).GeneratorURL }}'
>>>>>       - type: button
>>>>>         text: 'Silence :no_bell:'
>>>>>         url: '{{ template "__alert_silence_link" . }}'
>>>>>       - type: button
>>>>>         text: 'Dashboard :grafana:'
>>>>>         url: '{{ (index .Alerts 0).Annotations.dashboard }}'
>>>>> - name: 'production'
>>>>>   slack_configs:
>>>>>   - api_url: 'api url'
>>>>>     channel: '#prod-channel'
>>>>>     send_resolved: true
>>>>>     color: '{{ template "slack.color" . }}'
>>>>>     title: '{{ template "slack.title" . }}'
>>>>>     text: '{{ template "slack.text" . }}'
>>>>>     actions:
>>>>>       - type: button
>>>>>         text: 'Query :mag:'
>>>>>         url: '{{ (index .Alerts 0).GeneratorURL }}'
>>>>>       - type: button
>>>>>         text: 'Silence :no_bell:'
>>>>>         url: '{{ template "__alert_silence_link" . }}'
>>>>>       - type: button
>>>>>         text: 'Dashboard :grafana:'
>>>>>         url: '{{ (index .Alerts 0).Annotations.dashboard }}'
>>>>>
>>>>
