This has all been explained to you in another thread.

Read your config.  You have written:

  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['Site','instance']


Think about it for a moment.  This says that:

1. this rule will only suppress alerts which have label severity=warning

2. the alerts which perform the suppressing must have label 
severity=critical

3. the source alert will only suppress the target alert if both the Site 
and instance labels are identical

But that's not what you want to do.  You've just copied an example, but the 
example doesn't do what you want.  Your requirements don't mention 
"warning" or "critical" alerts, and the alerting rules you've shown don't 
set a severity label, so this inhibit rule won't match them.  Furthermore, 
you're requiring an equal "instance" label, but since your alerting rules 
all use sum(...) by (Site), the resulting alerts will only have a Site 
label and no instance label.  Note that a missing 'instance' label on both 
source and target counts as being 'equal', so listing it does no harm, but 
it's confusing and unnecessary.

So start with your description in English.  *Is there a way I can ignore 
the matched targets from "PowerOutageAlert" on the 
"LinuxGroup/WindowsGroup" alerts?*

Yes, write a rule 
<https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule> 
which says that.  A starting point might be like this (using the more 
modern "matchers 
<https://prometheus.io/docs/alerting/latest/configuration/#matcher>" 
syntax):

inhibit_rules:
  - source_matchers:
      - alertname=PowerOutageAlert
    target_matchers:
      - alertname=~"Linux Servers Down|Windows Servers Down"
    equal: ['Site']

I'm not guaranteeing that will work, because I can't write it properly 
without seeing examples of the *actual alerts* with *all their labels*.  As 
I said in the other thread, you simply go to the Prometheus web interface 
or the Alertmanager web interface to see these.  Once you can see an 
example of an alert that you want to suppress, together with an alert that 
should do the suppression, you can easily write an inhibit rule which 
inhibits the first by the second.

I'm also wondering about your alerting rules.  Given that you've aggregated 
all the labels away apart from Site, I'm not sure *exactly* what you're 
trying to do with alerting.  I think you only want the Linux Servers Down 
alert to fire if *all* the Linux servers in a site have gone down, is that 
right?

That's OK, although it's not what people normally do; normally they 
generate a separate alert for each server, and then use alertmanager 
grouping so that a single alert message gets sent out, listing all the 
servers that are down.
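
For illustration, a minimal sketch of that per-server approach might look 
like the following.  The group name "PerServerAlerts" and the alert name 
"LinuxServerDown" are made up, and it assumes each blackbox target carries 
a Site label; I haven't tested it against your setup.

```yaml
# --- prometheus_rules.yml (hypothetical group and alert names) ---
groups:
  - name: PerServerAlerts
    rules:
      - alert: LinuxServerDown
        # No sum(...) here, so one alert fires per failed target and
        # each alert keeps its own instance and Site labels.
        expr: probe_success{job="blackbox_linux"} == 0
        for: 1m

# --- alertmanager.yml (route section only) ---
route:
  receiver: ms-teams
  # Alerts sharing the same alertname and Site are batched into a
  # single notification that lists every affected server.
  group_by: ['alertname', 'Site']
```

Because the expression only looks at the Linux probes, it doesn't depend 
on whether the routers are up or down.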

Now, if you don't want to get alerts for individual servers going down, but 
only if *all* servers have gone down, that's a perfectly reasonable 
requirement.  But then that's such a major outage, I wouldn't want to be 
doing alert suppression.  I think I'd be doing grouping again.

What you can do is add a label like "severity=MajorOutage" to each of these 
alerts, and then group them on this label.
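
As a rough sketch, that could look like this.  Only one of the three rules 
is shown (the other two would get the same label), and grouping by Site in 
addition to severity is my assumption, so that each site gets its own 
message.

```yaml
# --- prometheus_rules.yml (one of the three rules) ---
groups:
  - name: LinuxGroup
    rules:
      - alert: Linux Servers Down
        expr: |
          sum(probe_success{job="blackbox_linux"} or
              probe_success{job="blackbox_router-1"}) by (Site) == 0
        for: 1m
        labels:
          severity: MajorOutage   # same label on all three outage alerts

# --- alertmanager.yml (route section only) ---
route:
  receiver: ms-teams
  # All MajorOutage alerts for the same Site land in one notification.
  group_by: ['severity', 'Site']
```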

Then you'll get a single alert message, which contains a summary of all the 
information in one place:

- all my Linux servers have gone down
- all my Windows servers have gone down
- there's a power outage

A human being can quickly deduce the connection between these statements.  
And it's simpler than trying to suppress two major outage alerts because of 
a third major outage.  But either way will work.
On Monday, 5 September 2022 at 23:57:21 UTC+1 [email protected] wrote:

> Hi,
>
> I am new to Prometheus and looking for some guidance on how to get my 
> Prometheus rules to work for the requirements below. 
>
> In my environment, all Linux servers are connected to the Router-1 group 
> and all Windows servers are connected to the Router-2 group. I have 
> configured the Prometheus rules based on the requirements below.
>
>    1. When there is a complete outage on a site, it needs to tell just 
>    the site numbers where all the targets are down. So I have configured a 
>    rule "*PowerOutageAlert*" and this is working fine as expected.
>    2. When the Linux server is down in a site, it needs to show which 
>    site linux servers are down. So I have configured a rule "*LinuxGroup*" 
>    and this is also working fine as expected. 
>    3. When the Windows server is down in a site, it needs to show which 
>    site Windows servers are down. So I have configured a rule "
>    *WindowsGroup*" and this is also working fine as expected. 
>
>
> *prometheus_rules.yml:*
>
> groups:
>  - name: PowerOutageAlert
>    rules:
>    - alert: PowerOutageAlert
>      expr: |
>        sum(probe_success{job="blackbox_linux"} or
>            probe_success{job="blackbox_windows"} or
>            probe_success{job="blackbox_router-1"} or
>            probe_success{job="blackbox_router-2"}) by (Site) == 0
>      for: 1m
>  - name: LinuxGroup
>    rules:
>    - alert: Linux Servers Down
>      expr: |
>        sum(probe_success{job="blackbox_linux"} or
>            probe_success{job="blackbox_router-1"}) by (Site) == 0
>      for: 1m
>  - name: WindowsGroup
>    rules:
>    - alert: Windows Servers Down
>      expr: |
>        sum(probe_success{job="blackbox_windows"} or
>            probe_success{job="blackbox_router-2"}) by (Site) == 0
>      for: 1m
>
> *Alertmanager.yml:*
>
> route:
>   group_by: ['alertname']
>   receiver: ms-teams
>   group_wait: 1m
>   group_interval: 1m
>   repeat_interval: 1m
> receivers:
> - name: ms-teams
>   webhook_configs:
>     - url: 'http://xx.xx.xx.xx:2000/alertmanager'
>       send_resolved: false
> inhibit_rules:
>   - source_match:
>       severity: 'critical'
>     target_match:
>       severity: 'warning'
>     equal: ['Site','instance']
>
> The issue I am facing now is:
>
>    1. When there is a complete outage on a site, I am getting 3 alerts (
>    *PowerOutageAlert*/*LinuxGroup*/*WindowsGroup*) for the same targets 
>    based on the above configuration. Is there a way I can ignore the matched 
>    targets from "*PowerOutageAlert*" on the "*LinuxGroup*/*WindowsGroup*" 
>    alerts?
>    2. With the above setup for "*LinuxGroup*/*WindowsGroup*", it will 
>    alert only if both "blackbox_router-1" and "blackbox_linux" (or both 
>    "blackbox_router-2" and "blackbox_windows") go down. It won't alert 
>    if just the Linux/Windows servers are down. How can I get alerts for 
>    the servers even if the routers are up?
>
>
> In a shell script I can achieve this by using "if else" conditions, but I 
> am not sure how to apply the same logic in Prometheus. Any help is 
> really appreciated.
>
>
> Thanks
> Sandosh
>
