Hi,

I am new to Prometheus and am looking for guidance on how to get my 
Prometheus rules to work for the requirements below. 

In my environment, all Linux servers are connected to the Router-1 group and 
all Windows servers are connected to the Router-2 group. I have configured the 
Prometheus rules based on the following requirements.

   1. When there is a complete outage at a site, the alert should report only 
   the site numbers where all targets are down. For this I configured the 
   rule "*PowerOutageAlert*", which works as expected.
   2. When the Linux servers at a site are down, the alert should show which 
   site's Linux servers are down. For this I configured the rule 
   "*LinuxGroup*", which also works as expected. 
   3. When the Windows servers at a site are down, the alert should show 
   which site's Windows servers are down. For this I configured the rule 
   "*WindowsGroup*", which also works as expected. 


*prometheus_rules.yml:*

groups:
 - name: PowerOutageAlert
   rules:
   - alert: PowerOutageAlert
     expr: |
       sum(
         probe_success{job="blackbox_linux"} or
         probe_success{job="blackbox_windows"} or
         probe_success{job="blackbox_router-1"} or
         probe_success{job="blackbox_router-2"}
       ) by (Site) == 0
     for: 1m
 - name: LinuxGroup
   rules:
   - alert: Linux Servers Down
     expr: |
       sum(
         probe_success{job="blackbox_linux"} or
         probe_success{job="blackbox_router-1"}
       ) by (Site) == 0
     for: 1m
 - name: WindowsGroup
   rules:
   - alert: Windows Servers Down
     expr: |
       sum(
         probe_success{job="blackbox_windows"} or
         probe_success{job="blackbox_router-2"}
       ) by (Site) == 0
     for: 1m

*Alertmanager.yml:*

route:
  group_by: ['alertname']
  receiver: ms-teams
  group_wait: 1m
  group_interval: 1m
  repeat_interval: 1m
receivers:
- name: ms-teams
  webhook_configs:
    - url: 'http://xx.xx.xx.xx:2000/alertmanager'
      send_resolved: false
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['Site','instance']

The issues I am facing now are:

   1. When there is a complete outage at a site, I get three alerts 
   (*PowerOutageAlert*, *LinuxGroup*, and *WindowsGroup*) for the same 
   targets with the above configuration. Is there a way to exclude the 
   targets already matched by "*PowerOutageAlert*" from the "*LinuxGroup*" 
   and "*WindowsGroup*" alerts?
   2. With the above setup, "*LinuxGroup*" and "*WindowsGroup*" alert only 
   when both the router and the servers behind it are down 
   ("blackbox_router-1" together with "blackbox_linux", or 
   "blackbox_router-2" together with "blackbox_windows"). They do not alert 
   when only the Linux or Windows servers are down. How can I get these 
   alerts even when the routers are up?
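
To make the two questions concrete, here is a minimal, untested sketch of what the Linux rule might look like. It assumes that dropping the router job from the sum is acceptable (so the rule fires whenever all Linux probes at a site fail, whether or not Router-1 is up), and that PromQL's "unless on (Site)" can be used to drop sites where every probe is already down. The Windows rule would presumably follow the same pattern with "blackbox_windows".

```yaml
groups:
  - name: LinuxGroup
    rules:
      - alert: Linux Servers Down
        expr: |
          # Sites where every Linux probe is failing (router state ignored).
          (sum by (Site) (probe_success{job="blackbox_linux"}) == 0)
          # Drop sites where ALL probes are failing; those are assumed to be
          # covered by PowerOutageAlert instead.
          unless on (Site)
          (sum by (Site) (probe_success{job=~"blackbox_linux|blackbox_windows|blackbox_router-1|blackbox_router-2"}) == 0)
        for: 1m
```

One consequence of this sketch: if Router-1 fails and the Linux probes fail because of it, this alert would still fire, since the router is no longer part of the condition.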


In a shell script I could achieve this with "if else" conditions, but I am 
not sure how to express the same logic in Prometheus. Any help is much 
appreciated.
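
The "if else" behaviour I have in mind is roughly "if PowerOutageAlert is firing for a site, do not send the per-group alerts for that site". A possible way to express this on the Alertmanager side is an inhibit rule keyed on the alert names rather than on severity, since none of the rules above set a severity label. This is an unverified sketch; it assumes all three alerts carry the same "Site" label and that the alert names match exactly as written in the rules file.

```yaml
inhibit_rules:
  # While PowerOutageAlert is firing for a Site, mute the per-group alerts
  # that share the same Site label value.
  - source_match:
      alertname: 'PowerOutageAlert'
    target_match_re:
      alertname: 'Linux Servers Down|Windows Servers Down'
    equal: ['Site']
```

Newer Alertmanager releases prefer "source_matchers" and "target_matchers" over these keys; the older form is used here only to match the existing configuration.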


Thanks
Sandosh
