Hey.

I have some trouble understanding how to do things right™ with respect
to alerting.

In principle I'd like to do two things:

a) have certain alert rules run only for certain instances
   (though that may in practice actually be less needed, when only the
   respective nodes would generate the respective metrics - not sure
   yet whether this will be the case)
b) silence certain (or all) alerts for a given set of instances,
   e.g. nodes where I'm not an admin who can take action on an
   incident, but can only view the time series graphs to see what's
   going on


As an example I'll take an alert that fires when the root fs has >85%
usage:
   groups:
     - name: node_alerts
       rules:
         - alert: node_free_fs_space
           expr: >
             100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100)
               / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85



With respect to (a):
I could of course add yet another label matcher like
   instance=~"someRegexThatDescribesMyInstances"
to each time series selector, but when that regex gets more complex,
everything becomes quite unreadable, and it's quite error-prone to
forget a place (assuming one has many alerts) when the regex changes.

Is there some way of defining host groups or so? I.e. a central place
where I could define the list of hosts (or a regex for them) and just
use the name of that definition in the actual alert rules?


With respect to (b):
Similarly to the above: if I had various instances for which I never
wanted to see any alerts, I could of course add a regex to all my
alerts.
But it seems quite ugly to clutter up all the rules just for a
potentially long list/regex of things I don't want to see anyway.

Another idea I had was to do the filtering/silencing in the
Alertmanager config at route level:
e.g. by adding an "ignore" route that matches via regex on all the
instances I'd like to silence (and has a mute time interval covering
24/7), placed before any other routes.
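
Roughly like this (the regex and the interval name are placeholders):

   route:
     receiver: default
     routes:
       # "ignore" route; must come before the other routes
       - matchers:
           - instance =~ "(viewonly|lab)[0-9]+:9100"
         mute_time_intervals:
           - always
       # ... other routes ...

   time_intervals:
     - name: always
       time_intervals:
         - weekdays: ['monday:sunday']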

But AFAIU this would only suppress the notification (e.g. mail); the
alert would still show up in the Alertmanager web UI etc. as firing.



Not sure whether anything can be done better by adding labels at some
stage:
- Setting external_labels: in the Prometheus config doesn't seem to
  help here (only static values?).
- Same for labels: in <static_config> in the Prometheus config.
- Setting some "noalerts" label via <relabel_config> in the Prometheus
  config would also store that label in the TSDB, right?
  That I'd rather avoid.
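
  I mean something like this (the hostname regex is a placeholder),
  which AFAIU would attach noalerts="yes" to every series scraped
  from the matching targets:

     scrape_configs:
       - job_name: node
         relabel_configs:
           - source_labels: [__address__]
             regex: '(viewonly|lab)[0-9]+:9100'
             target_label: noalerts
             replacement: 'yes'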

- Maybe using:
    alerting:
      alert_relabel_configs:
        - <relabel_config>
  would work? Like matching hostnames on the instance label and
  setting e.g. "yes" in some "noalerts" target label?
  And then somehow using that in the alert rules...

  But that also sounds a bit ugly, TBH.
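
  Concretely I was thinking of something like (placeholder regex
  again):

     alerting:
       alert_relabel_configs:
         - source_labels: [instance]
           regex: '(viewonly|lab)[0-9]+:9100'
           target_label: noalerts
           replacement: 'yes'

  though AFAIU these relabelings are only applied to alerts on their
  way to the Alertmanager, i.e. after rule evaluation.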


So... what's the proper way to do this? :-)


Thanks,
Chris.


btw: Is there any difference between
1) alerting:
     alert_relabel_configs:
       - <relabel_config>
and
2) the relabel_configs: in <alertmanager_config>?

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/03619286babc6b2ee9d3295e235016b4e3b383ca.camel%40gmail.com.
