P.S. Your expression

>    expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"} * 100)
>            / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"}) >= 85

can be simplified to:

>    expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"} * 100)
>            / node_filesystem_size_bytes) >= 85

That's because the result instant vector for an expression like "foo / bar" 
only includes entries where the label sets match on the left- and right-hand 
sides; any others are dropped silently.  (This form may be slightly less 
efficient, but I wouldn't expect it to be a problem unless you have 
hundreds of thousands of filesystems.)

I would be inclined to simplify it further to:

>    expr: node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"}
>            / node_filesystem_size_bytes < 0.15

You can use {{ $value | humanizePercentage }} in your alert annotations to 
show readable percentages.
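
For example, pairing it with the ratio form above (the alert name, the "for"
duration and the annotation wording are just placeholders):

    - alert: RootFilesystemLowSpace
      expr: |
        node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"}
          / node_filesystem_size_bytes < 0.15
      for: 10m
      annotations:
        summary: 'Only {{ $value | humanizePercentage }} free on {{ $labels.instance }} {{ $labels.mountpoint }}'

Since the expression's value is the avail/size ratio, humanizePercentage
renders it as e.g. "12.3%".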

On Wednesday, 26 April 2023 at 08:14:35 UTC+1 Brian Candler wrote:

> > I guess with (2) you also meant having a route which is then permanently muted?
>
> I'd use a route with a null receiver (i.e. a receiver which has no 
> <transport>_configs under it)
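>
> A minimal sketch of that, as a fragment of alertmanager.yml (the matcher
> label/value and receiver name are just placeholders; the "matchers" syntax
> needs a reasonably recent Alertmanager, older versions use "match"):
>
>     route:
>       routes:
>         - matchers:
>             - owner="foreign"
>           receiver: blackhole
>     receivers:
>       - name: blackhole
>
> A receiver defined with only a name and no <transport>_configs silently
> drops whatever is routed to it.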
>
> > b) The idea that I had above:
> > - using <alert_relabel_configs> to filter on the instances and add a label if it should be silenced
> > - use only that label in the expr instead of the full regex
> > But would that even work?
>
> No, because as far as I know alert_relabel_configs is done *after* the 
> alert is generated from the alerting rule. It's only used to add extra 
> labels before sending the generated alert to alertmanager. (It occurs to me 
> that it *might* be possible to use 'drop' rules here to discard alerts; 
> that would be a very confusing config IMO)
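>
> If you did want to go down that route (I wouldn't), a drop rule would look
> roughly like this - the label and regex are only illustrative:
>
>     alerting:
>       alert_relabel_configs:
>         - source_labels: [instance]
>           regex: 'foreign-.*'
>           action: drop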
>
> > For me it's really like this:
> > My Prometheus instance monitors:
> > - my "own" instances, where I need to react on things like >85% usage on root filesystem (and thus want to get an alert)
> > - "foreign" instances, where I just get the node exporter data and show e.g. CPU usage, IO usage, and so on as a convenience to users of our cluster - but any alert conditions wouldn't cause any further action on my side (and the guys in charge of those servers have their own monitoring)
>
> In this situation, and if you are using static_configs or file_sd_configs 
> to identify the hosts, then I would simply use a target label (e.g. 
> "owner") to distinguish which targets are yours and which are foreign; or I 
> would use two different scrape jobs for self and foreign (which means the 
> "job" label can be used to distinguish them)
>
> The storage cost of having extra labels in the TSDB is essentially zero, 
> because it's the unique combination of labels that identifies the 
> timeseries - the bag of labels is mapped to an integer ID I believe.  So 
> the only problem is if this label changes often, and to me it sounds like a 
> 'local' or 'foreign' instance remains this way indefinitely.
>
> If you really want to keep these labels out of the metrics, then having a 
> separate timeseries with metadata for each instance is the next-best 
> option. Suppose you have a bunch of metrics with an 'instance' label, e.g.
>
> node_filesystem_free_bytes{instance="bar", ....}
> node_filesystem_size_bytes{instance="bar", ....}
> ...
>
> as the actual metrics you're monitoring, then you create one extra static 
> timeseries per host (instance) like this:
>
> meta{instance="bar",owner="self",site="london"} 1
>
> (aside: TSDB storage for this will be almost zero, because of the 
> delta-encoding used). These can be created by scraping a static webserver, 
> or by using recording rules.
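>
> The static-webserver variant is just a file in the exposition format, e.g.
> (hostnames and sites are placeholders):
>
>     meta{instance="bar",owner="self",site="london"} 1
>     meta{instance="baz",owner="foreign",site="paris"} 1
>
> scraped with honor_labels: true so that the instance labels inside the file
> are kept rather than being rewritten to the target address.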
>
> Then your alerting rules can be like this:
>
> expr: |
>   (
>      ... normal rule here ...
>   ) * on(instance) group_left(site) meta{owner="self"}
>
> The join will:
> * Limit alerting to those hosts which have a corresponding 'meta' 
> timeseries (matched on 'instance') and which has label owner="self"
> * Add the "site" label to the generated alerts
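>
> Putting it together with the filesystem rule above, that would look roughly
> like this (the threshold and label values are only illustrative):
>
>     expr: |
>       (
>         node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"}
>           / node_filesystem_size_bytes < 0.15
>       ) * on(instance) group_left(site) meta{owner="self"}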
>
> Beware that:
>
> 1. this will suppress alerts for any host which does not have a 
> corresponding 'meta' timeseries. It's possible to work around this to 
> default to sending rather than not sending alerts, but it makes the 
> expressions more complex:
> https://www.robustperception.io/left-joins-in-promql
>
> 2.  the "instance" labels must match exactly. So for example, if you're 
> currently scraping with the default label instance="foo:9100" then you'll 
> need to change this to instance="foo" (which is good practice anyway).  See
> https://www.robustperception.io/controlling-the-instance-label
>
> (I use some relabel_configs tricks for this; examples posted in this group 
> previously)
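>
> One common pattern (a sketch only - adjust the regex to your port scheme) is
> to strip the port when setting the instance label:
>
>     relabel_configs:
>       - source_labels: [__address__]
>         regex: '([^:]+)(?::\d+)?'
>         target_label: instance
>         replacement: '$1'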
>
> > From all that it seems to me that the "best" solution is either:
> > a) simply making more complex and error prone alert rules, that filter 
> out the instances in the first place, like in:
> >    expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"} * 100)
> >            / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"}) >= 85
>
> That's not great, because as you observe it will become more and more 
> complex over time; and in any case won't work if you want to treat certain 
> combinations of labels differently (e.g. stop alerting on a specific 
> *filesystem* on a specific host)
>
> If you really don't want to use either of the solutions I've given above, 
> then another way is to write some code to preprocess your alerting rules, 
> i.e. expand a single template rule into a bunch of separate rules, based on 
> your own templates and data sources.
>
> HTH,
>
> Brian.
>
>
