On Wednesday, April 26, 2023 at 9:14:35 AM UTC+2 Brian Candler wrote:

> I guess with (2) you also meant having a route which is then permanently muted?

I'd use a route with a null receiver (i.e. a receiver which has no 
<transport>_configs under it)


Ah, interesting. It wasn't even clear to me from the documentation that this 
works, but as you say, it does.

Nevertheless, it only suppresses the alert notifications; within the 
Alertmanager the alerts would still show up as firing (as expected).
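
For the record, a minimal sketch of what such a route/receiver pair could 
look like (the receiver names and the matcher label are made up; older 
Alertmanager versions would use match: instead of matchers:):

route:
  receiver: default
  routes:
    - receiver: blackhole      # "null" receiver: matching alerts are never notified
      matchers:
        - owner="foreign"

receivers:
  - name: default
    email_configs:
      - to: alerts@example.org
  - name: blackhole            # no <transport>_configs at all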

 

> b) The idea that I had above:
> - using <alert_relabel_configs> to filter on the instances and add a
>   label if it should be silenced
> - use only that label in the expr instead of the full regex
> But would that even work?

No, because as far as I know alert_relabel_configs is done *after* the 
alert is generated from the alerting rule.


I had already assumed so from the documentation... thanks for the confirmation.

 

It's only used to add extra labels before sending the generated alert to 
alertmanager. (It occurs to me that it *might* be possible to use 'drop' 
rules here to discard alerts; that would be a very confusing config IMO)


What do you mean by drop rules?
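Something like the following, perhaps? (Just my guess at what that would 
look like - the label and regex are made up:)

alerting:
  alert_relabel_configs:
    - source_labels: [instance]
      regex: 'foreign-.*'      # alerts whose instance matches would be dropped
      action: drop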

 

> For me it's really like this:
> My Prometheus instance monitors:
> - my "own" instances, where I need to react to things like >85% usage on
>   the root filesystem (and thus want to get an alert)
> - "foreign" instances, where I just get the node exporter data and show
>   e.g. CPU usage, IO usage, and so on as a convenience to users of our
>   cluster - but any alert conditions wouldn't cause any further action on
>   my side (and the guys in charge of those servers have their own
>   monitoring)

In this situation, and if you are using static_configs or file_sd_configs 
to identify the hosts, then I would simply use a target label (e.g. 
"owner") to distinguish which targets are yours and which are foreign; or I 
would use two different scrape jobs for self and foreign (which means the 
"job" label can be used to distinguish them)


I had thought about that too, but the downside of it would be that I have 
to "hardcode" this into the labels within the TSDB. Even if storage is not 
a concern, it might happen from time to time that a formerly "foreign" 
server moves into my responsibility.
Then I think things would get messy.

In general, TBH, it's also not really clear to me what the best practice is 
in terms of scrape jobs:

At one time I planned to use them to "group" servers that somehow belong 
together, e.g. in the case of a job for node exporter data I would have 
made node_storage_servers, node_compute_servers or something like that.
But then I felt this could actually cause trouble later on, e.g. when I 
want to filter time series based on the job (or, as above, when a server 
changes its role).

So right now I put everything (from one exporter) in one job.
Not really sure whether this is stupid or not ;-)

 

The storage cost of having extra labels in the TSDB is essentially zero, 
because it's the unique combination of labels that identifies the 
timeseries - the bag of labels is mapped to an integer ID I believe.  So 
the only problem is if this label changes often, and to me it sounds like a 
'local' or 'foreign' instance remains this way indefinitely.


Arguably, for this particular use case, it would be rather rare that it 
changes.
But for the node_storage_servers vs. node_compute_servers case... it would 
actually happen quite often in my environment.
 

If you really want to keep these labels out of the metrics, then having a 
separate timeseries with metadata for each instance is the next-best 
option. Suppose you have a bunch of metrics with an 'instance' label, e.g.

node_filesystem_free_bytes{instance="bar", ...}
node_filesystem_size_bytes{instance="bar", ...}
...

as the actual metrics you're monitoring, then you create one extra static 
timeseries per host (instance) like this:

meta{instance="bar",owner="self",site="london"} 1

(aside: TSDB storage for this will be almost zero, because of the 
delta-encoding used). These can be created by scraping a static webserver, 
or by using recording rules.
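
(A sketch of what the static-webserver variant could look like - a 
plain-text file in the exposition format plus a scrape job for it, with 
made-up names; honor_labels keeps the instance label from the file instead 
of overwriting it with the target's address:)

# meta.prom, served by any static webserver
meta{instance="bar",owner="self",site="london"} 1
meta{instance="baz",owner="foreign",site="berlin"} 1

scrape_configs:
  - job_name: meta
    honor_labels: true
    metrics_path: /meta.prom
    static_configs:
      - targets: ['metaserver.example:8080']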

Then your alerting rules can be like this:

expr: |
  (
     ... normal rule here ...
  ) * on(instance) group_left(site) meta{owner="self"}

The join will:
* Limit alerting to those hosts which have a corresponding 'meta' 
timeseries (matched on 'instance') with the label owner="self"
* Add the "site" label to the generated alerts

Beware that:

1. this will suppress alerts for any host which does not have a 
corresponding 'meta' timeseries. It's possible to work around this so that 
the default is to send rather than not send alerts, but it makes the 
expressions more complex:
https://www.robustperception.io/left-joins-in-promql

2.  the "instance" labels must match exactly. So for example, if you're 
currently scraping with the default label instance="foo:9100" then you'll 
need to change this to instance="foo" (which is good practice anyway).  See
https://www.robustperception.io/controlling-the-instance-label


That's a pretty neat idea. So meta would basically serve as a way to encode 
information about instance grouping/owner/etc. in a metric?

This should go into some howto or BCP document.
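
Regarding your point (1): I guess the "default to alerting" variant would 
look roughly like this (my naive attempt using "unless" - though that way 
the "site" label from meta wouldn't get added):

expr: |
  (
     ... normal rule here ...
  )
  unless on(instance) meta{owner!="self"}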
 

(I use some relabel_configs tricks for this; examples posted in this group 
previously)


For the port removal, I came up with my own solution (based on some others 
I had found)... and asked for it to be included in the documentation:
https://github.com/prometheus/docs/issues/2296

Not sure if you'd find mine proper ;-) 
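
(The general idea being the usual relabel trick of stripping the port when 
setting the instance label - roughly like this, though not necessarily 
exactly what ended up in the issue:)

relabel_configs:
  - source_labels: [__address__]
    regex: '(.*):\d+'
    target_label: instance
    replacement: '$1'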

 

> From all that it seems to me that the "best" solution is either:
> a) simply making more complex and error-prone alert rules that filter out
>    the instances in the first place, like:
>    expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"} * 100)
>            / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"}) >= 85

That's not great, because as you observe it will become more and more 
complex over time; and in any case won't work if you want to treat certain 
combinations of labels differently (e.g. stop alerting on a specific 
*filesystem* on a specific host)

If you really don't want to use either of the solutions I've given above, 
then another way is to write some code to preprocess your alerting rules, 
i.e. expand a single template rule into a bunch of separate rules, based on 
your own templates and data sources.


It would be nice if Prometheus had some more built-in means for things like 
this. I mean, your solution with meta above is nice, but it also adds some 
complexity which someone reading my config would first need to 
understand... plus there are, as you said, pitfalls, like your point (1) 
above.
 

HTH,


It did, and was greatly appreciated :-)

Thanks,
Chris.

PS: Thanks also for your hints on how to improve the expression (which I 
had merely copied and pasted from the full node exporter Grafana 
dashboard). O:-)
