> [...]
> (1) as I said, it's a bad way to build a monitoring system

It is of course cleaner, in a theoretical way, to place the thresholds in a 
separate location, and not change the disk metrics every time you relabel 
them in order to move a disk to a different threshold alert.

But I am not convinced that your solution is better in practice, especially 
for small networks like mine.

What you are suggesting is actually a work-around. It feels like Prometheus 
is missing an easy way to assign alerts to an arbitrary set of metrics, so 
you have to simulate metrics in order to provide the thresholds. Then you 
can use existing PromQL syntax to check those thresholds against the 
corresponding disks.
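
If I have pictured the idea correctly, the alert expression would end up as 
a join along these lines (just a sketch; disk_free_threshold_bytes is a name 
I am inventing here for the virtual threshold metric):

windows_logical_disk_free_bytes
  < on(instance, volume) disk_free_threshold_bytes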

If I understood the idea correctly, these virtual metrics would just report 
the same value at every timestamp, because only the disk they refer to is 
relevant. That is an indication that the concept is not clean, just a 
work-around. Those "virtual" metrics are going to waste storage space, 
because they are real time series as far as Prometheus is concerned. They 
are going to double the number of windows_logical_disk_free_bytes time 
series, because each per-disk metric will need a threshold counterpart. If 
you have thousands of disks, you can argue that this solution does not scale 
well either.

Associating disks with alert thresholds on an arbitrary basis is a very 
common requirement, and I think it comes up all over the place. For example, 
you may have many thermometers measuring temperatures in the same way, but 
each one may require its own alert threshold. I am surprised that Prometheus 
makes this hard to achieve.

Your solution seems to be designed to assign an independent threshold per 
disk, but the most common scenario is that you will only have a small 
number of thresholds. For example, all Windows system disks (normally C:) 
would share one alert threshold, and all Linux system disks another. There 
will probably be a small number of data disk categories, say log disks, 
photo disks and document disks, and each category will need a separate 
alert threshold. But it is improbable that every single disk will need a 
custom threshold. Similarly, if you are alerting based on temperatures, you 
will probably have groups too, like ambient temperature, fridge temperature 
and freezer temperature. Not every thermometer will need a custom alert 
threshold.
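
To make that concrete, the alert rules would then select on a small set of 
disk classes instead of on individual disks. Roughly like this (only a 
sketch; the disk_class label and the threshold values are made up):

groups:
  - name: disk-space
    rules:
      - alert: WindowsSystemDiskSpaceLow
        expr: windows_logical_disk_free_bytes{disk_class="windows_system"} < 10 * 1024 * 1024 * 1024
        for: 15m
      - alert: LogDiskSpaceLow
        expr: windows_logical_disk_free_bytes{disk_class="log"} < 50 * 1024 * 1024 * 1024
        for: 15m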

So you need one alert per threshold, and then a way to assign an arbitrary 
set of disks or thermometers to each alert. The easiest way at the moment is 
probably to use labels. But what you are really looking for is a switch 
statement:

switch ( computer-instance, disk-volume )
{
  case PC1, Volume1: Assign to Alert A.
  case PC3, Volume3: Assign to Alert K.
  default:           Assign to Alert M.
}
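
As far as I know, the closest Prometheus equivalent today is a chain of 
relabeling rules in the scrape configuration that look at the instance and 
volume labels and attach a class label, which the alerts then select on. 
Something along these lines (only a sketch; the alert_class label and the 
instance values, including the port, are assumptions about the setup):

metric_relabel_configs:
  - source_labels: [instance, volume]
    regex: "PC1:9182;Volume1"
    target_label: alert_class
    replacement: A
  - source_labels: [instance, volume]
    regex: "PC3:9182;Volume3"
    target_label: alert_class
    replacement: K
  # Default case: only set the label if none of the rules above matched.
  - source_labels: [alert_class]
    regex: ""
    target_label: alert_class
    replacement: M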


> (2) in the limit, you will end up with a separate rewriting rule for 
> every instance+volume combination

That's not too bad. The rewriting rules are just adding a label to each 
disk. The Prometheus syntax makes them rather verbose, but those rewriting 
rules are (or can be) close to the disks they apply to. After all, you have 
to decide somewhere which threshold applies to each disk, and where exactly 
you do that, or how verbose it is, does not make much difference in my 
opinion.


> This doesn't scale to thousands of alerting rules, but neither does 
> metric relabeling with thousands of rules.

- If you solve this problem with alerting rules, you have to write or modify 
an alert per disk you add. You may end up with many alerts.
- If you solve this problem with relabeling, you have to create or modify a 
label rewriting rule per disk you add. You may end up with many rewriting 
rules.
- If you solve this problem with virtual threshold metrics, you have to 
create a virtual metric per disk you add. You may end up with many metrics.

The difference in scalability is not great, as far as I can see (with my 
rather limited Prometheus knowledge).

Regards,
  rdiez
