> [...]
> (1) as I said, it's a bad way to build a monitoring system

It is of course cleaner, in a theoretical way, to place the thresholds in a 
separate location, and not change the disk metrics every time you relabel 
them in order to move a disk to a different threshold alert.

But I am not convinced that your solution is better in practice, especially 
for small networks like mine.

What you are suggesting is actually a work-around. It feels like Prometheus 
is missing an easy way to assign alerts to an arbitrary set of metrics, so 
you have to simulate metrics in order to provide the thresholds. Then you 
can use existing PromQL syntax to check those thresholds against the 
corresponding disks.
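
If I have pictured the idea correctly, the alert expression would end up as 
a join along these lines (just a sketch; disk_free_threshold_bytes is a name 
I am inventing here for the virtual threshold metric):

windows_logical_disk_free_bytes
  < on(instance, volume) disk_free_threshold_bytes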

If I understood the idea correctly, these virtual metrics would just report 
the same value at every timestamp, because only the disk they refer to is 
relevant. That is an indication that the concept is not clean, just a 
work-around. Those "virtual" metrics are going to waste storage space, 
because they are real time series as far as Prometheus is concerned. They 
are going to double the number of windows_logical_disk_free_bytes time 
series, because each per-disk metric will need a threshold counterpart. If 
you have thousands of disks, you can argue that this solution does not scale 
well either.

Associating disks with alert thresholds on an arbitrary basis is a very 
common requirement, and I think it comes up all over the place. For example, 
you may have many thermometers measuring temperatures in the same way, but 
each one may require its own alert threshold. I am surprised that Prometheus 
makes this hard to achieve.

Your solution seems to be designed to assign an independent threshold per 
disk, but the most common scenario is that you will only have a small 
number of thresholds. For example, all Windows system disks (normally C:) 
would share one alert threshold, and all Linux system disks another. There 
will probably be a small number of data disk categories, say log disks, 
photo disks and document disks, and each category will need a separate 
alert threshold. But it is improbable that every single disk will need a 
custom threshold. Similarly, if you are alerting based on temperatures, you 
will probably have groups too, like ambient temperature, fridge temperature 
and freezer temperature. Not every thermometer will need a custom alert 
threshold.
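
To make that concrete, the alert rules would then select on a small set of 
disk classes instead of on individual disks. Roughly like this (only a 
sketch; the disk_class label and the threshold values are made up):

groups:
  - name: disk-space
    rules:
      - alert: WindowsSystemDiskSpaceLow
        expr: windows_logical_disk_free_bytes{disk_class="windows_system"} < 10 * 1024 * 1024 * 1024
        for: 15m
      - alert: LogDiskSpaceLow
        expr: windows_logical_disk_free_bytes{disk_class="log"} < 50 * 1024 * 1024 * 1024
        for: 15m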

So you need one alert per threshold, and then a way to assign an arbitrary 
set of disks or thermometers to each alert. The easiest way at the moment is 
probably to use labels. But what you are really looking for is a switch 
statement:

switch ( computer-instance, disk-volume )
{
  case PC1, Volume1: Assign to Alert A.
  case PC3, Volume3: Assign to Alert K.
  default:           Assign to Alert M.
}
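
As far as I know, the closest Prometheus equivalent today is a chain of 
relabeling rules in the scrape configuration that look at the instance and 
volume labels and attach a class label, which the alerts then select on. 
Something along these lines (only a sketch; the alert_class label and the 
instance values, including the port, are assumptions about the setup):

metric_relabel_configs:
  - source_labels: [instance, volume]
    regex: "PC1:9182;Volume1"
    target_label: alert_class
    replacement: A
  - source_labels: [instance, volume]
    regex: "PC3:9182;Volume3"
    target_label: alert_class
    replacement: K
  # Default case: only set the label if none of the rules above matched.
  - source_labels: [alert_class]
    regex: ""
    target_label: alert_class
    replacement: M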


> (2) in the limit, you will end up with a separate rewriting rule for 
> every instance+volume combination

That's not too bad. The rewriting rules are just adding a label to each 
disk. The Prometheus syntax makes them rather verbose, but those rewriting 
rules are (or can be) close to the disks they apply to. After all, you have 
to decide somewhere which threshold applies to each disk, and where exactly 
you do that, or how verbose it is, does not make much difference in my 
opinion.


> This doesn't scale to thousands of alerting rules, but neither does 
> metric relabeling with thousands of rules.

- If you solve this problem with alerting rules, you have to write or modify 
an alert per disk you add. You may end up with many alerts.
- If you solve this problem with relabeling, you have to create or modify a 
label rewriting rule per disk you add. You may end up with many rewriting 
rules.
- If you solve this problem with virtual threshold metrics, you have to 
create a virtual metric per disk you add. You may end up with many metrics.

The difference in scalability is not great, as far as I can see (with my 
rather limited Prometheus knowledge).

Regards,
  rdiez
