On 3/14/20 5:06 PM, Yagyansh S. Kumar wrote:
> Can you explain in a little detail please?
I'll try to walk through your example in several steps:
## Step 1
Your initial expression was this:
(node_load15 > count without (cpu, mode)
(node_cpu_seconds_total{mode="system"})) * on(instance)
group_left(nodename) node_uname_info
## Step 2
Let's drop the info part for now to make things simpler (you can add it
back at the end):
node_load15 > count without (cpu, mode)
(node_cpu_seconds_total{mode="system"})
## Step 3
With that query, you could add a factor. The simplest way would be to
have two alerts: one for your machines with the 1x factor, one for those
with the 2x factor:
node_load15{instance=~"a|b|c"} > count without (cpu, mode)
(node_cpu_seconds_total{mode="system"})
and
node_load15{instance!~"a|b|c"} > count without (cpu, mode)
(node_cpu_seconds_total{mode="system"}) * 2
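Just to illustrate, those two expressions could become two alerting
rules roughly like this (untested as well; the alert name and the for:
duration are made up):
- alert: NodeHighLoad  # the same alert name can be reused for both rules
  expr: |
    node_load15{instance=~"a|b|c"} > count without (cpu, mode)
      (node_cpu_seconds_total{mode="system"})
  for: 15m  # duration is just an example
- alert: NodeHighLoad
  expr: |
    node_load15{instance!~"a|b|c"} > count without (cpu, mode)
      (node_cpu_seconds_total{mode="system"}) * 2
  for: 15m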
## Step 4
Depending on your use case, this may be enough already. However, you
would need to modify those two alerts whenever you add a machine. A more
scalable approach would be to use a metric (e.g. from a recording rule)
for the scale factor:
node_load15 > count without (cpu, mode)
(node_cpu_seconds_total{mode="system"}) * on(instance) group_left() cpu_core_scale_factor
This would require a recording rule for each and every one of your
machines:
- record: cpu_core_scale_factor
  labels:
    instance: a
  expr: 1
- record: cpu_core_scale_factor
  labels:
    instance: c
  expr: 2  # factor two
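For completeness, tied together in one rules file this could look
roughly like the following (again untested; the group and alert names
are made up):
groups:
  - name: cpu-scale-factor
    rules:
      - record: cpu_core_scale_factor
        labels:
          instance: a
        expr: 1
      - record: cpu_core_scale_factor
        labels:
          instance: c
        expr: 2  # factor two
      - alert: NodeHighLoad
        expr: |
          node_load15 > count without (cpu, mode)
            (node_cpu_seconds_total{mode="system"})
            * on(instance) group_left() cpu_core_scale_factor
        for: 15m  # duration is just an example
With those rules, an 8-core machine would alert above a 15m load of 8
with factor 1, or above 16 with factor 2.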
## Step 5
A further simplification regarding maintenance would be to omit those
entries for your most common case (just the number of cores, no
multiplication factor).
This is what the linked blog post describes. Sadly, it complicates the
alert rule a little bit:
node_load15 > count without (cpu, mode)
(node_cpu_seconds_total{mode="system"}) * on(instance) group_left() (
    cpu_core_scale_factor
  or on(instance)
    node_load15 * 0 + 1  # <-- the "1" is the default value
)
The part after group_left() basically returns the value from your factor
recording rule. If that metric doesn't exist for an instance, it
calculates a default value instead. This works by taking an arbitrary
metric which exists exactly once per instance; it makes sense to take
the same metric your alert is based on. Its value is multiplied by 0, as
we do not care about the value at all, and then 1 is added, which is the
default value you wanted. Essentially, this produces a temporary,
invisible metric. This part might be a bit hard to get across, but you
can basically copy this pattern verbatim.
With this in place, you only need to add a recording rule for those
machines which should have a non-default (i.e. other than 1) CPU core
scale factor, such as the "instance: c" rule above.
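Putting everything together and adding the node_uname_info join from
your original expression back in, the final alert rule could look
roughly like this (again untested; the alert name, for: duration, and
annotation are just placeholders):
- alert: NodeHighLoad
  expr: |
    (
      node_load15 > count without (cpu, mode)
        (node_cpu_seconds_total{mode="system"})
        * on(instance) group_left() (
            cpu_core_scale_factor
          or on(instance)
            node_load15 * 0 + 1  # default factor of 1
        )
    ) * on(instance) group_left(nodename) node_uname_info
  for: 15m
  annotations:
    summary: "High 15m load on {{ $labels.nodename }} ({{ $labels.instance }})"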
## Step 6
As a last suggestion, you might want to revisit whether strict alerting
on the system load is really useful at all. In our setup, we do alert on
it, but only on really high values which should only trigger if the load
is skyrocketing (usually due to a hanging network filesystem or some
other deadlock situation).
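A rough sketch of such an alert (the factor of 4, the alert name, and
the duration are arbitrary; tune them to your environment):
- alert: NodeLoadSkyrocketing
  expr: |
    node_load15 > 4 * count without (cpu, mode)
      (node_cpu_seconds_total{mode="system"})
  for: 30m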
Note: All examples are untested, so take them with a grain of salt. I
just want to get the idea across.
Hope this helps,
Christian