Awesome explanation. This helps a lot. Thanks, I appreciate it.

On Saturday, March 14, 2020 at 10:06:38 PM UTC+5:30, Christian Hoffmann 
wrote:
>
> On 3/14/20 5:06 PM, Yagyansh S. Kumar wrote: 
> > Can you explain in a little detail please? 
> I'll try to walk through your example in several steps: 
>
> ## Step 1 
> Your initial expression was this: 
>
> (node_load15 > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"})) * on(instance) 
> group_left(nodename) node_uname_info 
>
>
> ## Step 2 
> Let's drop the info part for now to make things simpler (you can add it 
> back at the end): 
>
> node_load15 > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) 
>
>
> ## Step 3 
> With that query, you could add a factor. The simplest way would be to 
> have two alerts: one for the machines with the 1x factor and one for 
> those with the 2x factor: 
>
> node_load15{instance=~"a|b|c"} > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) 
>
> and 
>
> node_load15{instance!~"a|b|c"} > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) * 2 
>
>
> ## Step 4 
> Depending on your use case, this may be enough already. However, you 
> would need to modify those two alerts whenever you add a machine. So, 
> something more scalable would be using a metric (e.g. from a recording 
> rule) for the scale factor: 
>
> node_load15 > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) * on(instance) group_left() 
> cpu_core_scale_factor 
>
> This would require a recording rule for each of your machines: 
>
> - record: cpu_core_scale_factor 
>   labels: 
>     instance: a 
>   expr: 1 
> - record: cpu_core_scale_factor 
>   labels: 
>     instance: c 
>   expr: 2 # factor two 
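>
> For context, rules like these have to live inside a rule group in a 
> rule file. A minimal sketch of such a file follows; the group name is 
> made up, and the instance values are the same placeholders as above: 
>
> ```yaml
> groups:
>   - name: cpu_core_scale_factors   # hypothetical group name
>     rules:
>       - record: cpu_core_scale_factor
>         labels:
>           instance: a              # placeholder instance
>         expr: 1
>       - record: cpu_core_scale_factor
>         labels:
>           instance: c              # placeholder instance
>         expr: 2                    # factor two
> ```
>
> Running "promtool check rules <file>" should catch syntax mistakes 
> before Prometheus loads the file. 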
>
>
> ## Step 5 
> A further maintenance simplification would be to omit those entries for 
> the most common case (just the number of cores, no multiplication 
> factor). 
> This is what the linked blog post describes. Sadly, it complicates the 
> alert rule a little bit: 
>
>
> node_load15 > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) * on(instance) group_left() ( 
>     cpu_core_scale_factor 
>   or on(instance) 
>     node_load15*0 + 1  # <-- the "1" is the default value 
> ) 
>
> The part after group_left() returns the value from your factor 
> recording rule where one exists. Where it doesn't, the "or" branch 
> supplies a default value. This works by taking an arbitrary metric which 
> exists exactly once for each instance. It makes sense to take the same 
> metric which your alert is based on. The value is multiplied by 0, as we 
> do not care about the value at all. We then add 1, the default value you 
> wanted. Essentially, this produces a temporary series that exists only 
> inside the query. This part might be a bit hard to get across, but 
> basically you can just copy this pattern verbatim. 
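>
> To make the fallback more concrete, here is a hand evaluation for two 
> hypothetical instances. The sample values, the core counts and the job 
> label are invented for illustration only: 
>
> ```promql
> # Assumed series at evaluation time:
> #   cpu_core_scale_factor{instance="c"}       2
> #   node_load15{instance="a", job="node"}     0.7
> #   node_load15{instance="c", job="node"}     3.2
> #
> # node_load15 * 0 + 1 gives one series per instance, all with value 1:
> #   {instance="a", job="node"}                1
> #   {instance="c", job="node"}                1
> #
> # "or on(instance)" keeps every left-hand series and adds only those
> # right-hand series whose instance is missing on the left:
> #   cpu_core_scale_factor{instance="c"}       2   (from the recording rule)
> #   {instance="a", job="node"}                1   (default filled in)
> #
> # With 4 cores on "a" and 8 cores on "c", the thresholds become
> # 4 * 1 = 4 and 8 * 2 = 16.
>     cpu_core_scale_factor
>   or on(instance)
>     node_load15 * 0 + 1
> ```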
>
> In this case, you would only need to add a recording rule for those 
> machines which should have a non-default (i.e. other than 1) cpu count 
> scale factor (i.e. the "instance: c" rule above). 
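>
> Putting the pieces together, a complete alerting rule might look roughly 
> like the sketch below, with the info metric from Step 1 added back. The 
> group name, alert name, "for" duration, severity label and summary text 
> are assumptions, not something taken from your setup, and like the other 
> examples this is untested: 
>
> ```yaml
> groups:
>   - name: node_load_alerts         # hypothetical group name
>     rules:
>       - alert: HighLoad15          # hypothetical alert name
>         expr: |
>           (
>             node_load15
>               > count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
>                 * on(instance) group_left() (
>                     cpu_core_scale_factor
>                   or on(instance)
>                     node_load15 * 0 + 1  # default factor
>                 )
>           )
>           * on(instance) group_left(nodename) node_uname_info
>         for: 10m                   # assumed duration
>         labels:
>           severity: warning        # assumed label
>         annotations:
>           summary: "15m load on {{ $labels.nodename }} exceeds the scaled core count"
> ```
>
> Since node_uname_info always has the value 1, the multiplication only 
> attaches the nodename label and leaves the load value untouched. 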
>
> ## Step 6 
> As a last suggestion, you might want to reconsider whether strict 
> alerting on the system load is useful at all. In our setup, we do alert 
> on it, but only on really high values, which should only be reached when 
> the load is skyrocketing (usually due to some hanging network filesystem 
> or other deadlock situation). 
>
>
> Note: All examples are untested, so take them with a grain of salt. I 
> just want to get the idea across. 
>
> Hope this helps, 
> Christian 
>

