Also, since you mentioned hanging network filesystems: is there any way 
to find out whether an NFS mount is hung on a machine or not? I have put 
a lot of effort into this, must have tried more than 50 things, but 
still have nothing to show for it.
In our setup we use a lot of NFS and some of the mounts are really 
critical. All these shared NFS mounts come from a third-party vendor, 
and due to network lag, IP mismatches or ten other reasons, an NFS mount 
ends up hanging on a machine or two. I need to know whenever this 
happens. Is there anything that can be done here?
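
One untested idea: if your node_exporter's filesystem collector applies 
its mount timeout and sets node_filesystem_device_error to 1 for mounts 
it cannot stat in time (please verify this for your version), an alert 
along these lines might catch a hung mount (the alert name, label 
matcher and durations are just placeholders):

- alert: NFSMountHung
  expr: node_filesystem_device_error{fstype=~"nfs.*"} == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "NFS mount {{ $labels.mountpoint }} on {{ $labels.instance }} looks hung"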

On Saturday, March 14, 2020 at 10:06:38 PM UTC+5:30, Christian Hoffmann 
wrote:
>
> On 3/14/20 5:06 PM, Yagyansh S. Kumar wrote: 
> > Can you explain in a little detail please? 
> I'll try to walk through your example in several steps: 
>
> ## Step 1 
> Your initial expression was this: 
>
> (node_load15 > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"})) * on(instance) 
> group_left(nodename) node_uname_info 
>
>
> ## Step 2 
> Let's drop the info part for now to make things simpler (you can add it 
> back at the end): 
>
> node_load15 > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) 
>
>
> ## Step 3 
> With that query, you could add a factor. The simplest way would be to 
> have two alerts: one for your machines with the 1x factor, one with 
> the 2x factor:
>
> node_load15{instance=~"a|b|c"} > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) 
>
> and 
>
> node_load15{instance!~"a|b|c"} > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) * 2 
>
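> Written as alerting rules, that could look roughly like this (the 
> alert names and the for: duration are just placeholders, untested): 
>
> - alert: HighLoadFactorOne 
>   expr: | 
>     node_load15{instance=~"a|b|c"} > count without (cpu, mode) 
>       (node_cpu_seconds_total{mode="system"}) 
>   for: 10m 
> - alert: HighLoadFactorTwo 
>   expr: | 
>     node_load15{instance!~"a|b|c"} > count without (cpu, mode) 
>       (node_cpu_seconds_total{mode="system"}) * 2 
>   for: 10m 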
>
> ## Step 4 
> Depending on your use case, this may be enough already. However, you 
> would need to modify those two alerts whenever you add a machine. So, 
> something more scalable would be using a metric (e.g. from a recording 
> rule) for the scale factor: 
>
> node_load15 > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) * on(instance) 
> cpu_core_scale_factor 
>
> This would require that you have a recording rule for each and every 
> one of your machines: 
>
> - record: cpu_core_scale_factor 
>   labels: 
>     instance: a 
>   expr: 1 
> - record: cpu_core_scale_factor 
>   labels: 
>     instance: c 
>   expr: 2 # factor two 
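>
> If you go down this route, a quick way to spot machines that do not 
> have a factor rule yet would be something like this (untested; it 
> simply lists the per-instance core count for every instance that has 
> no cpu_core_scale_factor series): 
>
> count without (cpu, mode) (node_cpu_seconds_total{mode="system"}) 
>   unless on(instance) cpu_core_scale_factor 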
>
>
> ## Step 5 
> A further simplification, maintenance-wise, would be if you could omit 
> those entries for your more common case (just the number of cores, no 
> multiplication factor). 
> This is what the linked blog post describes. Sadly, it complicates the 
> alert rule a little bit: 
>
>
> node_load15 > count without (cpu, mode) 
> (node_cpu_seconds_total{mode="system"}) * on(instance) group_left() ( 
>     cpu_core_scale_factor 
>   or on(instance) 
>     node_load15*0 + 1  # <-- the "1" is the default value 
> ) 
>
> The part after group_left() basically returns the value from your factor 
> recording rule. If it doesn't exist, it calculates a default value. This 
> works by taking an arbitrary metric which exists exactly once for each 
> instance. It makes sense to take the same metric which your alert is 
> based on. The value is multiplied by 0, as we do not care about the 
> value at all. We then add 1, the default value you wanted. Essentially, 
> this leads to a temporary, invisible metric. This part might be a bit 
> hard to get across, but basically you can just copy this pattern verbatim. 
>
> In this case, you would only need to add a recording rule for those 
> machines which should have a non-default (i.e. other than 1) cpu count 
> scale factor (i.e. the "instance: c" rule above). 
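>
> With that in place, adding the info join from step 1 back in, the 
> complete (again, untested) expression would look something like this: 
>
> ( 
>   node_load15 > count without (cpu, mode) 
>     (node_cpu_seconds_total{mode="system"}) * on(instance) group_left() ( 
>         cpu_core_scale_factor 
>       or on(instance) 
>         node_load15*0 + 1 
>     ) 
> ) * on(instance) group_left(nodename) node_uname_info 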
>
> ## Step 6 
> As a last suggestion, you might want to revisit whether strict 
> alerting on the system load is useful at all. In our setup, we do 
> alert on it, but only on really high values, which should only trigger 
> if the load is skyrocketing (usually due to some hanging network 
> filesystem or another deadlock situation). 
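>
> For example (untested; the factor of 4 and the duration are arbitrary 
> placeholders), an alert like this would only fire on a sustained, 
> clearly excessive load: 
>
> - alert: LoadSkyrocketing 
>   expr: | 
>     node_load15 > 4 * count without (cpu, mode) 
>       (node_cpu_seconds_total{mode="system"}) 
>   for: 15m 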
>
>
> Note: All examples are untested, so take them with a grain of salt. I 
> just want to get the idea across. 
>
> Hope this helps, 
> Christian 
>
