Also, since you mentioned hanging network filesystems: is there any
way or logic to find out whether an NFS mount is hung on a machine or
not? I have busted my ass trying to get this, must have tried more than
50 things, but still have nothing.
In our setup we use a lot of NFS and some of the mounts are really
critical. All these shared NFS mounts come from a third-party vendor,
and due to network lag, IP mismatches or ten other reasons, NFS ends up
hanging on a machine or two. I need to know whenever this happens. Is
there anything that can be done here?
On Saturday, March 14, 2020 at 10:06:38 PM UTC+5:30, Christian Hoffmann
wrote:
>
> On 3/14/20 5:06 PM, Yagyansh S. Kumar wrote:
> > Can you explain in a little detail please?
> I'll try to walk through your example in several steps:
>
> ## Step 1
> Your initial expression was this:
>
> (node_load15 > count without (cpu, mode)
> (node_cpu_seconds_total{mode="system"})) * on(instance)
> group_left(nodename) node_uname_info
>
>
> ## Step 2
> Let's drop the info part for now to make things simpler (you can add it
> back at the end):
>
> node_load15 > count without (cpu, mode)
> (node_cpu_seconds_total{mode="system"})
>
>
> ## Step 3
> With that query, you could add a factor. The simplest way would be to
> have two alerts: one for the machines with the 1x factor and one for
> those with the 2x factor:
>
> node_load15{instance=~"a|b|c"} > count without (cpu, mode)
> (node_cpu_seconds_total{mode="system"})
>
> and
>
> node_load15{instance!~"a|b|c"} > count without (cpu, mode)
> (node_cpu_seconds_total{mode="system"}) * 2
>
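> As a rough, untested sketch, those two expressions could be wrapped
> into alerting rules along these lines (the alert names and the "for"
> duration are just placeholders):
>
> groups:
> - name: load-alerts   # group and alert names are placeholders
>   rules:
>   - alert: HighLoadDefaultFactor
>     expr: |
>       node_load15{instance=~"a|b|c"} > count without (cpu, mode)
>       (node_cpu_seconds_total{mode="system"})
>     for: 15m
>   - alert: HighLoadDoubleFactor
>     expr: |
>       node_load15{instance!~"a|b|c"} > count without (cpu, mode)
>       (node_cpu_seconds_total{mode="system"}) * 2
>     for: 15m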
>
> ## Step 4
> Depending on your use case, this may be enough already. However, you
> would need to modify those two alerts whenever you add a machine. So,
> something more scalable would be to use a metric (e.g. from a recording
> rule) as the scale factor:
>
> node_load15 > count without (cpu, mode)
> (node_cpu_seconds_total{mode="system"}) * on(instance)
> cpu_core_scale_factor
>
> This would require a recording rule for each and every one of your
> machines:
>
> - record: cpu_core_scale_factor
>   labels:
>     instance: a
>   expr: 1
> - record: cpu_core_scale_factor
>   labels:
>     instance: c
>   expr: 2 # factor two
>
>
> ## Step 5
> A further simplification in terms of maintenance would be if you could
> omit those entries for your more common case (just the number of cores,
> no multiplication factor).
> This is what the linked blog post describes. Sadly, it complicates the
> alert rule a little bit:
>
>
> node_load15 > count without (cpu, mode)
> (node_cpu_seconds_total{mode="system"}) * on(instance) group_left() (
> cpu_core_scale_factor
> or on(instance)
> node_load15*0 + 1 # <-- the "1" is the default value
> )
>
> The part after group_left() basically returns the value from your factor
> recording rule. If it doesn't exist, it calculates a default value. This
> works by taking an arbitrary metric which exists exactly once for each
> instance. It makes sense to take the same metric which your alert is
> based on. The value is multiplied by 0, as we do not care about the
> value at all. We then add 1, the default value you wanted. Essentially,
> this leads to a temporary, invisible metric. This part might be a bit
> hard to get across, but basically you can just copy this pattern verbatim.
>
> In this case, you would only need to add a recording rule for those
> machines which should have a non-default (i.e. other than 1) cpu count
> scale factor (i.e. the "instance: c" rule above).
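>
> For completeness: once this works, you could add the info join from
> step 1 back around the whole expression, which (again untested) would
> look something like this:
>
> (
>   node_load15 > count without (cpu, mode)
>   (node_cpu_seconds_total{mode="system"}) * on(instance) group_left() (
>     cpu_core_scale_factor
>   or on(instance)
>     node_load15*0 + 1
>   )
> ) * on(instance) group_left(nodename) node_uname_info # info join re-added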
>
> ## Step 6
> As a last suggestion, you might want to revisit whether strict alerting
> on the system load is useful at all. In our setup we do alert on it, but
> only on really high values that should only trigger when the load is
> skyrocketing (usually due to a hanging network filesystem or some other
> deadlock situation).
>
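> If you do go that route, a minimal (untested) sketch could look like
> the rule below; the alert name, the factor of 4 and the duration are
> arbitrary placeholders:
>
> - alert: VeryHighLoad
>   expr: |
>     # "4 x core count" and "30m" are placeholder values
>     node_load15 > 4 * count without (cpu, mode)
>     (node_cpu_seconds_total{mode="system"})
>   for: 30m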
>
> Note: All examples are untested, so take them with a grain of salt. I
> just want to get the idea across.
>
> Hope this helps,
> Christian
>