Hello everyone!
I'd like to learn about some best practices for handling exceptions in
alerting rules. Let's say we are monitoring "node_exporter" metrics like
system load or disk usage. Most servers typically stay below the alert
threshold, but a few (1-2) run above or close to it as part of normal
operation.

What is the best way to alert when metric X crosses a threshold on most
servers, while applying a different rule to the ones that already run close
to X as part of normal operation?

In my case, a few servers typically have high CPU usage, while others
typically have high disk usage.

Should I create separate rules and filter by job? That doesn't look like it
would scale if more servers end up close to the threshold in the future.
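For illustration, I imagine that would end up looking something like this
(a rough sketch; "busy" is a made-up job name for the servers that normally
run hot, and 10 is an arbitrary higher threshold):

  # default rule for most servers
  - alert: high_cpu_load
    expr: node_load1{send_alerts="True", job!="busy"} > 5
    for: 10m
    labels:
      severity: warning

  # separate copy of the rule for the servers that are normally loaded
  - alert: high_cpu_load_busy
    expr: node_load1{send_alerts="True", job="busy"} > 10
    for: 10m
    labels:
      severity: warning

Every new exception would mean another copy of each rule.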

Should I raise the threshold for everyone? In that case some typically idle
servers might get overloaded and I wouldn't be notified until it's too late.

Should I put the threshold in a label and maintain it per server? Can I
have a default in a simple way and only use the per-server value for
overrides? That should reduce the number of rules I need to maintain.
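To illustrate what I have in mind, here is a rough, untested sketch. Since
label values are strings in PromQL, I imagine exposing the per-server
override as its own time series instead; cpu_load_threshold is a made-up
metric name I would have to publish myself (e.g. via the node_exporter
textfile collector, carrying only an instance label), and the default of 5
would apply wherever no override exists:

  - alert: high_cpu_load
    expr: |
      node_load1{send_alerts="True"}
        > on(instance) group_left()
      (
        # per-server override, where one exists
        cpu_load_threshold
          or
        # default threshold of 5 for all other instances
        (count by (instance) (node_load1{send_alerts="True"}) * 0 + 5)
      )
    for: 10m
    labels:
      severity: warning

Is that a sane pattern, or is there a simpler way to express defaults plus
overrides? Presumably the same idea could be reused for the storage rule.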

As an example, these are the rules I'm currently using:

>   - alert: high_cpu_load
>     expr: node_load1{send_alerts="True"} > 5
>     for: 10m
>     labels:
>       severity: warning
>     annotations:
>       summary: "Server under high load"
>       description: "[{{$labels.job}}] Host is under high load, the avg load 1m is at {{$value}}. Reported by instance {{ $labels.instance }}."
>
>   - alert: high_storage_load
>     expr: (node_filesystem_size_bytes{fstype="ext4", send_alerts="True"} - node_filesystem_free_bytes{fstype="ext4", send_alerts="True"}) / node_filesystem_size_bytes{fstype="ext4", send_alerts="True"} * 100 > 85
>     for: 10m
>     labels:
>       severity: warning
>     annotations:
>       summary: "Server storage is almost full"
>       description: "[{{$labels.job}}] Host storage usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }}."

Thanks for any advice!
Regards,
Adrian
