Hello everyone!
I'd like to learn some best practices for handling exception cases in
alerts. Say we are monitoring "node_exporter" metrics such as system load
or disk space used. Most servers typically stay below the alert threshold,
but a few (1-2) run above or close to it as part of normal operation.
What is the best way to alert when metric X passes a threshold on most
servers, while applying a different rule to the few that already run close
to that threshold?
In my case, a few servers typically have high CPU usage, while others have
high disk usage.
Should I create different rules and filter by job? That doesn't look like
it would scale if more servers end up close to the threshold in the future.
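By "different rules" I mean roughly the following (untested, and the job name
"batch-workers" is just a placeholder for whichever group needs the higher
threshold):

    - alert: high_cpu_load
      expr: node_load1{send_alerts="True", job!="batch-workers"} > 5
      for: 10m
      labels:
        severity: warning
    - alert: high_cpu_load_batch
      expr: node_load1{send_alerts="True", job="batch-workers"} > 15
      for: 10m
      labels:
        severity: warning

Every new exception would mean another matcher and another rule to keep in sync.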
Should I raise the threshold for all servers? Then some typically idle
servers could get overloaded and I wouldn't be notified until it's too late.
Should I attach the threshold to each server, e.g. in a label, and maintain
it per server? Can I keep a simple default and only use the per-server value
for overrides? That should reduce the number of rules I need to maintain.
(I've put a rough sketch of what I have in mind below, after my current rules.)
For reference, these are the rules I'm currently using:
> - alert: high_cpu_load
>   expr: node_load1{send_alerts="True"} > 5
>   for: 10m
>   labels:
>     severity: warning
>   annotations:
>     summary: "Server under high load"
>     description: "[{{ $labels.job }}] Host is under high load, the 1m load
>       average is {{ $value }}. Reported by instance {{ $labels.instance }}."
>
> - alert: high_storage_load
>   expr: (node_filesystem_size_bytes{fstype="ext4", send_alerts="True"} -
>     node_filesystem_free_bytes{fstype="ext4", send_alerts="True"}) /
>     node_filesystem_size_bytes{fstype="ext4", send_alerts="True"} * 100 > 85
>   for: 10m
>   labels:
>     severity: warning
>   annotations:
>     summary: "Server storage is almost full"
>     description: "[{{ $labels.job }}] Host storage usage is {{ humanize $value }}%.
>       Reported by instance {{ $labels.instance }}."
>
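And here is the rough, untested sketch of the per-server override idea I
mentioned above, applied to the CPU rule. Since I don't know of a clean way to
compare against a value stored in a label, it uses a small per-host metric
instead: node_load1_alert_threshold is a name I made up, and I imagine I'd have
to expose it myself on the hosts that need an override (e.g. via node_exporter's
textfile collector), while the default of 5 stays baked into the expression:

    - alert: high_cpu_load
      expr: |
        node_load1{send_alerts="True"}
          > on(instance) group_left()
        (
            node_load1_alert_threshold
          or on(instance)
            (node_load1{send_alerts="True"} * 0 + 5)
        )
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Server under high load"
        description: "[{{ $labels.job }}] Host is under high load, the 1m load
          average is {{ $value }}. Reported by instance {{ $labels.instance }}."

If I understand the vector matching right, hosts that expose the override metric
get compared against their own value, everyone else falls back to 5 through the
"or" branch, and group_left() keeps the node_load1 labels so the annotations
still work. On a host that needs an override I would then drop a tiny file into
whatever directory --collector.textfile.directory points at, e.g.:

    # alert_thresholds.prom (file name and value are just examples)
    node_load1_alert_threshold 12

Does something like this look sane, or is there a more idiomatic way to keep
per-host overrides with a global default?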
Thanks for any advice!
Regards,
Adrian