On Thursday, 13 January 2022 at 07:41:33 UTC [email protected] wrote:
> What is the best way to have alerts when metric X passes a threshold for
> most servers, but for the ones that are already running close to X, set a
> different rule?
See https://www.robustperception.io/using-time-series-as-alert-thresholds for the direct answer to that question.

You can also alert on trends rather than static thresholds - e.g. for disk space you can use predict_linear to detect when a filesystem looks like it's going to become full. See this thread <https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit>.

However, I'd also caution you against setting alerts on causes; concentrate your alerting on symptoms instead. You can't avoid all cause-based alerts, but you can minimise them. "CPU load", for example, is not a particularly useful metric to alert on. Suppose the CPU load hits 99% at 3am, *but the service is still working fine.* Do you really want to get someone out of bed for this? And if you do get them out of bed, what exactly are they going to do about it anyway?

This document, which is only a few pages, is well worth reading: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
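In case a concrete example helps, here's a rough, untested sketch of both rule styles. The metric names my_metric and my_metric_threshold are placeholders for whatever you're actually scraping (the threshold would be a per-instance series you expose or record yourself); the filesystem rule assumes node_exporter's node_filesystem_avail_bytes. Adjust names, durations and labels to your setup:

groups:
  - name: example
    rules:
      # Per-target threshold: each instance has its own my_metric_threshold
      # series, so the comparison uses that value rather than one hard-coded
      # number for every server.
      - alert: MetricAboveThreshold
        expr: my_metric > on(instance) group_left() my_metric_threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is above its configured threshold"

      # Trend-based: fire if the linear trend over the last hour predicts the
      # filesystem will run out of space within four hours.
      - alert: FilesystemFullSoon
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} looks like it will fill up within 4 hours"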

