On Thursday, 13 January 2022 at 07:41:33 UTC [email protected] wrote:
> What is the best way to have alerts when metric X passes a threshold for
> most servers, but for the ones that are already running close to X, set a
> different rule?
See https://www.robustperception.io/using-time-series-as-alert-thresholds for the direct answer to that question.

You can also alert on trends rather than static thresholds - e.g. for disk space you can use predict_linear to detect when a filesystem looks like it's going to become full. See this thread <https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit>.

However, I'd also caution you against setting alerts on causes; concentrate your alerting on symptoms instead. You can't avoid all cause-based alerts, but you can minimise them. "CPU load", for example, is not a particularly useful metric to alert on. Suppose the CPU load hits 99% at 3am, *but the service is still working fine.* Do you really want to get someone out of bed for this? And if you do get them out of bed, what exactly are they going to do about it anyway?

This document, which is only a few pages, is well worth reading: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
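In case a concrete example helps, here's a rough, untested sketch of both rule styles. The metric names my_metric and my_metric_threshold are placeholders for whatever you're actually scraping (the threshold would be a per-instance series you expose or record yourself); the filesystem rule assumes node_exporter's node_filesystem_avail_bytes. Adjust names, durations and labels to your setup:

groups:
  - name: example
    rules:
      # Per-target threshold: each instance has its own my_metric_threshold
      # series, so the comparison uses that value rather than one hard-coded
      # number for every server.
      - alert: MetricAboveThreshold
        expr: my_metric > on(instance) group_left() my_metric_threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is above its configured threshold"

      # Trend-based: fire if the linear trend over the last hour predicts the
      # filesystem will run out of space within four hours.
      - alert: FilesystemFullSoon
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} looks like it will fill up within 4 hours"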

