On Saturday, 20 November 2021 at 22:20:11 UTC [email protected] wrote:

> The trouble is, I want a different alert threshold depending on the disk,
> and the thresholds can be pretty arbitrary.
>
> What is the best way to achieve that?

The direct answer is here:
https://www.robustperception.io/using-time-series-as-alert-thresholds

However, I've found it's better not to have static alert thresholds anyway. The problem is: a volume hits 90% full, but it's working fine, and isn't growing, and nobody wants to mess with data just to bring it back down to 89% to silence the alert. You obviously don't want the alert firing forever, so what do you do? Move the threshold to 91%, and repeat the whole thing later?

Instead, I have two sets of disk space alerts.

* A critical alert for "the disk is full, or near as dammit" (less than 100MB free), for any filesystem whose capacity is more than 120MB. This fires almost immediately.

    - alert: DiskFull
      expr: |
        node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"} < 100000000
        unless node_filesystem_size_bytes < 120000000
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: 'Filesystem full or less than 100MB free space'

* Warning alerts for "the disk is filling up, and at this rate is going to be full soon"

    - name: DiskRate10m
      interval: 1m
      rules:
        # Warn if rate of growth over last 10 minutes means filesystem will fill in 2 hours
        - alert: DiskFilling
          expr: |
            predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 0
          for: 20m
          labels:
            severity: warning
          annotations:
            summary: 'Filesystem will be full in less than 2h at current 10m growth rate'

    - name: DiskRate3h
      interval: 10m
      rules:
        # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
        - alert: DiskFilling
          expr: |
            predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 0
          for: 6h
          labels:
            severity: warning
          annotations:
            summary: 'Filesystem will be full in less than 2d at current 3h growth rate'

    - name: DiskRate12h
      interval: 1h
      rules:
        # Warn if rate of growth over last 12 hours means filesystem will fill in 7 days
        - alert: DiskFilling
          expr: |
            predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[12h], 7*86400) < 0
          for: 24h
          labels:
            severity: warning
          annotations:
            summary: 'Filesystem will be full in less than 1w at current 12h growth rate'

These are evaluated over different time periods and with different "for" periods, to reduce noise from filesystems which have a regular filling and emptying pattern. For example, I see some systems where the disk space grows and shrinks in an hourly or daily pattern.

I would like to rework those expressions so that they return the estimated time-until-full (i.e. work out where the zero crossing takes place), but I never got round to it.

In practice I also have to exclude a few noisy systems by applying more label filters:

    node_filesystem_avail_bytes{fstype!~"...",instance!~"...",mountpoint!~"..."}

HTH,

Brian.
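
On the time-until-full idea mentioned above, here is a rough, untested sketch of how the zero crossing might be computed for the 3h window. It is not something from the original rules. It relies on predict_linear(v, 0) returning the fitted value at evaluation time and deriv(v) returning the fitted slope in bytes per second, so dividing one by the negated slope gives an estimate in seconds. The group name and the recording rule name are made up for illustration, and the 6h "for" period is simply carried over from the DiskRate3h group.

```yaml
groups:
  - name: DiskTimeUntilFull3h        # hypothetical group name
    interval: 10m
    rules:
      # Estimated seconds until avail_bytes reaches zero, from a linear fit over 3h.
      # "deriv(...) < 0" keeps only filesystems whose free space is shrinking;
      # "0 - (...)" turns that negative slope into a positive fill rate.
      - record: node_filesystem_time_until_full_seconds:predict3h   # hypothetical name
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 0)
            / (0 - (deriv(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h]) < 0))

      # Same condition as the 3h DiskFilling alert, expressed on the estimate,
      # so the annotation can report how long is left.
      - alert: DiskFilling
        expr: |
          node_filesystem_time_until_full_seconds:predict3h < 2*86400
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: 'Filesystem estimated full in {{ $value | humanizeDuration }} at current 3h growth rate'
```

For a shrinking filesystem the condition "estimate < 2*86400" should be algebraically the same as "predict_linear(...[3h], 2*86400) < 0", so this ought to fire in the same cases as the existing rule. The gain is that the value can be graphed or shown in the alert text.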

