[prometheus-users] Re: Different disk full thresholds for alerts

Brian Candler Sun, 21 Nov 2021 02:15:57 -0800

On Saturday, 20 November 2021 at 22:20:11 UTC [email protected] wrote:

> The trouble is, I want a different alert threshold depending on the disk, 
> and the thresholds can be pretty arbitrary.
>
> What is the best way to achieve that?
>


The direct answer is here: 
https://www.robustperception.io/using-time-series-as-alert-thresholds
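
A minimal sketch of that approach, for illustration only: each threshold is 
recorded as its own time series, and the alert joins against it on the 
labels that identify the filesystem.  The metric name 
filesystem_min_free_ratio, the alert name, and the instance/mountpoint values 
are made up; filesystems with no threshold series simply never match.

- name: DiskThresholds
  rules:
  # Hypothetical per-filesystem thresholds (minimum fraction of space free)
  - record: filesystem_min_free_ratio
    expr: vector(0.05)
    labels:
      instance: 'db1:9100'
      mountpoint: '/var/lib/postgresql'
  - record: filesystem_min_free_ratio
    expr: vector(0.20)
    labels:
      instance: 'web1:9100'
      mountpoint: '/'
  # Fire when the free fraction drops below that filesystem's own threshold
  - alert: DiskBelowThreshold
    expr: |
      node_filesystem_avail_bytes / node_filesystem_size_bytes
        < on (instance, mountpoint) group_left()
      filesystem_min_free_ratio
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Free space is below the threshold set for this filesystem'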

However, I've found it's better not to have static alert thresholds 
anyway.  The problem is: a volume hits 90% full, but it's working fine,  
and isn't growing, and nobody wants to mess with data just to bring it back 
down to 89% to silence the alert.  You obviously don't want the alert 
firing forever, so what do you do?  Move the threshold to 91%, and repeat 
the whole thing later?

Instead, I have two sets of disk space alerts.

* A critical alert for "the disk is full, or near as dammit" (less than 
100MB free), for any filesystem whose capacity is more than 120MB. This 
fires almost immediately.

  - alert: DiskFull
    expr: |
      node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"} < 100000000
        unless node_filesystem_size_bytes < 120000000
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem full or less than 100MB free space'

* Warning alerts for "the disk is filling up, and at this rate is going to 
be full soon"

- name: DiskRate10m
  interval: 1m
  rules:
  # Warn if rate of growth over last 10 minutes means filesystem will fill
  # in 2 hours
  - alert: DiskFilling
    expr: |
      predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 0
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in less than 2h at current 10m growth rate'

- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill
  # in 2 days
  - alert: DiskFilling
    expr: |
      predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 0
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in less than 2d at current 3h growth rate'

- name: DiskRate12h
  interval: 1h
  rules:
  # Warn if rate of growth over last 12 hours means filesystem will fill
  # in 7 days
  - alert: DiskFilling
    expr: |
      predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[12h], 7*86400) < 0
    for: 24h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in less than 1w at current 12h growth rate'

These are evaluated over different time periods and with different "for" 
periods, to reduce noise from filesystems which have a regular filling and 
emptying pattern.  For example, I see some systems where the disk space 
grows and shrinks in an hourly or daily pattern.

I would like to rework those expressions so that they return the estimated 
time-until-full (i.e. work out where the zero crossing takes place), but I 
never got round to it.
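
An untested sketch of what that rework might look like, using the 3h window 
as an example.  predict_linear(...[3h], 0) gives the fitted value at the 
current time and deriv(...[3h]) gives the fitted slope in bytes per second, 
so dividing one by the negated slope estimates the seconds until the fitted 
line reaches zero.  The group and alert names and the description text are 
illustrative only.

- name: DiskTimeToFull3h
  interval: 10m
  rules:
  - alert: DiskFillingEta
    expr: |
      (
          predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 0)
        /
          (-deriv(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h]))
      ) > 0 < 2*86400
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in less than 2d at current 3h growth rate'
      description: 'Estimated time until full: {{ $value | humanizeDuration }}'

The "> 0" filter drops filesystems whose free space is growing (which would 
give a negative estimate), and a flat filesystem gives an infinite or 
undefined result which never passes the "< 2*86400" comparison.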

In practice I also have to exclude a few noisy systems by applying more 
label filters:
node_filesystem_avail_bytes{fstype!~"...",instance!~"...",mountpoint!~"..."}

HTH,

Brian.

