[prometheus-users] Re: Alertmanger flapping when query returns very similar results

[email protected] Thu, 22 Oct 2020 08:35:15 -0700

> But when the query returns many results This causes alertmanager problems.
> sum by returns too many and puts alertmanager to its knees which breaks
> our alerting in general


This is a little too vague. What problems are you referring to? Are you
seeing performance issues with Alertmanager when there are too many alerts,
or is it a usability problem where you get multiple alerts for the same
underlying problem?
On Thursday, 22 October 2020 at 16:24:56 UTC+1 [email protected] 
wrote:

> I have a query to find out if space is running out:
> (100 - (100 *
>   node_filesystem_avail_bytes{job="special_host",mountpoint=~"/my_data/[a-zA-Z]*/.*"}
>   /
>   node_filesystem_size_bytes{job="special_host",mountpoint=~"/my_data/[a-zA-Z]*/.*"}))
>
> For simplicity, let's substitute this with SIZE_QUERY.
>
> This VM is very special because there are multiple metrics that are 
> equivalent.
> I have two categories of mounts on the host:
>
> This group of mounts shares the underlying storage and reports duplicated
> values (for brevity, only 2 of many are included):
> {device="$DEVICE1",fstype="$FS1",instance="$INSTANCE1",job="special_host",mountpoint="/my_data/first"} 86.6186759625663
> {device="$DEVICE2",fstype="$FS1",instance="$INSTANCE1",job="special_host",mountpoint="/my_data/second"} 86.6186759625663
>
> This group of mounts does not share underlying storage:
> {device="$DEVICE3",fstype="$FS2",instance="$INSTANCE1",job="special_host",mountpoint="/var/log"} 85.1214545444532
>
> I want to alert when any single host is above the threshold. When the
> instance is not in the "shared" group, this is trivial. But when the query
> returns many results, this causes Alertmanager problems.
>
> My PromQL knowledge is lacking on how to get around this limitation, but
> these are the things I've tried. Each has a problem:
>
> topk flaps between the alerting instances as the labels change:
> topk(1, sum by (instance, mountpoint, device) (SIZE_QUERY) > 80)
>
> sum by returns too many series and brings Alertmanager to its knees, which
> breaks our alerting in general:
> sum by (device, instance) (SIZE_QUERY) > 80
> sum by (device, instance, mountpoint) (SIZE_QUERY) > 80
>
> max drops the labels, which makes the notification hard to use for
> debugging (which instance, which device?):
> max(SIZE_QUERY > 80)
>
> Is there a possible solution to this that I haven't considered?
>

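One option the list above does not cover, offered as a sketch rather than a
tested rule: aggregate with max, but keep the labels that identify the
problem. This assumes the mounts that share storage also share an fstype
label (as $FS1 does in the sample output), so they collapse into one series
per instance and filesystem type:

    max by (instance, fstype) (SIZE_QUERY) > 80

This returns at most one series per instance and fstype, so the alert labels
stay stable between evaluations and the notification still names the
instance. The device and mountpoint labels are dropped, so if they are
needed they would have to come from a separate query.
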
-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bb19f2d8-d9d9-4ba4-91de-e97ddd889e29n%40googlegroups.com.
