sum by returns hundreds of results (one for each mountpoint), and each of them
appears to generate its own alert event.

When we try to consolidate that to a single event by using topk, the alert
flaps rapidly between instances.
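
(Purely as a sketch, not something established in this thread: an aggregation keyed 
on a stable label set, for example

max by (instance) (SIZE_QUERY) > 80

using the SIZE_QUERY placeholder from the quoted post below, returns at most one 
series per instance with a fixed label set, so it would not flap the way 
topk(1, ...) does; the trade-off is that it drops the per-mountpoint and per-device 
detail discussed further down.)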

This is the only thing in the logs showing a problem:
level=warn ts=2020-10-22T14:10:32.912Z caller=delegate.go:272 
component=cluster msg="dropping messages because too many are queued" 
current=4103 limit=4096



On Thursday, October 22, 2020 at 9:34:45 AM UTC-6 [email protected] wrote:

> > But when the query returns many results, it causes alertmanager problems.
> > sum by returns too many results and brings alertmanager to its knees, which
> > breaks our alerting in general.
>
> This is a little too vague. What problems are you referring to? Are you 
> seeing performance issues with alertmanager when there are too many alerts, 
> or is it a usability problem when you get multiple alerts for the same 
> underlying problem?
> On Thursday, 22 October 2020 at 16:24:56 UTC+1 [email protected] 
> wrote:
>
>> I have a query to find out if space is running out:
>> (100 - (100 *
>> node_filesystem_avail_bytes{job="special_host",mountpoint=~"/my_data/[a-zA-Z]*/.*"}
>> /
>> node_filesystem_size_bytes{job="special_host",mountpoint=~"/my_data/[a-zA-Z]*/.*"}))
>>
>> For simplicity, let's substitute this with SIZE_QUERY.
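>>
>> (To make the formula concrete: it is "percent used", so a filesystem with, say,
>> 13.4 GB available out of 100 GB would give 100 - (100 * 13.4 / 100) = 86.6,
>> which is the shape of the sample values below.)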
>>
>> This VM is a special case because several of its mountpoints report 
>> equivalent metrics.
>> I have two categories of mounts on the host:
>>
>> This group of mounts shares the underlying storage, so the values are 
>> duplicated (note: for brevity only 2 of many are shown):
>> {device="$DEVICE1",fstype="$FS1",instance="$INSTANCE1",job="special_host",mountpoint="/my_data/first"} 86.6186759625663
>> {device="$DEVICE2",fstype="$FS1",instance="$INSTANCE1",job="special_host",mountpoint="/my_data/second"} 86.6186759625663
>>
>> This group of mounts does not share underlying storage:
>> {device="$DEVICE3",fstype="$FS2",instance="$INSTANCE1",job="special_host",mountpoint="/var/log"} 85.1214545444532
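>>
>> (Sketch only, based on an assumption visible in this sample, namely that the
>> shared mounts are exactly the ones with fstype="$FS1": an aggregation such as
>> max by (instance, fstype) (SIZE_QUERY) > 80
>> would collapse the duplicated shared-storage series into one result per instance
>> and filesystem type while keeping those two labels, though it would also merge
>> any non-shared mounts that happen to share an fstype.)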
>>
>> I want to alert when any single host is above the threshold. When the 
>> instance is not in the "shared" group, this is trivial. But when the query 
>> returns many results, it causes alertmanager problems.
>>
>> My PromQL knowledge of how to get around this limitation is lacking, but 
>> these are the things I've tried. Each has a problem:
>>
>> topk: flaps between the alerting instances as the labels change.
>> topk(1, sum by (instance, mountpoint, device) (SIZE_QUERY) > 80)
>>
>> sum by: returns too many results and brings alertmanager to its knees, which 
>> breaks our alerting in general.
>> sum by (device, instance) (SIZE_QUERY) > 80
>> sum by (device, instance, mountpoint) (SIZE_QUERY) > 80
>>
>> max: doesn't show the labels, which makes it hard to debug the problem from 
>> the notification (what instance? what device?).
>> max(SIZE_QUERY > 80)
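>>
>> (For comparison, and purely as a sketch: an explicit grouping such as
>> max by (instance, device, mountpoint) (SIZE_QUERY) > 80
>> keeps those labels in the result, so the notification would still say which
>> instance and device is affected; it does not reduce the number of results,
>> though, since every mountpoint remains its own group.)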
>>
>> Is there a possible solution to this that I haven't considered?
>>
>
