Right, that can easily happen when you add an alert rule that creates tons of alerts. That being said, I work in an environment where we typically see 10-20k alerts and alertmanager handles that without any problem; it only caused performance issues on the alertmanager side a few times, when we had a massive spike beyond the "usual" 10-20k.
The first thing to note here is that all of these alerts are created by a single alert rule, rather than many different rules, so they are all generated by a single rule evaluation and Prometheus then sends them to alertmanager, which can be a lot to process at once. The 10-20k I've mentioned are generated by a large number of different alert rules, and they all start at slightly different times, so alertmanager doesn't get a flood of alerts to process all at once; it's a bit more spread out over time. That being said, Prometheus will re-send all alerts to alertmanager after a few minutes, so I don't expect this to help all that much, except for the fact that the first time alertmanager sees an alert is the most expensive; I'm gonna guess it's a bit cheaper to receive an alert it already tracks, since it doesn't need to allocate as much for it.

IMHO there isn't much we can easily do about it other than tweaking the alert rule so it doesn't generate so many alerts; it's unlikely they will all get acted on anyway when the volume of alerts is too much even for a machine. If all you care about is knowing that there are servers with >80% disk usage, then a simple "count(SIZE_QUERY > 80)" would do (you can add something like without(instance) to preserve some labels on it), OR (and hear me out) have a dashboard that shows you that information instead of an alert.
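As a rough sketch only (SIZE_QUERY still stands for your full expression, and the group/alert names and the 15m hold time below are made-up placeholders), a count-based rule could look something like this:

    groups:
      - name: disk-usage                   # placeholder group name
        rules:
          - alert: FilesystemsAbove80Pct   # placeholder alert name
            # count() collapses everything into a single series, so this fires
            # one alert telling you how many filesystems are above 80% used.
            expr: count(SIZE_QUERY > 80)
            for: 15m                       # example hold time, tune to taste
            annotations:
              summary: "{{ $value }} filesystems are above 80% disk usage"

If you want to keep some labels, swap count(...) for count without(instance) (...), which keeps every label except instance on the result, at the cost of getting one series (and so one alert) per remaining label combination.

On Thursday, 22 October 2020 at 16:42:59 UTC+1 [email protected] wrote:

> sum by returns hundreds of results (one for each mountpoint) and they each
> appear to generate an alert event.
>
> When we try to consolidate that to a single event by using topk, the event
> changes rapidly between instances.
>
> This is the only thing in the logs showing a problem:
> level=warn ts=2020-10-22T14:10:32.912Z caller=delegate.go:272
> component=cluster msg="dropping messages because too many are queued"
> current=4103 limit=4096
>
> On Thursday, October 22, 2020 at 9:34:45 AM UTC-6 [email protected] wrote:
>
>> > But when the query returns many results, this causes alertmanager
>> > problems.
>> > sum by returns too many and brings alertmanager to its knees, which
>> > breaks our alerting in general
>>
>> This is a little too vague. What problems are you referring to? Are you
>> seeing performance issues with alertmanager when there are too many
>> alerts, or is it a usability problem when you get multiple alerts for the
>> same underlying problem?
>> On Thursday, 22 October 2020 at 16:24:56 UTC+1 [email protected]
>> wrote:
>>
>>> I have a query to find out if space is running out:
>>> (100 - (100 *
>>> node_filesystem_avail_bytes{job="special_host",mountpoint=~"/my_data/[a-zA-Z]*/.*"}
>>> /
>>> node_filesystem_size_bytes{job="special_host",mountpoint=~"/my_data/[a-zA-Z]*/.*"}))
>>>
>>> For simplicity, let's substitute this with SIZE_QUERY.
>>>
>>> This VM is very special because there are multiple metrics that are
>>> equivalent.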
>>> I have two categories of mounts on the host:
>>>
>>> This group of mounts shares the underlying storage and has duplicated
>>> values (note: for brevity only 2 out of many are included):
>>> {device="$DEVICE1",fstype="$FS1",instance="$INSTANCE1",job="special_host",mountpoint="/my_data/first"}
>>> 86.6186759625663
>>> {device="$DEVICE2",fstype="$FS1",instance="$INSTANCE1",job="special_host",mountpoint="/my_data/second"}
>>> 86.6186759625663
>>>
>>> This group of mounts does not share underlying storage:
>>> {device="$DEVICE3",fstype="$FS2",instance="$INSTANCE1",job="special_host",mountpoint="/var/log"}
>>> 85.1214545444532
>>>
>>> I want to alert when any single host is above the threshold. When the
>>> instance is not in the "shared" group, this is trivial. But when the
>>> query returns many results, this causes alertmanager problems.
>>>
>>> My promql knowledge is lacking on how to get around this limitation, but
>>> these are the things I've tried. Each has a problem:
>>>
>>> topk - flaps between each of the alerting instances as the labels change.
>>> topk(1, sum by (instance, mountpoint, device) (SIZE_QUERY) > 80)
>>>
>>> sum by - returns too many results and brings alertmanager to its knees,
>>> which breaks our alerting in general.
>>> sum by (device, instance) (SIZE_QUERY) > 80
>>> sum by (device, instance, mountpoint) (SIZE_QUERY) > 80
>>>
>>> max - doesn't show the labels, which makes it hard to debug the problem
>>> from the notification: what instance, what device?
>>> max(SIZE_QUERY > 80)
>>>
>>> Is there a possible solution to this I haven't considered?

