On Monday, April 8, 2024 at 11:05:41 PM UTC+2 Brian Candler wrote:

On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:

But for Prometheus, with keep_firing_for, it will look like the same alert.


If the alerts have the exact same set of labels (e.g. the alert is at the 
level of the RAID controller, not at the level of individual drives) then 
yes.


Which will still quite often be the case, I guess. Sometimes it may not 
matter, i.e. when a "new" alert (which has the same label set) is "missed" 
because of keep_firing_for, but sometimes it may.
 

It failed, it got fixed, it failed again within keep_firing_for: then you only 
get a single alert, with no additional notification.
But that's not the problem you originally asked for:
"When the target goes down, the alert clears and as soon as it's back, it 
pops up again, sending a fresh alert notification."


Sure, and this can be avoided with keep_firing_for, but as far as I can see 
only in some cases (since one wants to keep keep_firing_for shortish), and 
at the cost of losing information about when the alert condition actually 
went away (which Prometheus can in principle know) and came back while 
still firing.

 

keep_firing_for can be set differently for different alerts.  So you can 
set it to 10m for the "up == 0" alert, and not set it at all for the RAID 
alert, if that's what you want.
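
Sure - i.e. something along these lines (just sketching; the RAID metric 
name is only a placeholder):

- alert: InstanceDown
  expr: up == 0
  for: 2m
  keep_firing_for: 10m               # de-flap only this alert

- alert: RaidDegraded
  expr: some_raid_status_metric != 0 # placeholder metric name
  # no keep_firing_for here, so this one resolves immediately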


But if there is no other way than the current keep_firing_for - i.e. if my 
idea of an alternative keep_firing_for that considers the up/down state of 
the targets behind the queried metrics isn't possible and/or reasonable - 
then, rather than being able to set keep_firing_for per alert, I'd wish to 
be able to set it per queried instance.

For some of the cases I'm working on at the university, it might be nice to 
(automatically) query the status of an alert and take action while it 
fires, but then I'd also like to stop that action fairly soon after the 
alert (actually) stops. If I have to use a longer keep_firing_for because 
of a set of unstable nodes, then either I get the penalty of unnecessarily 
long-firing alerts for all nodes, or I maintain different sets of alerts, 
which would be possible but also quite ugly.
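
(What I mean by "query the status of an alert" is basically polling 
Prometheus' synthetic ALERTS series and acting only while something is 
returned - just a sketch, the alert name is a placeholder:)

# instant query via the HTTP API or the console; returns one series per
# currently pending/firing alert instance (with its labels), nothing otherwise
ALERTS{alertname="SomeAlert", alertstate="firing"}
# note: with keep_firing_for set, the alert also shows up here as firing
# for the whole grace period - which is exactly the delay I'd like to avoid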


  

Surely that delay is essential for the de-flapping scenario you describe: 
you can't send the alert resolved message until you are *sure* the alert 
has resolved (i.e. after keep_firing_for).

Conversely: if you sent the alert resolved message immediately (before 
keep_firing_for had expired) and the problem recurred, then you'd have 
to send out a new alert failing message - which is the flap noise I think 
you are asking to suppress.


Okay, maybe we have a misunderstanding here, or rather, I guess there are 
two kinds of flapping alerts:

For example, assume an alert that monitors the utilised disk space on the 
root fs, and fires whenever that's above 80%.
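
(Roughly like this, as a minimal sketch assuming node_exporter's filesystem 
metrics; the alert name and threshold are just placeholders:)

- alert: RootFsUsageHigh
  expr: |
    100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}
               / node_filesystem_size_bytes{mountpoint="/"}) > 80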

Type 1 Flapping:
- The scraping of the metrics works all the time (i.e. `up` is 1 
throughout).
- But IO is happening that causes the utilisation to exceed 80% and then 
fall below it again every few seconds.

Type 2 Flapping:
- There is IO, but the utilisation is always above 80%; say it's already at 
~90% all the time.
- My scrapes fail every now and then[0].

I honestly hadn't even thought about type 1 yet, but I think that's the 
kind which would be perfectly solved by keep_firing_for.
Even there, though, I'd still like to be able to have keep_firing_for 
applied only to a given label set, e.g. something like: keep_firing_for: 10m 
on {alertname=~"regex-for-my-known-flapping-alerts"}
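
(The closest I can see today is to bake the label selection into the rules 
themselves, i.e. duplicate the rule - which is exactly the "different sets 
of alerts" ugliness from above; sketch only, the instance regex is a 
placeholder:)

# flapping nodes get the grace period ...
- alert: RootFsUsageHigh
  expr: |
    100 * (1 - node_filesystem_avail_bytes{mountpoint="/", instance=~"flaky-.*"}
               / node_filesystem_size_bytes{mountpoint="/", instance=~"flaky-.*"}) > 80
  keep_firing_for: 10m

# ... everything else doesn't
- alert: RootFsUsageHigh
  expr: |
    100 * (1 - node_filesystem_avail_bytes{mountpoint="/", instance!~"flaky-.*"}
               / node_filesystem_size_bytes{mountpoint="/", instance!~"flaky-.*"}) > 80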

Type 2 is the one that causes me headaches right now.

That is why I thought earlier that it could be solved by something like 
keep_firing_for that also takes into account whether any of the metrics it 
queries come from a target that is "currently" down - and only then lets 
keep_firing_for take effect.
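
(In the meantime, the closest approximation I can think of - just a sketch, 
not something I've tested, and not the behaviour I'm proposing above - would 
be to evaluate the alert on the last sample seen within some window via 
last_over_time(), so that a few failed scrapes don't make the series go 
stale and the alert simply keeps firing on the last known value:)

- alert: RootFsUsageHigh
  expr: |
    100 * (1 - last_over_time(node_filesystem_avail_bytes{mountpoint="/"}[15m])
               / last_over_time(node_filesystem_size_bytes{mountpoint="/"}[15m])) > 80
  # if the target stays down for more than 15m, the series vanishes
  # and the alert resolves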


Thanks,
Chris.


[0] I do have a number of hosts where this constantly happens, not really 
sure why TBH, but even with a niceness of -20 and an IO niceness of 0 
(though in the best-effort class) it happens quite often. The node is under 
high load (it's one of our compute nodes for the LHC Computing Grid)... so I 
guess maybe it's just "overloaded". So I don't think this will go away, and I 
somehow have to get it working with the scrapes failing every now and then.

What actually puzzled me more is this:
[image: Screenshot from 2024-04-09 00-24-59.png]
That's some of the graphs from the Node Exporter Full Grafana dashboard, 
all for one node (which is one of the flapping ones).
As you can see, Memory Basic and Disk Space Used Basic have a gap where 
scraping failed.
My assumption was that - for a given target & instance - either scraping 
fails for all metrics or succeeds for all.
But here, only the right-hand plots have gaps; the left-hand ones don't.

Maybe that's just some consequence of these using counters and rate() or 
irate()?
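
(Presumably yes: the right-hand panels query the gauges directly, and as 
soon as a scrape fails the instant lookups go stale and leave a gap, while 
the left-hand panels use rate()/irate() over a range like [5m], which only 
needs two samples inside the window and therefore bridges a single missed 
scrape. Panel queries roughly like this - metric names just as examples:)

# gauge panel: plots the raw samples; a failed scrape leaves a visible gap
node_memory_MemAvailable_bytes{instance="..."}

# rate panel: a single missed scrape is bridged as long as two samples
# still fall inside the 5m window
rate(node_network_receive_bytes_total{instance="..."}[5m])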
