On Monday, April 8, 2024 at 11:05:41 PM UTC+2 Brian Candler wrote:

On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:

But for Prometheus, with keep_firing_for, it will be like the same alert.

If the alerts have the exact same set of labels (e.g. the alert is at the 
level of the RAID controller, not at the level of individual drives) then 

Which will still be quite often the case, I guess. Sometimes it may not 
matter, i.e. when a "new" alert (which has the same label set) is "missed" 
because of keep_firing_for, but sometimes it may.

It failed, it was fixed, it failed again within keep_firing_for: then you only 
get a single alert, with no additional notification.
But that's not the problem you originally asked for:
"When the target goes down, the alert clears and as soon as it's back, it 
pops up again, sending a fresh alert notification."

Sure, and this can be avoided with keep_firing_for, but as far as I can see 
only in some cases (since one wants to keep keep_firing_for shortish), and 
at the cost of losing information about when the alert condition actually 
went away (which Prometheus can, in principle, know) and came back while 
still firing.


keep_firing_for can be set differently for different alerts.  So you can 
set it to 10m for the "up == 0" alert, and not set it at all for the RAID 
alert, if that's what you want.
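
A minimal sketch of what that could look like (alert names, expressions and 
durations here are just illustrative, not taken from my actual config):

```yaml
groups:
  - name: example
    rules:
      # Flappy scrape target: keep the alert firing through short gaps,
      # at the cost of a delayed "resolved" notification.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        keep_firing_for: 10m
      # RAID alert: no keep_firing_for, so it resolves as soon as the
      # condition clears.
      - alert: RaidDegraded
        expr: node_md_disks{state="failed"} > 0
```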

If there were no other way than the current keep_firing_for (i.e. if my 
idea for an alternative keep_firing_for that considers the up/down state of 
the queried metrics isn't possible and/or reasonable), then rather than 
being able to set keep_firing_for per alert, I'd wish to be able to set it 
per queried instance.

For some of what I'm working on at the university, it might be nice to 
(automatically) query the status of an alert and take action while it 
fires; but then I'd also want that action to stop rather soon after the 
alert (actually) stops. If I have to use a longer keep_firing_for because 
of a set of unstable nodes, then either I get the penalty of unnecessarily 
long-firing alerts for all nodes, or I maintain different sets of alerts, 
which would be possible but also quite ugly.


Surely that delay is essential for the de-flapping scenario you describe: 
you can't send the alert resolved message until you are *sure* the alert 
has resolved (i.e. after keep_firing_for).

Conversely: if you sent the alert resolved message immediately (before 
keep_firing_for had expired), and the problem recurred, then you'd have 
to send out a new alert firing message - which is the flap noise I think 
you are asking to suppress.

Okay maybe we have a misunderstanding here, or better said, I guess there 
are two kinds of flapping alerts:

For example, assume an alert that monitors the utilised disk space on the 
root fs, and fires whenever that's above 80%.

Type 1 Flapping:
- The scraping of the metrics works all the time (i.e. `up` is 1 all the 
time).
- But IO is happening that causes the 80% threshold to be exceeded and then 
fallen below again every few seconds.

Type 2 Flapping:
- There is IO, but the utilisation is always above 80%; say it's already at 
~90% all the time.
- My scrapes fail every now and then.[0]

I honestly haven't even thought about type 1 yet. But I think these are the 
ones which would be perfectly solved by keep_firing_for.
Even there, I'd still like to be able to have keep_firing_for applied only 
to a given label set, e.g. something like: keep_firing_for: 10m 
on {alertname=~"regex-for-my-known-flapping-alerts"}
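
For the root-fs example above, a type-1 flap could at least be damped per 
rule like this (the 5m/10m durations are just an illustration):

```yaml
- alert: RootFsAlmostFull
  # Fires once usage has been above 80% continuously for 5 minutes...
  expr: |
    1 - node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} > 0.8
  for: 5m
  # ...and keeps firing for 10 minutes after the condition last held,
  # so short dips below 80% neither resolve nor re-fire the alert.
  keep_firing_for: 10m
```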

Type 2 is the one that causes me headaches right now.

That is why I thought earlier that it could be solved by something like 
keep_firing_for, but one that also takes into account whether any of the 
metrics it queries came from a target that is "currently" down - and only 
then lets keep_firing_for take effect.
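
In the meantime, one workaround I can think of for type 2 is to bridge the 
scrape gaps inside the expression itself, e.g. with max_over_time over a 
subquery (the 10m window is an assumption; it would need to be longer than 
the longest expected run of failed scrapes):

```yaml
- alert: RootFsAlmostFull
  # A missed scrape doesn't clear the alert, since the last good sample
  # is still inside the 10m window; the downside is that a genuine drop
  # below 80% only resolves once it has lasted the full window.
  expr: |
    max_over_time(
      (
        1 - node_filesystem_avail_bytes{mountpoint="/"}
          / node_filesystem_size_bytes{mountpoint="/"}
      )[10m:1m]
    ) > 0.8
```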


[0] I do have a number of hosts where this constantly happens; not really 
sure why, TBH, but even with a niceness of -20 and an IO-niceness of 0 
(though in the best-effort class) it happens quite often. The node is under 
high load (it's one of our compute nodes for the LHC Computing Grid)... so 
I guess maybe it's just "overloaded". So I don't think this will go away, 
and I somehow have to get it working with the scrapes failing every now 
and then.

What actually puzzled me more is this:
[image: Screenshot from 2024-04-09 00-24-59.png]
That's some of the graphs from the "Node Exporter Full" Grafana dashboard, 
all for one node (which is one of the flapping ones).
As you can see, Memory Basic and Disc Space Used Basic have a gap, where 
scraping failed.
My assumption was that, for a given target/instance, either scraping fails 
for all metrics or succeeds for all.
But here, only the right side plots have gaps, the left side ones don't.

Maybe that's just some consequence of these using counters and rate() or 
something similar.
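
That would at least be consistent with how rate() behaves, as far as I 
understand it: it only needs two samples somewhere in its window, so an 
isolated failed scrape gets bridged, while panels plotting a raw gauge show 
a gap as soon as the distance between samples exceeds the panel's 
interval/lookback. Roughly (example metrics/devices assumed, not taken from 
that dashboard):

```promql
# Bridges an isolated failed scrape, as long as the 4m window still
# contains at least two samples:
rate(node_network_receive_bytes_total{device="eth0"}[4m])

# Raw gauge: a failed scrape can leave a visible gap in the graph:
node_memory_MemAvailable_bytes
```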

You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.