On 05 Mar 01:17, Daniel Swarbrick wrote:
> By default, Alertmanager will consider alerts resolved if 5 minutes or more
> elapse without the alert firing (resolve_timeout config option).
>
> If your Prometheus instance crashes and takes more than 5 minutes to
> restart, it's highly likely that any previously firing alerts will be
> "resolved". If the alerting rule conditions still exist after the restart,
> new alerts will be fired.

Except that another Prometheus server was still sending the alerts, so
that is not likely the explanation :(

But the server was in pretty bad shape, so maybe the Alertmanager on the
same host was foobar too during that time.

> On Wednesday, March 4, 2020 at 12:45:11 PM UTC+1, Julien Pivotto wrote:
> >
> > On 04 Mar 12:39, Julien Pivotto wrote:
> > > On 04 Mar 12:38, Julien Pivotto wrote:
> > > > Hello there,
> > > >
> > > > We are running a pair of HA prometheis and HA alertmanagers.
> > > >
> > > > One prometheus server OOM'd and restarted. When it was down, we
> > > > received alert resolution notifications from the alertmanager:
> > > >
> > > > > resolved (duration: 115h45m0s)
> > > >
> > > > But a few seconds after:
> > > >
> > > > > firing (duration: 115h52m16s)
> > > >
> > > > I would have expected that the second prometheus, which had the alert
> > > > all the time and was working as expected, would have prevented the
> > > > alert from disappearing.
> > > >
> > > > Note that the alert does NOT have a `for` clause.
> > > >
> > > > There is an entry at 9:44:39, then the server drops, and the alert is
> > > > firing again at 9:53. Note: We received the new "firing" at 9:52, with
> > > > 115h52m16s of duration included.
> > > >
> > > > Both Prometheus servers send alerts to both alertmanagers.
> > > >
> > > >
> > > > What can have happened here?
> > > >
> > > > Our evaluation_interval is 1m, and resend-delay is default.
> > > >
> > > > --
> > > >  (o-    Julien Pivotto
> > > >  //\    Open-Source Consultant
> > > >  V_/_   Inuits - https://www.inuits.eu
> > > >
> > > > --
> > > > You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
> > > > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> > > > To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/20200304113821.GA19241%40oxygen.
> > > >
> > >
> > > Note: alertmanagers are 0.20.0 pulled from GH releases and both
> > > prometheus are 2.16.0 pulled from GH releases too.
> > >
> >
> > When I look at the metrics, it looks like
> > rate(alertmanager_alerts_received_total[5m]) is showing a lot of
> > 'resolved' at that time. Is it possible that Prometheus somehow sends
> > resolved alerts when TSDB is not yet ready? And because those rules were
> > running for a long time, we tried to restore them?
> >
> > regards,

-- 
 (o-    Julien Pivotto
 //\    Open-Source Consultant
 V_/_   Inuits - https://www.inuits.eu

-- 
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/20200305093509.GA26460%40oxygen.

