Re: [prometheus-users] Alerts resolved upon prometheus crash

Daniel Swarbrick Thu, 05 Mar 2020 01:17:22 -0800

By default, Alertmanager will consider an alert resolved if 5 minutes or more 
elapse without the alert firing (the resolve_timeout config option).


If your Prometheus instance crashes and takes more than 5 minutes to 
restart, it's highly likely that any previously firing alerts will be 
"resolved". If the alerting rule conditions still exist after the restart, 
new alerts will be fired.
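
To make the timing concrete, here is a minimal Python sketch of the timeout 
rule described above, using the two timestamps quoted below. It is a 
simplified model of a single sender, not Alertmanager's actual code: the 
function name is_resolved is hypothetical, and the date is assumed from the 
day the original message was posted.

```python
from datetime import datetime, timedelta

# Default value of Alertmanager's resolve_timeout option.
RESOLVE_TIMEOUT = timedelta(minutes=5)


def is_resolved(last_received: datetime, now: datetime,
                timeout: timedelta = RESOLVE_TIMEOUT) -> bool:
    """Hypothetical helper: True once `timeout` or more has passed
    since the alert was last received."""
    return now - last_received >= timeout


# Times from the thread; the calendar date is an assumption.
last_seen = datetime(2020, 3, 4, 9, 44, 39)    # last entry before the crash
firing_again = datetime(2020, 3, 4, 9, 53, 0)  # alert firing again

gap = firing_again - last_seen
print(gap)                                    # 0:08:21
print(is_resolved(last_seen, firing_again))   # True
```

Under this model the gap of 8m21s exceeds the 5 minute timeout, so a 
"resolved" notification followed by a new "firing" one is what you would see 
from a single Prometheus. It does not model the second Prometheus in an HA 
pair.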

On Wednesday, March 4, 2020 at 12:45:11 PM UTC+1, Julien Pivotto wrote:
>
> On 04 Mar 12:39, Julien Pivotto wrote: 
> > On 04 Mar 12:38, Julien Pivotto wrote: 
> > > Hello there, 
> > > 
> > > We are running a pair of HA prometheis and HA alertmanagers. 
> > > 
> > > One prometheus server OOM'd; and restarted. When it was down, we 
> > > received alert resolution notifications from the alertmanager: 
> > > 
> > > > resolved (duration: 115h45m0s) 
> > > 
> > > But a few seconds after: 
> > > 
> > > > firing (duration: 115h52m16s) 
> > > 
> > > I would have expected that the second prometheus, which had the alert 
> > > all the time and was working as expected, would have prevented the 
> > > alert from disappearing. 
> > > 
> > > Note that the alert does NOT have a `for` clause. 
> > > 
> > > There is an entry at 9:44:39, then the server drops, and the alert is 
> > > firing again at 9:53. Note: We received the new "firing" at 9:52, with 
> > > a reported duration of 115h52m16s. 
> > > 
> > > Both Prometheus servers send alerts to both alertmanagers. 
> > > 
> > > 
> > > What can have happened here? 
> > > 
> > > Our evaluation_interval is 1m, and resend-delay is default. 
> > > 
> > > -- 
> > >  (o-    Julien Pivotto 
> > >  //\    Open-Source Consultant 
> > >  V_/_   Inuits - https://www.inuits.eu 
> > > 
> > > -- 
> > > You received this message because you are subscribed to the Google 
> > > Groups "Prometheus Users" group. 
> > > To unsubscribe from this group and stop receiving emails from it, send 
> > > an email to [email protected]. 
> > > To view this discussion on the web visit 
> > > https://groups.google.com/d/msgid/prometheus-users/20200304113821.GA19241%40oxygen.
>  
>
> > 
> > Note: alertmanagers are 0.20.0 pulled from GH releases and both 
> > prometheus are 2.16.0 pulled from GH releases too. 
>
>
> When I look at the metrics, it looks like 
> rate(alertmanager_alerts_received_total[5m]) is showing a lot of 
> 'resolved' at that time. Is it possible that Prometheus somehow sends 
> resolved alerts when TSDB is not yet ready? And because those rules were 
> running for a long time, we tried to restore them? 
>
> regards, 
>
>
> -- 
>  (o-    Julien Pivotto 
>  //\    Open-Source Consultant 
>  V_/_   Inuits - https://www.inuits.eu 
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c78909f5-1f22-4e2a-a276-794408a8dae5%40googlegroups.com.
