Is the second instance still running? If you are having cluster communication issues, that could result in what you are seeing: both instances learn of an alert, but then one instance misses some of the renewal messages and so marks it resolved. Then it receives an update and the alert fires again.
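(For reference, a rough sketch of how a two-instance setup is usually wired up; the hostnames am1/am2 and the default ports are placeholders, not taken from this thread. Each Alertmanager lists every instance as a cluster peer, and Prometheus is configured to send alerts to both directly rather than relying on the mesh to forward them:

    # On each Alertmanager instance (am1/am2 are placeholder hostnames):
    alertmanager --config.file=alertmanager.yml \
      --cluster.listen-address=0.0.0.0:9094 \
      --cluster.peer=am1:9094 \
      --cluster.peer=am2:9094

    # In prometheus.yml, send alerts to every instance, not just one:
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['am1:9093', 'am2:9093']

If the peers cannot gossip with each other on the cluster port, each instance keeps its own view of which alerts are active and which notifications have been sent, which can produce exactly this resolve/re-fire pattern as well as the duplicate notifications mentioned later in the thread.)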
If you look in Prometheus (UI or the ALERTS metric), does the alert continue for the whole period, or does it have a gap? (A sample query is sketched after the quoted thread below.)

On 25 November 2020 14:58:50 GMT, "Yagyansh S. Kumar" <[email protected]> wrote:

> On Wed, 25 Nov, 2020, 8:26 pm Stuart Clark, <[email protected]> wrote:
>
>> How many Alertmanager instances are there? Can they talk to each other and
>> is Prometheus configured and able to push alerts to them all?
>
> Single instance as of now. I did set up an Alertmanager mesh of 2
> Alertmanagers, but I am facing a duplicate alert issue in that setup,
> another issue that is pending for me. Hence, currently only a single
> Alertmanager is receiving alerts from my Prometheus instance.
>
>> On 25 November 2020 14:07:41 GMT, "Yagyansh S. Kumar" <[email protected]> wrote:
>>
>>> Hi Stuart.
>>>
>>> On Wed, 25 Nov, 2020, 6:56 pm Stuart Clark, <[email protected]> wrote:
>>>
>>>> On 25/11/2020 11:46, [email protected] wrote:
>>>> > The alert formation doesn't seem to be a problem here, because it
>>>> > happens for different alerts randomly. Below is the alert for the
>>>> > exporter being down, for which it has happened thrice today.
>>>> >
>>>> >     - alert: ExporterDown
>>>> >       expr: up == 0
>>>> >       for: 10m
>>>> >       labels:
>>>> >         severity: "CRITICAL"
>>>> >       annotations:
>>>> >         summary: "Exporter down on *{{ $labels.instance }}*"
>>>> >         description: "Not able to fetch application metrics from *{{ $labels.instance }}*"
>>>> >
>>>> > - the ALERTS metric shows what is pending or firing over time
>>>> >
>>>> > But the problem is that one of my ExporterDown alerts has been active
>>>> > for the past 10 days; there is no genuine reason for the alert to go
>>>> > to a resolved state.
>>>>
>>>> What do you have evaluation_interval set to in Prometheus, and
>>>> resolve_timeout in Alertmanager?
>>>
>>> My evaluation interval is 1m, whereas my scrape timeout and scrape
>>> interval are 25s. Resolve timeout in Alertmanager is 5m.
>>>
>>>> Is the alert definitely being resolved, as in you are getting a resolved
>>>> email/notification, or could it just be an email/notification for a
>>>> long-running alert? You should get another email/notification every now
>>>> and then based on repeat_interval.
>>>
>>> Yes, I suspected that too in the beginning, but I am logging each and
>>> every alert notification and found that I am indeed getting a resolved
>>> notification for that alert and a firing notification again the very
>>> next second.
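(To check for a gap as suggested above, one way, just a sketch using the ExporterDown name from the rule quoted in this thread, is to graph the ALERTS series in the Prometheus UI over the affected window; the instance value below is a placeholder:

    # Pending/firing history for the alert on the affected target
    # ("app-host:9100" is a placeholder, substitute the real instance label):
    ALERTS{alertname="ExporterDown", instance="app-host:9100"}

    # Or count the firing series over time; a dip to zero marks a real gap:
    count(ALERTS{alertname="ExporterDown", alertstate="firing"})

If the series is continuous but notifications still flip between resolved and firing, the problem is more likely on the Alertmanager side, such as resolve_timeout, clustering, or alerts not reaching every instance, than in the rule itself.)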

