Re: [prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

Brian Candler Sat, 07 Nov 2020 08:41:45 -0800

On Saturday, 7 November 2020 13:35:47 UTC, Yagyansh S. Kumar wrote:
>
> > Try looking at scrape_duration_seconds{job="Ping-All-Servers"}.  Maybe
> > it's borderline to the scrape interval.
>
> That's interesting. Here are the top 20 scrape_duration_seconds maxed
> for last 1 hour by instance. Close to 5 seconds. Can this lead to some
> issue?
>


Possibly. Maybe the scrape timeout handling has changed slightly between 
those versions of Prometheus.  I would in any case be concerned about the 
scrape duration being so close to the scrape interval, although failed 
scrapes should still show as "up == 0".

However, I note that the scrape.yml you posted shows the Ping-All-Servers 
job with a scrape interval of 10s, not 5s.

I also notice your module config has:

  icmp_prober:
   prober: icmp
   timeout: 30s
   icmp:
     preferred_ip_protocol: ip4 


I *think* the timeout is clipped to just under the scrape interval, so it 
should work, but I'd be inclined to set it lower anyway (say 3s); if you 
don't get a reply within 3s, you're unlikely to get one.
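
As a rough sketch of that change (only the timeout differs from the module 
quoted above; "modules:" is the usual top-level key in the blackbox_exporter 
config file, and 3s is a suggested starting point, not a tested value):

  modules:
    icmp_prober:
      prober: icmp
      timeout: 3s    # was 30s; assumed value, keep it well below the scrape interval
      icmp:
        preferred_ip_protocol: ip4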

Since this test only does one ping, I would *expect* it to fail from time 
to time, and hence the alert to go into "pending" state until the "for: 1m" 
has run its course.
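
For illustration, a rule of roughly this shape would behave that way. The 
group name, alert name and expression are assumptions made up for the 
example (probe_success is the standard blackbox_exporter result metric); 
only the "for: 1m" comes from this thread:

  groups:
    - name: ping                 # hypothetical group name
      rules:
        - alert: HostPingFailed  # hypothetical alert name
          # assumed expression: the probe for this job reported failure
          expr: probe_success{job="Ping-All-Servers"} == 0
          # the alert stays "pending" until the expression has held for 1m
          for: 1m

With a rule like this, a single lost ping produces one failing sample, so 
the alert shows as pending and then drops back to inactive without firing, 
provided the following probes succeed.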
