> I am encountering challenges with configuring Prometheus and Alertmanager 
> for my application's alarm system. Below are the configurations I am 
> currently using:
>
> *prometheus.yml:* 
>
> Scrape Interval: 1h

This scrape interval is far too long. Although it's not well documented,
you can't set scrape_interval to more than two or three minutes without
causing seriously weird issues, where your rules may not see metrics
because Prometheus considers the metrics stale. Prometheus considers
metrics stale if the most recent sample is more than five minutes old;
as far as I know this window can only be changed globally (with the
--query.lookback-delta command line option), not per target or per
rule. I believe you've already seen signs of this in your other
problems, but really, as far as I know a configuration like this
basically isn't supported.

(In my view this is such a problem that Prometheus should at least
require a forced 'I know what I'm doing, really I want this' command
line option before accepting a scrape interval that's larger than the
staleness interval, or maybe even one that comes within ten seconds or
so of it.)

I believe that all rule evaluation intervals similarly need to be no
more than five minutes because of the same staleness issue, since both
recording rules and alerting rules generate metrics (recording rules
generate their metrics in the obvious way, while alerting rules
generate ALERTS metrics and some related ones, such as
ALERTS_FOR_STATE). It's possible that alerts don't go stale inside
Prometheus even when their metrics do, but I wouldn't count on it.

(Although it's possible that metrics from recording and/or alert rules
are special and are exempted from staleness, I would be surprised.)
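To make the rule side concrete, here is a sketch of a rules file (all
names made up) where the group's evaluation interval stays under five
minutes:

  groups:
    - name: example-rules       # made-up group name
      interval: 60s             # keep this under five minutes as well
      rules:
        - record: job:up:count            # made-up recording rule
          expr: count by (job) (up == 1)
        - alert: InstanceDown             # made-up alerting rule
          expr: up == 0
          for: 5m

While the alert is pending or firing, the alerting rule here will write
ALERTS{alertname="InstanceDown",...} samples (and ALERTS_FOR_STATE
ones) on every group evaluation, which is why the evaluation interval
matters for staleness just as much as the scrape interval does.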

Prometheus scrapes different targets at different offsets within their
scrape interval, so you can't synchronize scrapes and rule evaluations
the way you apparently want to. The time offset for any particular
scrape target is deterministic but not predictable (and it may change
between eg Prometheus releases, or even on a Prometheus restart).
Prometheus does this to spread out the load of scraping more or less
evenly across the scrape interval, rather than descending on all targets
simultaneously every X seconds or minutes.
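You can see this spreading out for yourself. Assuming a 60-second
scrape interval, an instant query like the following (using the
standard up metric) will show a different offset within the interval
for each target:

  # seconds past the start of each 60-second scrape interval
  timestamp(up) % 60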

I believe that recording and alerting rule groups similarly may be
evaluated at different offsets within their evaluation interval, for
the same reason of spreading the internal load of rule evaluation out
over time.
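If your Prometheus scrapes itself (the usual setup), you can see the
same thing for rule groups through the
prometheus_rule_group_last_evaluation_timestamp_seconds metric; again
assuming 60-second evaluation intervals, each group should show up at
its own offset:

  # offset of each rule group's last evaluation within a 60-second interval
  prometheus_rule_group_last_evaluation_timestamp_seconds % 60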

        - cks
