I have a scrape job for node_exporter with "scrape_interval: 1m" and ~100 
targets.
Some metrics from that scrape are used to power a recording rule:
sum without(cpu) (rate(node_cpu_seconds_total[2m]))
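
For reference, the relevant config looks roughly like this (job name, group 
name and record name are just placeholders; the expression and interval are 
the real ones):

  # prometheus.yml
  scrape_configs:
    - job_name: node_exporter           # ~100 targets behind this job
      scrape_interval: 1m
      static_configs:
        - targets: ['host1:9100', 'host2:9100']   # placeholder targets

  # rules.yml
  groups:
    - name: node_cpu                    # placeholder group name
      rules:
        - record: instance:node_cpu:rate2m        # placeholder record name
          expr: sum without(cpu) (rate(node_cpu_seconds_total[2m]))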

It turns out that when this rule evaluates, it produces a time series for 
every instance except one. Only occasionally (every 20-60 minutes) does that 
one instance get a sample recorded, so its graph is dots all over instead of 
a continuous line.

When I manually run sum without(cpu) (rate(node_cpu_seconds_total[2m])) I 
get results for all instances, including the affected one, so the issue only 
manifests when the recording rule is evaluated.
Rule evaluation metrics from Prometheus don't show any problems: no missed 
iterations, no failures, and the logs are clean.
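
In case it matters, these are the queries I used to check the rule 
evaluation metrics (both stay flat at zero for the whole period):

  rate(prometheus_rule_evaluation_failures_total[5m])
  rate(prometheus_rule_group_iterations_missed_total[5m])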

Now, I know that rate() needs at least 2 samples, so rate(...[2m]) with 
scrape_interval: 1m only works if everything is perfectly aligned.
If the problem is rate() not getting both samples, then I'm not sure why a 
range query would work here. Do range queries and rule evaluations query the 
data differently?
And how does staleness play out here? Will a rule evaluation query data with 
the 5m look-back, or does it use a more "instant" query mechanism?
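
To check whether the 2m window sometimes ends up with only one sample for 
that instance, I'm planning to graph something like this (the instance/cpu/ 
mode values are placeholders for the affected target):

  count_over_time(node_cpu_seconds_total{instance="affected-host:9100", cpu="0", mode="idle"}[2m])

If that ever drops to 1 around the rule's evaluation timestamps, rate() 
would return nothing for that window, which would explain the gaps.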
