Hi,

I'm trying to figure out why data is sometimes missing from a dashboard 
backed by Prometheus. Our setup is a more or less standard 
prometheus-operator Helm chart. It defines the following recording rule:

record: instance:node_cpu_utilisation:rate1m
expr: 1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m]))

There are 9 nodes in the cluster, but the dashboard that displays this 
metric only shows 7 of them. Switching the dashboard to the raw expression 
shows all 9 nodes as expected. Noteworthy things:


   - there are no exceptions in the log and no failed rule evaluations
   - the issue has shown up (almost) consistently for more than 2 hours now
   - on two occasions in this period, one of the missing nodes appeared in 
   the recorded series for what seems to be one scrape interval and then 
   dropped out again immediately
   - after a Prometheus restart, the issue persists
   - other rules defined in the same group seem to be affected in the 
   same way (e.g. *instance:node_network_receive_bytes_excluding_lo:rate1m*, 
   which calculates network usage in the same fashion)
   
This cluster suffered some performance issues in the past and had the 
scrape/evaluation interval extended to 90s. During that period 
*instance:node_cpu_utilisation:rate1m* didn't record any data (because its 
1m range was shorter than the actual scrape interval, so the range never 
held the two samples that rate() needs). The problem became apparent after 
switching back to the original 30s scrape/evaluation interval. At that 
point all 9 nodes should have had their CPU usage displayed correctly, but 
only 7 appeared.
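
The interaction between the 1m range and the scrape interval can be 
illustrated with a small Python sketch. It is a simplified model, not 
Prometheus code: it assumes perfectly regular scrapes and a window that is 
closed at both ends (the exact boundary handling in Prometheus may differ), 
and the function names are hypothetical.

```python
import math


def scrapes_in_window(eval_time, range_s, scrape_interval, first_scrape=0):
    """Timestamps of scrapes that fall inside [eval_time - range_s, eval_time]."""
    start = eval_time - range_s
    # Index of the first scrape at or after the window start (never negative).
    k = max(math.ceil((start - first_scrape) / scrape_interval), 0)
    timestamps = []
    t = first_scrape + k * scrape_interval
    while t <= eval_time:
        timestamps.append(t)
        t += scrape_interval
    return timestamps


def rate_has_result(eval_time, range_s, scrape_interval, first_scrape=0):
    """rate() needs at least two samples in the range to return a value."""
    return len(scrapes_in_window(eval_time, range_s, scrape_interval, first_scrape)) >= 2


# 1m range with a 90s scrape interval versus a 30s scrape interval.
print(rate_has_result(900, 60, 90))
print(rate_has_result(900, 60, 30))
```

Under these assumptions a 60s window never contains two samples at a 90s 
interval, while at a 30s interval it always contains at least two as long 
as no scrape is missed.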

Has anybody encountered a similar situation?

Thanks,
Vojta
