Hello Brian, thank you for the reply. We have now deployed Prometheus into each EKS cluster and grouped them under one Grafana. Looks good so far. Many thanks. Will check the behavior of this setup in the future.
On Friday, July 3, 2020 at 10:34:43 AM UTC+3, Brian Candler wrote:
> Maybe you are just collecting a lot of metrics in a single prometheus
> instance. There's a tool which will give you an estimate of RAM usage here:
> https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>
> For disk space, I'd start with an estimate of 1.7 bytes per metric sample -
> so that usage depends on your scrape interval. You say it's growing at
> about 900MB/hour; if you were using a 15-second scrape interval that
> implies about 2.2m metrics, which is quite high to be putting into one
> prometheus instance (the recommended maximum is 2 million).
>
> So the first thing to check is how many metrics you're *actually*
> collecting, and also whether you have a high churn rate in time series
> (i.e. lots of pods starting and stopping). You can get this info from the
> prometheus GUI under "status > runtime & build info". Look especially at
> "Head Stats".
>
> Your 30GB RAM usage suggests high series churn. Beware that if you are
> monitoring pod-level metrics, every pod is unique, so will generate its own
> set of timeseries. If you have 10 pods destroyed and created per minute,
> and each pod generates 10K metrics, that's 6 million new time series every
> hour. At any instant not all of these will be active, but the "head" chunk
> typically carries the last 2 hours' worth of timeseries. The solution is
> not to churn pods so much, or else filter the data collection so you're
> collecting much less pod-level data.
>
> If you are sure that the number of series you're collecting is much lower
> than 2m, then there may be a problem. Please report the stats, the *exact*
> version of prometheus you're running, and also show any logs generated by
> prometheus itself.
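The back-of-envelope figures quoted above (1.7 bytes per sample, ~900MB/hour growth, an assumed 15-second scrape interval, and the pod-churn example) can be reproduced with a few lines of Python; this is my own sketch to check the arithmetic, not something from the original thread:

```python
# Disk-growth estimate: how many active series does ~900MB/hour imply,
# assuming 1.7 bytes per sample and a 15-second scrape interval?
BYTES_PER_SAMPLE = 1.7           # rough on-disk cost per sample (from the quote)
GROWTH_BYTES_PER_HOUR = 900e6    # observed disk growth: ~900MB/hour
SCRAPE_INTERVAL_S = 15           # assumed scrape interval

samples_per_series_per_hour = 3600 / SCRAPE_INTERVAL_S           # 240 samples/series/hour
samples_per_hour = GROWTH_BYTES_PER_HOUR / BYTES_PER_SAMPLE
active_series = samples_per_hour / samples_per_series_per_hour
print(f"implied active series: {active_series / 1e6:.1f} million")  # ~2.2 million

# Churn estimate from the quote: 10 pods created+destroyed per minute,
# each contributing 10K unique series.
pods_per_minute = 10
series_per_pod = 10_000
new_series_per_hour = pods_per_minute * series_per_pod * 60
print(f"new series per hour: {new_series_per_hour / 1e6:.0f} million")  # 6 million
```

Both results match the numbers in the email, which is why the 2-million-series rule of thumb and the churn warning both point at splitting the load across per-cluster instances.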
> If you are in fact collecting millions of timeseries (and wish to keep
> them all rather than dropping some), then as I said before this is more
> than is recommended for a single prometheus instance. If you have 5
> clusters then it sounds like you'd be better off with a separate prometheus
> per cluster, especially as they are in separate AWS accounts. You can still
> have a single Grafana instance, which either queries them individually, or
> uses something like promxy to combine them, or use federation to collect a
> subset of metrics into a separate prometheus for a global view, or you can
> look at higher-performance add-ons like Thanos.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/8f577653-a092-4476-b0be-8b2caeed74c1o%40googlegroups.com.

