It's not clear what you mean by "No. of Nodes" - whether you mean hosts (e.g. ones you're scraping using node_exporter), or pods, or something else. But what matters is the total number of metrics (i.e. active timeseries), and the amount of metric churn, i.e. the rate at which new timeseries are being created dynamically; and also how much querying is going on.
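Both the series count and the churn rate are exposed by Prometheus about itself, so one rough way to keep an eye on them is a pair of alerting rules. This is only a sketch: it assumes Prometheus scrapes its own /metrics endpoint, and the group name, alert names and thresholds are invented for illustration and would need tuning to your setup.

```yaml
groups:
  - name: prometheus-self-monitoring      # hypothetical group name
    rules:
      # Total number of series currently held in the head block.
      - alert: PrometheusTooManyActiveSeries   # hypothetical alert name
        expr: prometheus_tsdb_head_series > 10000000   # threshold is an assumption
        for: 15m
      # Churn: how many new series are being created per second.
      - alert: PrometheusHighSeriesChurn       # hypothetical alert name
        expr: rate(prometheus_tsdb_head_series_created_total[5m]) > 1000   # threshold is an assumption
        for: 15m
```

The same two expressions, without the comparison, can be graphed directly to see how the numbers move over a day.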
If you go to the Prometheus web interface, Status > TSDB Status, you'll get some statistics which may help you.

Consider:

- collecting fewer metrics (by changing what you scrape, and/or using metric_relabel_configs to drop some timeseries which are not of interest)
- seeing if it's possible to reduce timeseries churn. For example, if you have one application which is generating large numbers of short-lived pods then you may wish to reduce or suppress the metrics collected for those pods.
- having a look at the PromQL queries being executed, and whether any of these are using excessive amounts of RAM. The query log <https://prometheus.io/docs/guides/query-log/> may help. You can also apply limits to how much memory is used by individual queries using

    --query.max-concurrency=20    # default
    --query.max-samples=50000000  # default

  (although that may cause the offending queries to fail)

There are also blog posts out there which you can turn up with a search, e.g.
https://source.coveo.com/2021/03/03/prometheus-memory/

On Tuesday, 7 September 2021 at 07:34:51 UTC+1 [email protected] wrote:

> Hi everyone, I am new here.
>
> I would like to seek some advice on the design approach we should take.
> With the given problem below, in terms of cost, how can we set up
> Prometheus with a large cluster.
>
> *Variables:*
> *Installation: *Kube-stack-prometheus helm chart.
> *Autoscale*: yes
> *No. of Nodes*: 1000 up to 1300
> *Mesh*: Istio
> *Memory Usage:* 50GB (Still gets OOM)
> *Installed: *1 Prometheus, 1 Kiali, 1 Grafana and 1 Jaeger
>
> *Issue:*
> 1. We cannot expand a larger node for Prometheus as 60GB memory is already
> expensive. (cost not approved by management)
> 2. Removing unnecessary metrics is not yet advised because we do not know
> which metrics of istio, jaeger and kiali are needed.
>
> *Tried solution:*
> We have federated the single instance of prometheus with Thanos Receivers,
> however, the issue is still there because kiali queries its data directly
> from prometheus which eventually gets OOM.
>
> *Question:*
> We are thinking of firing up multiple prometheus for each namespace and
> adding thanos-sidecar with the same scrape config since thanos will
> deduplicate all duplicated metrics. This approach would solve the issue in
> Grafana queries but not in Kiali.
>
> How can we set up a multiple prometheus (low cost) but single instance
> prometheus for kiali (whole cluster)?
>
> Appreciate any help. Thank you.
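To illustrate the suggestions in the reply about dropping timeseries and suppressing metrics from short-lived pods, here is a minimal scrape configuration sketch. The job name, metric names and namespace are hypothetical placeholders, not recommendations about which Istio, Jaeger or Kiali metrics are safe to drop. It also assumes a hand-written scrape config; with the helm chart the equivalent rules would probably have to go into whatever objects the chart uses to generate scrape jobs.

```yaml
scrape_configs:
  - job_name: example-pods               # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Applied before the scrape: skip pods in a namespace that
      # produces many short-lived pods, so they are never scraped.
      - source_labels: [__meta_kubernetes_namespace]
        regex: 'batch-jobs'              # hypothetical namespace
        action: drop
    metric_relabel_configs:
      # Applied after the scrape, before storage: drop whole metrics
      # by name. The names below are placeholders.
      - source_labels: [__name__]
        regex: 'example_debug_.*|example_unused_total'
        action: drop
```

Dropping at the target level (relabel_configs) saves the scrape itself, whereas metric_relabel_configs still fetches every sample and only discards the matching series before they are stored.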

