[prometheus-users] Re: Single Prometheus for Large Cluster

Brian Candler Tue, 07 Sep 2021 01:51:00 -0700

It's not clear what you mean by "No. of Nodes": whether you mean hosts 
(e.g. ones you're scraping using node_exporter), or pods, or something 
else.  But what matters is the total number of timeseries, and the amount 
of churn, i.e. the rate at which new timeseries are being created 
dynamically; and also how much querying is going on.


If you go to the Prometheus web interface, Status > TSDB Status, you'll get 
some statistics which may help you (for example, which metric names and 
labels account for the most timeseries).  Consider:

- collecting fewer metrics (by changing what you scrape, and/or using 
metric_relabel_configs to drop some timeseries which are not of interest)

- see if it's possible to reduce timeseries churn.  For example, if you 
have one application which is generating large numbers of short-lived pods 
then you may wish to reduce or suppress the metrics collected for those 
pods.

- have a look at the PromQL queries being executed, and whether any of 
these are using excessive amounts of RAM.  The query log 
<https://prometheus.io/docs/guides/query-log/> may help.  You can also 
limit how much memory queries can use, by capping the number of concurrent 
queries and the number of samples a single query may load:
      --query.max-concurrency=20  # default
      --query.max-samples=50000000  # default
(although that may cause the offending queries to fail)
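
To decide which metrics are worth dropping, it can help to pull the same 
numbers the TSDB Status page shows and sort them.  The following is only a 
rough sketch: it assumes Prometheus is reachable at http://localhost:9090 
(adjust for your setup) and reads the /api/v1/status/tsdb endpoint; the 
function names top_series and fetch_tsdb_status are made up for this 
example.

```python
import json
import urllib.request

# Assumption: adjust this to wherever your Prometheus is listening.
PROMETHEUS_URL = "http://localhost:9090"


def top_series(tsdb_status, key="seriesCountByMetricName", limit=10):
    """Return (name, count) pairs from one TSDB status list, largest first.

    `tsdb_status` is the decoded JSON response of /api/v1/status/tsdb.
    `key` selects which list to rank, e.g. "seriesCountByMetricName" or
    "labelValueCountByLabelName".
    """
    entries = tsdb_status.get("data", {}).get(key) or []
    pairs = [(e["name"], int(e["value"])) for e in entries]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:limit]


def fetch_tsdb_status(base_url=PROMETHEUS_URL):
    """Fetch the JSON behind the Status > TSDB Status page."""
    url = base_url + "/api/v1/status/tsdb"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```

Calling top_series(fetch_tsdb_status()) gives the metric names with the 
most timeseries; those are the first candidates to look at for dropping 
with metric_relabel_configs.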

There are also blog posts out there which you can turn up with a search, 
e.g.
https://source.coveo.com/2021/03/03/prometheus-memory/
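
For the query log suggestion above, here is a rough sketch of how you 
might rank logged queries.  It assumes the JSON-lines format shown in the 
query log guide, with the query text under params.query and the evaluation 
time under stats.timings.evalTotalTime; check those field names against 
your own log.  Note that evaluation time is only a proxy for cost, not a 
direct measure of memory.  The function name slowest_queries is made up 
for this example.

```python
import json


def slowest_queries(lines, limit=10):
    """Return (seconds, query) pairs for the slowest logged queries.

    `lines` is any iterable of query log lines (e.g. an open file).
    For each distinct query string, the worst evaluation time is kept.
    """
    worst = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(entry, dict):
            continue
        # Assumed field names, taken from the query log guide.
        query = entry.get("params", {}).get("query")
        timings = entry.get("stats", {}).get("timings", {})
        seconds = timings.get("evalTotalTime")
        if query is None or seconds is None:
            continue
        worst[query] = max(worst.get(query, 0.0), float(seconds))
    ranked = sorted(((s, q) for q, s in worst.items()), reverse=True)
    return ranked[:limit]
```

You would pass it an open file handle for whatever path you set as 
query_log_file in the global section of prometheus.yml.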

On Tuesday, 7 September 2021 at 07:34:51 UTC+1 [email protected] wrote:

> Hi everyone, I am new here.
>
> I would like to seek some advice on the design approach we should take.
> With the given problem below, in terms of cost, how can we set up 
> Prometheus with a large cluster.
>
> *Variables:*
> *Installation:* kube-prometheus-stack Helm chart.
> *Autoscale:* yes
> *No. of Nodes:* 1000 up to 1300
> *Mesh:* Istio
> *Memory Usage:* 50GB (Still gets OOM)
> *Installed:* 1 Prometheus, 1 Kiali, 1 Grafana and 1 Jaeger
>
> *Issue:*
> 1. We cannot move Prometheus to a larger node, as 60GB memory is already 
> expensive.  (cost not approved by management)
> 2. Removing unnecessary metrics is not yet advised because we do not know 
> which metrics of istio, jaeger and kiali are needed.
>
> *Tried solution:*
> We have federated the single instance of prometheus with Thanos Receivers, 
> however, the issue is still there because kiali queries its data directly 
> from prometheus which eventually gets OOM.
>
> *Question:*
> We are thinking of firing up multiple prometheus for each namespace and 
> adding thanos-sidecar with the same scrape config since thanos will 
> deduplicate all duplicated metrics. This approach would solve the issue in 
> Grafana queries but not in Kiali. 
>
> How can we set up multiple (low-cost) Prometheus instances while still 
> giving Kiali a single Prometheus that covers the whole cluster?
>
> Appreciate any help. Thank you.
