Re: [prometheus-users] Re: Single Prometheus for Large Cluster

Brian Candler Thu, 23 Sep 2021 04:26:03 -0700

Dropping individual labels isn't likely to make a huge difference, if 
you're still scraping the same set of timeseries.


The bag of labels is just what distinguishes one timeseries from another.  
It does have to be kept in memory, but it's static and doesn't use much RAM.

Dropping labels might even give you a short-term *increase* in RAM usage, 
as the timeseries with the old label set and the timeseries with the new 
label set are two different timeseries.

You're likely to see a bigger difference by reducing the number of 
timeseries you're scraping - either by changing the exporters to expose 
fewer metrics, or using metric relabelling to drop metrics which aren't of 
interest.
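
As a rough sketch of the metric relabelling approach, a plain prometheus.yml 
scrape job could look something like the fragment below. The job name, the 
target and the metric names in the regex are placeholders for illustration 
only, not a recommendation for your setup; check "Top 10 series count by 
metric names" on the TSDB status page before deciding what to drop.

```yaml
scrape_configs:
  - job_name: cadvisor-example          # hypothetical job name
    static_configs:
      - targets: ['localhost:8080']     # placeholder target
    metric_relabel_configs:
      # Drop whole metrics by name before they are ingested.
      # The metric names in this regex are examples only.
      - source_labels: [__name__]
        regex: 'container_tasks_state|container_memory_failures_total'
        action: drop
```

metric_relabel_configs is applied after the scrape but before ingestion, so 
the dropped timeseries never reach the TSDB. If you are deploying via 
kube-prometheus-stack, I believe the equivalent setting is metricRelabelings 
on the ServiceMonitor endpoint rather than editing prometheus.yml directly.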

On Thursday, 23 September 2021 at 10:21:41 UTC+1 [email protected] wrote:

> Thanks for the information. 
>
> In the meantime, we are trying to drop the label that uses the most memory 
> in our Prometheus, so we dropped the id label (test environment).
> However, even though we dropped the labels on all jobs, the memory usage is 
> still at 5Gi (the same as before). Will the drop in Prometheus memory usage 
> only be seen after a few hours? We saw the same behavior in a different 
> environment, UAT, where we dropped id but had to wait almost a day before 
> we saw memory drop in Grafana.
>
> Thank you.
>
> On Wed, Sep 8, 2021 at 4:41 PM Ben Kochie <[email protected]> wrote:
>
>> The things I'm currently working on:
>> * Disabling auto-scaling, or setting the auto-scaler minimums higher to 
>> avoid down-scaling when it's unnecessary.
>> * Using 
>> https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-configurable-scaling-behavior
>> to dampen up/down behavior
>> * Using https://keda.sh/ to use better metrics for auto-scaling controls
>> * Eliminating single-core pods by using worker pools for single-threaded 
>> languages like Python/Ruby/Node. Or re-writing services in Go / Java to 
>> make them multi-threaded.
>> * Increasing the node size to reduce the number of nodes per cluster.
>> * Dropping un-used / duplicate container metrics from cAdvisor (I'm 
>> working on a blog post about this)
>>
>> On Wed, Sep 8, 2021 at 9:20 AM patricia lee <[email protected]> wrote:
>>
>>> Hello Ben,
>>>
>>> Yes, our cluster setup uses heavy autoscaling and has a lot of pods with 
>>> one core or less (500m to 1000m CPU).
>>> May we ask how you resolved this for a heavily autoscaled cluster with 
>>> single-core pods?
>>>
>>> Appreciate your response.
>>>
>>>
>>> Btw, I ran promtool against our Prometheus, and these are the high-churn 
>>> labels (default config from kube-prometheus-stack)
>>>
>>> *Label pairs most involved in churning:*
>>> 59339 service=rancher-monitoring-kubelet
>>> 59339 job=kubelet
>>> 59339 endpoint=https-metrics
>>> 52002 metrics_path=/metrics/cadvisor
>>> 51475 namespace=cluster2
>>> 32853 job=kube-state-metrics
>>> 32849 service=rancher-monitoring-kube-state-metrics
>>> 32848 endpoint=http
>>> 24840 container=POD
>>> 17944 namespace=cattle-monitoring-system
>>> 15974 container=kube-state-metrics
>>> 15249 container=node-exporter
>>> 14683 job=node-exporter
>>> 14683 endpoint=metrics
>>> 14683 service=rancher-monitoring-prometheus-node-exporter
>>> 13879 namespace=kube-system
>>>
>>> *Label names most involved in churning:*
>>> 110756 __name__
>>> 109700 instance
>>> 109670 service
>>> 109670 endpoint
>>> 109670 job
>>> 107602 namespace
>>> 100450 pod
>>> 87636 container
>>> 64686 node
>>> 59339 metrics_path
>>> 51953 id
>>> 38376 image
>>> 37733 name
>>> 21466 device
>>> 10720 interface
>>> 9706 reason
>>> 6072 job_name
>>> 5418 le
>>> 4746 fstype
>>> 4746 mountpoint
>>>
>>>
>>>
>>> On Tue, Sep 7, 2021 at 10:39 PM Ben Kochie <[email protected]> wrote:
>>>
>>>> I don't know if this is still the case, but there are some label 
>>>> configurations in the Helm chart that lead to excessive labels on 
>>>> Kubernetes. This can lead to index/memory bloat.
>>>>
>>>> Most of the memory bloat I've seen in our production clusters lately 
>>>> has more to do with auto-scaling pod churn. If you're using heavy 
>>>> auto-scaling and lots of single-core pods, you'll end up bloating the 
>>>> metrics a lot.
>>>>
>>>> On Tue, Sep 7, 2021 at 3:51 PM Brian Candler <[email protected]> wrote:
>>>>
>>>>> Such a short retention is unlikely to help at all; WAL blocks have a 2 
>>>>> hour duration I think.
>>>>>
>>>>> Across some systems I have here, the average number of metrics per 
>>>>> node is 2366: this is the (expensive) query which gives it:
>>>>> avg(count by (instance) ({job="node"}))
>>>>>
>>>>> So with 1300 nodes that would be about 3 million metrics.  Quite a 
>>>>> lot, but not extraordinarily so.  I've seen recommendations to start 
>>>>> splitting Prometheus servers when you reach 2m.  There is a RAM 
>>>>> calculation tool here:
>>>>>
>>>>> https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>>>>>
>>>>> With 3m series and 1m unique label pairs, it still only comes out to 
>>>>> 8GB.  If you're needing much more than that, then you need to read and 
>>>>> understand the stats from the TSDB status page.  You can post them here 
>>>>> if you want help interpreting them.  And you need to understand what 
>>>>> queries (if any) are taking place against your database, since those 
>>>>> use RAM too.
>>>>>
>>>>> Looking at "Top 10 series count by metric names" in the Prometheus 
>>>>> Status page, in my case it's node_cpu_seconds_total{}.  If you don't 
>>>>> need the usage of each core individually, then you might be inclined 
>>>>> to drop it.
>>>>>
>>>>> You could also see if victoriametrics + vmagent works better for your 
>>>>> use case.
>>>>>  
>>>>> On Tuesday, 7 September 2021 at 13:57:48 UTC+1 [email protected] 
>>>>> wrote:
>>>>>
>>>>>> Thank you Brian for the reply. Yes, I mean hosts (nodes). 
>>>>>> What we have done in the meantime is set the retentionTime of 
>>>>>> Prometheus to 5 minutes (which I am not comfortable with), but we were 
>>>>>> advised by seniors to do so just so that we can continue.
>>>>>> Thanks for the information above, I'll check it out and try it on our 
>>>>>> cluster environment.
>>>>>>
>>>>>> On Tue, Sep 7, 2021 at 4:50 PM Brian Candler <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>>> It's not clear what you mean by "No. of Nodes" - whether you mean 
>>>>>>> hosts (e.g. which you're scraping using node_exporter), or pods, or 
>>>>>>> something else.  But what matters is the total number of metrics, and 
>>>>>>> the amount of metric churn, i.e. the rate at which new timeseries are 
>>>>>>> being created dynamically; and also how much querying is going on.
>>>>>>>
>>>>>>> If you go to the Prometheus web interface, Status > TSDB Status, you'll 
>>>>>>> get some statistics which may help you.  Consider:
>>>>>>>
>>>>>>> - collecting fewer metrics (by changing what you scrape, and/or 
>>>>>>> using metric_relabel_configs to drop some timeseries which are not of 
>>>>>>> interest)
>>>>>>>
>>>>>>> - see if it's possible to reduce timeseries churn.  For example, if 
>>>>>>> you have one application which is generating large numbers of 
>>>>>>> short-lived pods then you may wish to reduce or suppress the metrics 
>>>>>>> collected for those pods.
>>>>>>>
>>>>>>> - have a look at the PromQL queries being executed, and whether any 
>>>>>>> of these are using excessive amounts of RAM.  The query log 
>>>>>>> <https://prometheus.io/docs/guides/query-log/> may help.  You can 
>>>>>>> also apply limits to how much memory is used by individual queries using
>>>>>>>       --query.max-concurrency=20  # default
>>>>>>>       --query.max-samples=50000000  # default
>>>>>>> (although that may cause the offending queries to fail)
>>>>>>>
>>>>>>> There are also blog posts out there which you can turn up with a 
>>>>>>> search, e.g.
>>>>>>> https://source.coveo.com/2021/03/03/prometheus-memory/
>>>>>>>
>>>>>>> On Tuesday, 7 September 2021 at 07:34:51 UTC+1 [email protected] 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone, I am new here.
>>>>>>>>
>>>>>>>> I would like to seek some advice on the design approach we should 
>>>>>>>> take.
>>>>>>>> Given the problem below, and with cost in mind, how can we set up 
>>>>>>>> Prometheus for a large cluster?
>>>>>>>>
>>>>>>>> *Variables:*
>>>>>>>> *Installation: *kube-prometheus-stack Helm chart.
>>>>>>>> *Autoscale*: yes
>>>>>>>> *No. of Nodes*: 1000 up to 1300
>>>>>>>> *Mesh*: Istio
>>>>>>>> *Memory Usage:* 50GB (Still gets OOM)
>>>>>>>> *Installed: *1 Prometheus, 1 Kiali, 1 Grafana and 1 Jaeger
>>>>>>>>
>>>>>>>> *Issue:*
>>>>>>>> 1. We cannot move Prometheus to a larger node, as 60GB of memory is 
>>>>>>>> already expensive (cost not approved by management).
>>>>>>>> 2. Removing unnecessary metrics is not yet advised because we do 
>>>>>>>> not know which metrics of istio, jaeger and kiali are needed.
>>>>>>>>
>>>>>>>> *Tried solution:*
>>>>>>>> We have federated the single Prometheus instance with Thanos 
>>>>>>>> Receivers. However, the issue is still there, because Kiali queries 
>>>>>>>> its data directly from Prometheus, which eventually runs out of 
>>>>>>>> memory (OOM).
>>>>>>>>
>>>>>>>> *Question:*
>>>>>>>> We are thinking of firing up multiple Prometheus instances for each 
>>>>>>>> namespace and adding thanos-sidecar with the same scrape config, 
>>>>>>>> since Thanos will deduplicate all duplicated metrics. This approach 
>>>>>>>> would solve the issue for Grafana queries but not for Kiali. 
>>>>>>>>
>>>>>>>> How can we set up multiple Prometheus instances (low cost) but still 
>>>>>>>> have a single Prometheus instance for Kiali (whole cluster)?
>>>>>>>>
>>>>>>>> Appreciate any help. Thank you.
>>>>>>>>
>>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "Prometheus Users" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/prometheus-users/24a15533-094e-4a4c-9644-5d4375b6aaa2n%40googlegroups.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/prometheus-users/24a15533-094e-4a4c-9644-5d4375b6aaa2n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e5c1bb9d-79b9-4c5a-80c9-ea92ae0b5fcen%40googlegroups.com.
