Re: [prometheus-users] Re: Single Prometheus for Large Cluster

Ben Kochie Wed, 08 Sep 2021 01:41:29 -0700

The things I'm currently working on:
* Disabling auto-scaling, or setting the auto-scaler minimums higher to
avoid down-scaling when it's unnecessary.
* Using
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-configurable-scaling-behavior
to dampen up/down behavior
* Using https://keda.sh/ to use better metrics for auto-scaling controls
* Eliminating single-core pods by using worker pools for single-threaded
languages like Python/Ruby/Node. Or re-writing services in Go / Java to
make them multi-threaded.
* Increasing the node size to reduce the number of nodes per cluster.
* Dropping un-used / duplicate container metrics from cAdvisor (I'm working
on a blog post about this)
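
To make the last item a bit more concrete, here is a rough Python model of
what a "drop" rule in metric_relabel_configs does to a set of series. It is
only a sketch: the sample series are invented, apply_drop is a hypothetical
helper (not a Prometheus API), and container=POD is used purely because it
shows up in the churn list quoted below, not because it is known to be safe
to drop in your setup. Prometheus uses RE2 regexes, which Python's re module
only approximates, although for a literal like this they behave the same.

```python
import re

# Hypothetical label sets, loosely modelled on cAdvisor series; not real data.
series = [
    {"__name__": "container_cpu_usage_seconds_total", "container": "POD", "pod": "web-1"},
    {"__name__": "container_cpu_usage_seconds_total", "container": "app", "pod": "web-1"},
    {"__name__": "container_memory_rss", "container": "POD", "pod": "web-2"},
    {"__name__": "container_memory_rss", "container": "app", "pod": "web-2"},
]


def apply_drop(series, source_labels, regex, separator=";"):
    """Keep only series whose joined source label values do NOT fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in series:
        # Relabeling joins the source label values with the separator;
        # a missing label counts as an empty string.
        value = separator.join(labels.get(name, "") for name in source_labels)
        # Relabel regexes are anchored at both ends, hence fullmatch.
        if pattern.fullmatch(value) is None:
            kept.append(labels)
    return kept


kept = apply_drop(series, ["container"], "POD")
print(len(series), "->", len(kept))
```

Running the same kind of count against your real series list (before and
after a candidate rule) is a cheap way to see how much a rule would save
before changing the scrape config.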


On Wed, Sep 8, 2021 at 9:20 AM patricia lee <[email protected]> wrote:

> Hello Ben,
>
> Yes, our cluster setup relies heavily on autoscaling and has a lot of pods
> with one core or less (500m to 1000m CPU).
> May we know what approach you took for a heavily autoscaled cluster
> with single-core pods?
>
> Appreciate your response.
>
>
> Btw, I ran promtool against our Prometheus and these are the high-churn
> labels (default config from kube-prometheus-stack)
>
> *Label pairs most involved in churning:*
> 59339 service=rancher-monitoring-kubelet
> 59339 job=kubelet
> 59339 endpoint=https-metrics
> 52002 metrics_path=/metrics/cadvisor
> 51475 namespace=cluster2
> 32853 job=kube-state-metrics
> 32849 service=rancher-monitoring-kube-state-metrics
> 32848 endpoint=http
> 24840 container=POD
> 17944 namespace=cattle-monitoring-system
> 15974 container=kube-state-metrics
> 15249 container=node-exporter
> 14683 job=node-exporter
> 14683 endpoint=metrics
> 14683 service=rancher-monitoring-prometheus-node-exporter
> 13879 namespace=kube-system
>
> *Label names most involved in churning:*
> 110756 __name__
> 109700 instance
> 109670 service
> 109670 endpoint
> 109670 job
> 107602 namespace
> 100450 pod
> 87636 container
> 64686 node
> 59339 metrics_path
> 51953 id
> 38376 image
> 37733 name
> 21466 device
> 10720 interface
> 9706 reason
> 6072 job_name
> 5418 le
> 4746 fstype
> 4746 mountpoint
>
>
>
> On Tue, Sep 7, 2021 at 10:39 PM Ben Kochie <[email protected]> wrote:
>
>> I don't know if this is still the case, but there are some label
>> configurations in the helm chart that lead to excessive labels on
>> Kubernetes. This can lead to index/memory bloat.
>>
>> Most of the memory bloat I've seen in our production clusters lately has
>> more to do with auto-scaling pod churn. If you're using heavy
>> auto-scaling and lots of single-core pods, you'll end up bloating the
>> metrics a lot.
>>
>> On Tue, Sep 7, 2021 at 3:51 PM Brian Candler <[email protected]> wrote:
>>
>>> Such a short retention is unlikely to help at all; WAL blocks have a 2
>>> hour duration I think.
>>>
>>> Across some systems I have here, the average number of metrics per node
>>> is 2366: this is the (expensive) query which gives it:
>>> avg(count by (instance) ({job="node"}))
>>>
>>> So with 1300 nodes that would be about 3 million metrics.  Quite a lot,
>>> but not extraordinarily so.  I've seen recommendations to start splitting
>>> Prometheus servers when you reach 2m.  There is a RAM calculation tool here:
>>>
>>> https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>>> With 3m series and 1m unique label pairs, it still only comes out to
>>> 8GB.  If you're needing much more than that, then you need to read and
>>> understand the stats from the TSDB status page.  You can post them here if
>>> you want help interpreting them.  And you need to understand what queries
>>> (if any) are taking place against your database, since those use RAM too.
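
The series estimate above is plain multiplication. As a quick check in
Python (a sketch only: 2366 is the per-node average quoted above from a
different set of systems, so the real figure for this cluster may differ
a lot, and variable names are illustrative):

```python
nodes = 1300                 # upper end of the node count given in the thread
avg_series_per_node = 2366   # per-node average quoted above; an assumption here

estimated_series = nodes * avg_series_per_node
split_threshold = 2_000_000  # the "start splitting" guideline mentioned above

print(f"estimated series: {estimated_series:,}")
print(f"over the 2m guideline: {estimated_series > split_threshold}")
```

This only covers node_exporter-style host metrics; pod, cAdvisor and Istio
series come on top, so the TSDB status page remains the number to trust.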
>>>
>>> Looking at "Top 10 series count by metric names" in the Prometheus
>>> Status page, in my case the top metric is node_cpu_seconds_total{}.
>>> If you don't require the usage of each core individually, then you
>>> might be inclined to drop it.
>>>
>>> You could also see if victoriametrics + vmagent works better for your
>>> use case.
>>>
>>> On Tuesday, 7 September 2021 at 13:57:48 UTC+1 [email protected] wrote:
>>>
>>>> Thank you Brian for the reply. Yes I mean host (nodes).
>>>> What we have done in the meantime is set the retentionTime of
>>>> Prometheus to 5 minutes (which I am not comfortable with), but we were
>>>> advised by seniors to do so just so that we can continue.
>>>> Thanks for the information above, I'll check it out and try it on our
>>>> cluster environment.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 7, 2021 at 4:50 PM Brian Candler <[email protected]> wrote:
>>>>
>>>>> It's not clear what you mean by "No. of Nodes" - whether you mean
>>>>> hosts (e.g. which you're scraping using node_exporter), or pods, or
>>>>> something else.  But what matters is the total number of metrics, and the
>>>>> amount of metric churn,  i.e. the rate at which new timeseries are being
>>>>> created dynamically; and also how much querying is going on.
>>>>>
>>>>> If you go to the Prometheus web interface, Status > TSDB Status,
>>>>> you'll get some statistics which may help you.  Consider:
>>>>>
>>>>> - collecting fewer metrics (by changing what you scrape, and/or using
>>>>> metric_relabel_configs to drop some timeseries which are not of interest)
>>>>>
>>>>> - see if it's possible to reduce timeseries churn.  For example, if
>>>>> you have one application which is generating large numbers of short-lived
>>>>> pods then you may wish to reduce or suppress the metrics collected for
>>>>> those pods.
>>>>>
>>>>> - have a look at the PromQL queries being executed, and whether any of
>>>>> these are using excessive amounts of RAM.  The query log
>>>>> <https://prometheus.io/docs/guides/query-log/> may help.  You can
>>>>> also apply limits to how much memory is used by individual queries using
>>>>>       --query.max-concurrency=20  # default
>>>>>       --query.max-samples=50000000  # default
>>>>> (although that may cause the offending queries to fail)
>>>>>
>>>>> There are also blog posts out there which you can turn up with a
>>>>> search, e.g.
>>>>> https://source.coveo.com/2021/03/03/prometheus-memory/
>>>>>
>>>>> On Tuesday, 7 September 2021 at 07:34:51 UTC+1 [email protected]
>>>>> wrote:
>>>>>
>>>>>> Hi everyone, I am new here.
>>>>>>
>>>>>> I would like to seek some advice on the design approach we should
>>>>>> take.
>>>>>> Given the problem below, and keeping cost in mind, how can we set up
>>>>>> Prometheus for a large cluster?
>>>>>>
>>>>>> *Variables:*
>>>>>> *Installation: *kube-prometheus-stack helm chart.
>>>>>> *Autoscale*: yes
>>>>>> *No. of Nodes*: 1000 up to 1300
>>>>>> *Mesh*: Istio
>>>>>> *Memory Usage:* 50GB (Still gets OOM)
>>>>>> *Installed: *1 Prometheus, 1 Kiali, 1 Grafana and 1 Jaeger
>>>>>>
>>>>>> *Issue:*
>>>>>> 1. We cannot move Prometheus to a larger node, as 60GB of memory is
>>>>>> already expensive.  (cost not approved by management)
>>>>>> 2. Removing unnecessary metrics is not yet advised because we do not
>>>>>> know which metrics of istio, jaeger and kiali are needed.
>>>>>>
>>>>>> *Tried solution:*
>>>>>> We have federated the single instance of prometheus with Thanos
>>>>>> Receivers, however, the issue is still there because kiali queries its
>>>>>> data directly from prometheus which eventually gets OOM.
>>>>>>
>>>>>> *Question:*
>>>>>> We are thinking of firing up multiple prometheus for each namespace
>>>>>> and adding thanos-sidecar with the same scrape config since thanos will
>>>>>> deduplicate all duplicated metrics. This approach would solve the issue
>>>>>> in Grafana queries but not in Kiali.
>>>>>>
>>>>>> How can we set up multiple (low cost) Prometheus instances while
>>>>>> still giving Kiali a single Prometheus for the whole cluster?
>>>>>>
>>>>>> Appreciate any help. Thank you.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/24a15533-094e-4a4c-9644-5d4375b6aaa2n%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/prometheus-users/24a15533-094e-4a4c-9644-5d4375b6aaa2n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>

