The things I'm currently working on:

* Disabling auto-scaling, or setting the auto-scaler minimums higher to
  avoid down-scaling when it's unnecessary.
* Using
  https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-configurable-scaling-behavior
  to dampen scale-up/scale-down behavior.
* Using https://keda.sh/ to drive auto-scaling from better metrics.
* Eliminating single-core pods by using worker pools for single-threaded
  languages like Python/Ruby/Node, or rewriting services in Go / Java to
  make them multi-threaded.
* Increasing the node size to reduce the number of nodes per cluster.
* Dropping unused / duplicate container metrics from cAdvisor (I'm working
  on a blog post about this).

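To make the reasoning behind the first and fourth items concrete, here is a rough back-of-envelope sketch of how pod churn multiplies the series Prometheus has to hold in memory. The function name and every number in it are hypothetical placeholders chosen for illustration, not measurements from any cluster in this thread, and the model is deliberately simplified (it assumes each replacement pod brings a full new set of series that stays in the head block for the whole window).

```python
# Back-of-envelope estimate of how pod churn inflates the number of series
# held in the Prometheus head block. All figures are hypothetical.


def head_series(replicas: int, series_per_pod: int, restarts_per_window: int) -> int:
    """Approximate series held in memory for one workload.

    Every replacement pod gets a new `pod` label value, so each restart
    adds a fresh set of series next to the old ones until the head block
    is truncated.
    """
    generations = 1 + restarts_per_window
    return replicas * series_per_pod * generations


# 16 single-core pods, each replaced 5 times by the autoscaler in the window.
churny = head_series(replicas=16, series_per_pod=100, restarts_per_window=5)

# Same capacity as 4 four-core pods with a raised autoscaler minimum,
# so nothing is replaced during the window.
stable = head_series(replicas=4, series_per_pod=100, restarts_per_window=0)

print(churny, stable)  # 9600 400
```

Under these assumed numbers the churny layout holds 24 times as many series for the same CPU capacity, which is the effect that fewer, larger, longer-lived pods are meant to reduce.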
On Wed, Sep 8, 2021 at 9:20 AM patricia lee <[email protected]> wrote:

> Hello Ben,
>
> Yes, our cluster setup is heavily auto-scaled, with a lot of pods at one
> core or less (500m to 1000m CPU).
> May we know what resolution you took for a heavily auto-scaled cluster
> with single-core pods?
>
> Appreciate your response.
>
> Btw, I ran promtool against our Prometheus and these are the high-churn
> labels (default config from kube-prometheus-stack):
>
> *Label pairs most involved in churning:*
> 59339 service=rancher-monitoring-kubelet
> 59339 job=kubelet
> 59339 endpoint=https-metrics
> 52002 metrics_path=/metrics/cadvisor
> 51475 namespace=cluster2
> 32853 job=kube-state-metrics
> 32849 service=rancher-monitoring-kube-state-metrics
> 32848 endpoint=http
> 24840 container=POD
> 17944 namespace=cattle-monitoring-system
> 15974 container=kube-state-metrics
> 15249 container=node-exporter
> 14683 job=node-exporter
> 14683 endpoint=metrics
> 14683 service=rancher-monitoring-prometheus-node-exporter
> 13879 namespace=kube-system
>
> *Label names most involved in churning:*
> 110756 __name__
> 109700 instance
> 109670 service
> 109670 endpoint
> 109670 job
> 107602 namespace
> 100450 pod
> 87636 container
> 64686 node
> 59339 metrics_path
> 51953 id
> 38376 image
> 37733 name
> 21466 device
> 10720 interface
> 9706 reason
> 6072 job_name
> 5418 le
> 4746 fstype
> 4746 mountpoint
>
> On Tue, Sep 7, 2021 at 10:39 PM Ben Kochie <[email protected]> wrote:
>
>> I don't know if this is still the case, but there are some label
>> configurations in the helm chart that lead to excessive labels on
>> Kubernetes. This can lead to index/memory bloat.
>>
>> Most of the memory bloat I've seen in our production clusters lately has
>> more to do with auto-scaling pod churn. If you're using heavy
>> auto-scaling and lots of single-core pods, you'll end up bloating the
>> metrics a lot.
>>
>> On Tue, Sep 7, 2021 at 3:51 PM Brian Candler <[email protected]> wrote:
>>
>>> Such a short retention is unlikely to help at all; WAL blocks have a 2
>>> hour duration I think.
>>>
>>> Across some systems I have here, the average number of metrics per node
>>> is 2366. This is the (expensive) query which gives it:
>>>
>>> avg(count by (instance) ({job="node"}))
>>>
>>> So with 1300 nodes that would be about 3 million metrics. Quite a lot,
>>> but not extraordinarily so. I've seen recommendations to start splitting
>>> Prometheus servers when you reach 2m. There is a RAM calculation tool
>>> here:
>>>
>>> https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>>>
>>> With 3m series and 1m unique label pairs, it still only comes out to
>>> 8GB. If you need much more than that, then you need to read and
>>> understand the stats from the TSDB status page. You can post them here if
>>> you want help interpreting them. And you need to understand what queries
>>> (if any) are taking place against your database, since those use RAM too.
>>>
>>> Looking at "Top 10 series count by metric names" in the Prometheus
>>> Status page, in my case it's node_cpu_seconds_total{}. If you don't
>>> need the usage of each core individually, then you might be inclined
>>> to drop it.
>>>
>>> You could also see if victoriametrics + vmagent works better for your
>>> use case.
>>>
>>> On Tuesday, 7 September 2021 at 13:57:48 UTC+1 [email protected] wrote:
>>>
>>>> Thank you Brian for the reply. Yes, I mean hosts (nodes).
>>>> What we have done for the meantime is set the retentionTime of
>>>> Prometheus to 5 minutes (which I am not comfortable with), but we were
>>>> advised by seniors to do so just so we can continue.
>>>> Thanks for the information above, I'll check it out and try it on our
>>>> cluster environment.
>>>>
>>>> On Tue, Sep 7, 2021 at 4:50 PM Brian Candler <[email protected]> wrote:
>>>>
>>>>> It's not clear what you mean by "No. of Nodes" - whether you mean
>>>>> hosts (e.g. which you're scraping using node_exporter), or pods, or
>>>>> something else. But what matters is the total number of metrics, and
>>>>> the amount of metric churn, i.e. the rate at which new timeseries are
>>>>> being created dynamically; and also how much querying is going on.
>>>>>
>>>>> If you go to the Prometheus web interface, Status > TSDB Status,
>>>>> you'll get some statistics which may help you. Consider:
>>>>>
>>>>> - collecting fewer metrics (by changing what you scrape, and/or using
>>>>> metric_relabel_configs to drop some timeseries which are not of
>>>>> interest)
>>>>>
>>>>> - seeing if it's possible to reduce timeseries churn. For example, if
>>>>> you have one application which is generating large numbers of
>>>>> short-lived pods then you may wish to reduce or suppress the metrics
>>>>> collected for those pods.
>>>>>
>>>>> - having a look at the PromQL queries being executed, and whether any
>>>>> of these are using excessive amounts of RAM. The query log
>>>>> <https://prometheus.io/docs/guides/query-log/> may help. You can
>>>>> also apply limits to how much memory is used by individual queries
>>>>> using
>>>>> --query.max-concurrency=20 # default
>>>>> --query.max-samples=50000000 # default
>>>>> (although that may cause the offending queries to fail)
>>>>>
>>>>> There are also blog posts out there which you can turn up with a
>>>>> search, e.g.
>>>>> https://source.coveo.com/2021/03/03/prometheus-memory/
>>>>>
>>>>> On Tuesday, 7 September 2021 at 07:34:51 UTC+1 [email protected]
>>>>> wrote:
>>>>>
>>>>>> Hi everyone, I am new here.
>>>>>>
>>>>>> I would like to seek some advice on the design approach we should
>>>>>> take.
>>>>>> Given the problem below, how can we set up Prometheus
>>>>>> cost-effectively for a large cluster?
>>>>>>
>>>>>> *Variables:*
>>>>>> *Installation:* kube-prometheus-stack helm chart.
>>>>>> *Autoscale:* yes
>>>>>> *No. of Nodes:* 1000 up to 1300
>>>>>> *Mesh:* Istio
>>>>>> *Memory Usage:* 50GB (Still gets OOM)
>>>>>> *Installed:* 1 Prometheus, 1 Kiali, 1 Grafana and 1 Jaeger
>>>>>>
>>>>>> *Issue:*
>>>>>> 1. We cannot move Prometheus to a larger node, as 60GB memory is
>>>>>> already expensive (cost not approved by management).
>>>>>> 2. Removing unnecessary metrics is not yet advised because we do not
>>>>>> know which metrics of Istio, Jaeger and Kiali are needed.
>>>>>>
>>>>>> *Tried solution:*
>>>>>> We have federated the single instance of Prometheus with Thanos
>>>>>> Receivers. However, the issue is still there because Kiali queries
>>>>>> its data directly from Prometheus, which eventually gets OOM.
>>>>>>
>>>>>> *Question:*
>>>>>> We are thinking of firing up multiple Prometheus instances, one for
>>>>>> each namespace, and adding thanos-sidecar with the same scrape config,
>>>>>> since Thanos will deduplicate all duplicated metrics. This approach
>>>>>> would solve the issue for Grafana queries but not for Kiali.
>>>>>>
>>>>>> How can we set up multiple (low-cost) Prometheus instances but still
>>>>>> present a single Prometheus instance to Kiali (whole cluster)?
>>>>>>
>>>>>> Appreciate any help. Thank you.
>>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/24a15533-094e-4a4c-9644-5d4375b6aaa2n%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/prometheus-users/24a15533-094e-4a4c-9644-5d4375b6aaa2n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .

