Hello Brian,

After leaving Prometheus running for 16 hours with 5 minutes retention (my seniors' advice), memory usage started at 22 GB, had already reached 39 GB after 16 hours, and might still increase. We checked the TSDB status page and found that the label with the highest memory usage is id and the metric name with the highest series count is kubelet_runtime_operations_duration_seconds_bucket. I'll ask the seniors on our team whether we can drop the id label and kubelet_runtime_operations_duration_seconds_bucket to see if that reduces the memory consumption of our Prometheus.
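Before asking, I tried a rough estimate of how much of the block that metric actually accounts for. This is only a back-of-the-envelope sketch in Python: the numbers are copied by hand from the tsdb analyze results below, the names TOTAL_SERIES, SERIES_BY_METRIC and share_of_series are just illustrative, and series count is only a rough proxy for memory.

```python
# Numbers copied from the tsdb analyze output for block 01FF1WDW4PH937C3XT2E9R621K.
TOTAL_SERIES = 4_434_558

# Top five entries of "Highest cardinality metric names" (series per metric name).
SERIES_BY_METRIC = {
    "kubelet_runtime_operations_duration_seconds_bucket": 178_836,
    "container_tasks_state": 161_805,
    "storage_operation_duration_seconds_bucket": 142_212,
    "container_memory_failures_total": 129_444,
    "kubelet_docker_operations_duration_seconds_bucket": 121_212,
}


def share_of_series(series: int, total: int = TOTAL_SERIES) -> float:
    """Return the percentage of all series in the block that `series` represents."""
    return 100.0 * series / total


if __name__ == "__main__":
    for name, count in SERIES_BY_METRIC.items():
        print(f"{name}: {count} series, {share_of_series(count):.1f}% of the block")
    top5 = sum(SERIES_BY_METRIC.values())
    print(f"top 5 combined: {top5} series, {share_of_series(top5):.1f}% of the block")
```

By these numbers the kubelet histogram alone is roughly 4% of the series in the block, and the top five metric names together are about 16.5%, so we may need to drop more than one metric before the difference is clearly visible.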
I ran tsdb analyze in the prometheus itself, here are the results as well.

Block ID: 01FF1WDW4PH937C3XT2E9R621K
Duration: 2h0m0s
Series: 4434558
Label names: 311
Postings (unique label pairs): 122598
Postings entries (total label pairs): 47468088

*Label pairs most involved in churning:*
59339 service=rancher-monitoring-kubelet
59339 job=kubelet
59339 endpoint=https-metrics
52002 metrics_path=/metrics/cadvisor
51475 namespace=cluster2
32853 job=kube-state-metrics
32849 service=rancher-monitoring-kube-state-metrics
32848 endpoint=http
24840 container=POD
17944 namespace=cattle-monitoring-system
15974 container=kube-state-metrics
15249 container=node-exporter
14683 job=node-exporter
14683 endpoint=metrics
14683 service=rancher-monitoring-prometheus-node-exporter
13879 namespace=kube-system

*Label names most involved in churning:*
110756 __name__
109700 instance
109670 service
109670 endpoint
109670 job
107602 namespace
100450 pod
87636 container
64686 node
59339 metrics_path
51953 id
38376 image
37733 name
21466 device
10720 interface
9706 reason
6072 job_name
5418 le
4746 fstype
4746 mountpoint

*Label names with highest cumulative label value length:*
2690572 id
1727227 name
812271 container_id
333072 uid
298590 pod
162072 pod_uid
101609 address
67431 pod_ip
63985 replicaset
63634 device
58812 interface
57241 owner_name
54539 image
50714 __name__
45383 node
45334 nodename
45334 label_kubernetes_io_hostname
41997 image_id
41844 created_by_name
39312 provider_id

*Highest cardinality labels:*
26763 id
17988 name
11127 container_id
9252 uid
9249 pod
5977 address
5022 pod_ip
4502 pod_uid
4203 interface
4135 device
1987 instance
1773 owner_name
1741 replicaset
1741 label_pod_template_hash
1422 __name__
1164 created_by_name
937 node
937 host_ip
936 label_kubernetes_io_hostname
936 nodename

*Highest cardinality metric names:*
178836 kubelet_runtime_operations_duration_seconds_bucket
161805 container_tasks_state
142212 storage_operation_duration_seconds_bucket
129444 container_memory_failures_total
121212 kubelet_docker_operations_duration_seconds_bucket
67739 kube_pod_container_status_waiting_reason
59724 kubelet_http_requests_duration_seconds_bucket
58062 kube_pod_container_status_terminated_reason
58062 kube_pod_container_status_last_terminated_reason
50292 rest_client_request_duration_seconds_bucket
46260 kube_pod_status_phase
44709 kubelet_runtime_operations_latency_microseconds
39645 container_network_receive_packets_dropped_total
39645 container_network_transmit_bytes_total
39645 container_network_transmit_errors_total
39645 container_network_transmit_packets_total
39645 container_network_receive_packets_total
39645 container_network_transmit_packets_dropped_total
39645 container_network_receive_bytes_total
39645 container_network_receive_errors_total

On Tue, Sep 7, 2021 at 9:51 PM Brian Candler <[email protected]> wrote:

> Such a short retention is unlikely to help at all; WAL blocks have a 2
> hour duration I think.
>
> Across some systems I have here, the average number of metrics per node is
> 2366: this is the (expensive) query which gives it:
> avg(count by (instance) ({job="node"}))
>
> So with 1300 nodes that would be about 3 million metrics. Quite a lot,
> but not extraordinarily so. I've seen recommendations to start splitting
> Prometheus servers when you reach 2m. There is a RAM calculation tool here:
>
> https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
> With 3m series and 1m unique label pairs, it still only comes out to 8GB.
> If you're needing much more than that, then you need to read and understand
> the stats from the TSDB status page. You can post them here if you want
> help interpreting them. And you need to understand what queries (if any)
> are taking place against your database, since those use RAM too.
>
> Looking at "Top 10 series count by metric names" in the Prometheus Status
> page, in my case it's node_cpu_seconds_total{}.
> If you don't require the usage of each core
> individually, then you might be inclined to drop it.
>
> You could also see if victoriametrics + vmagent works better for your use
> case.
>
> On Tuesday, 7 September 2021 at 13:57:48 UTC+1 [email protected] wrote:
>
>> Thank you Brian for the reply. Yes, I mean hosts (nodes).
>> What we have done for the meantime is set the retentionTime of
>> Prometheus to 5 minutes (which I am not comfortable with), but we were
>> advised by our seniors to do so just so that we can continue.
>> Thanks for the information above, I'll check it out and try it on our
>> cluster environment.
>>
>> On Tue, Sep 7, 2021 at 4:50 PM Brian Candler <[email protected]> wrote:
>>
>>> It's not clear what you mean by "No. of Nodes" - whether you mean hosts
>>> (e.g. which you're scraping using node_exporter), or pods, or something
>>> else. But what matters is the total number of metrics, and the amount of
>>> metric churn, i.e. the rate at which new timeseries are being created
>>> dynamically; and also how much querying is going on.
>>>
>>> If you go to Prometheus web interface, Status > TSDB Status, you'll get
>>> some statistics which may help you. Consider:
>>>
>>> - collecting fewer metrics (by changing what you scrape, and/or using
>>> metric_relabel_configs to drop some timeseries which are not of interest)
>>>
>>> - see if it's possible to reduce timeseries churn. For example, if you
>>> have one application which is generating large numbers of short-lived pods
>>> then you may wish to reduce or suppress the metrics collected for those
>>> pods.
>>>
>>> - have a look at the PromQL queries being executed, and whether any of
>>> these are using excessive amounts of RAM. The query log
>>> <https://prometheus.io/docs/guides/query-log/> may help.
>>> You can also
>>> apply limits to how much memory is used by individual queries using
>>> --query.max-concurrency=20 # default
>>> --query.max-samples=50000000 # default
>>> (although that may cause the offending queries to fail)
>>>
>>> There are also blog posts out there which you can turn up with a search,
>>> e.g.
>>> https://source.coveo.com/2021/03/03/prometheus-memory/
>>>
>>> On Tuesday, 7 September 2021 at 07:34:51 UTC+1 [email protected] wrote:
>>>
>>>> Hi everyone, I am new here.
>>>>
>>>> I would like to seek some advice on the design approach we should take.
>>>> Given the problem below, and in terms of cost, how can we set up
>>>> Prometheus with a large cluster?
>>>>
>>>> *Variables:*
>>>> *Installation:* kube-prometheus-stack Helm chart.
>>>> *Autoscale:* yes
>>>> *No. of Nodes:* 1000 up to 1300
>>>> *Mesh:* Istio
>>>> *Memory Usage:* 50GB (still gets OOM)
>>>> *Installed:* 1 Prometheus, 1 Kiali, 1 Grafana and 1 Jaeger
>>>>
>>>> *Issue:*
>>>> 1. We cannot move Prometheus to a larger node, as 60GB of memory is
>>>> already expensive (cost not approved by management).
>>>> 2. Removing unnecessary metrics is not yet advised because we do not
>>>> know which metrics of Istio, Jaeger and Kiali are needed.
>>>>
>>>> *Tried solution:*
>>>> We have federated the single instance of Prometheus with Thanos
>>>> Receivers; however, the issue is still there because Kiali queries its data
>>>> directly from Prometheus, which eventually gets OOM.
>>>>
>>>> *Question:*
>>>> We are thinking of firing up multiple Prometheus instances, one for each
>>>> namespace, and adding thanos-sidecar with the same scrape config, since
>>>> Thanos will deduplicate all duplicated metrics. This approach would solve
>>>> the issue for Grafana queries but not for Kiali.
>>>>
>>>> How can we set up multiple Prometheus instances (low cost) but a single
>>>> Prometheus instance for Kiali (whole cluster)?
>>>>
>>>> Appreciate any help. Thank you.
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Prometheus Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/prometheus-users/24a15533-094e-4a4c-9644-5d4375b6aaa2n%40googlegroups.com

