Re: [prometheus-users] Re: Single Prometheus for Large Cluster

patricia lee Wed, 08 Sep 2021 00:03:59 -0700

Hello Brian,

After running Prometheus for 16 hours with 5-minute retention (my seniors'
advice), memory usage started at 22 GB, had already reached 39 GB after
16 hours, and may still be increasing.
We checked the TSDB status page and found that the label with the highest
memory usage is id, and the metric name with the highest series count is
kubelet_runtime_operations_duration_seconds_bucket.
I'll ask the seniors on our team whether we can drop the id label
and kubelet_runtime_operations_duration_seconds_bucket to see if that
reduces the memory consumption of our Prometheus.
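
Before proposing the change, it may help to see what dropping a metric and
dropping a label would each do to the series. Below is a minimal Python
sketch that only mimics the `drop` and `labeldrop` actions of
metric_relabel_configs; it is not Prometheus code. The helper names and the
sample label values (node-a, the /kubepods/... ids) are made up for
illustration. One thing it shows: the Prometheus docs warn that after a
labeldrop the series must still be uniquely labeled, and removing id could
make two container series collapse into one identical label set.

```python
import re
from collections import Counter


def apply_metric_relabel(series, drop_name_regex=None, labeldrop_regex=None):
    """Mimic two relabel actions on a list of label dicts (hypothetical helper)."""
    kept = []
    for labels in series:
        # action: drop on __name__; Prometheus regexes are fully anchored,
        # so fullmatch is the closest equivalent.
        if drop_name_regex and re.fullmatch(drop_name_regex, labels.get("__name__", "")):
            continue
        # action: labeldrop removes every label whose NAME matches the regex.
        if labeldrop_regex:
            labels = {k: v for k, v in labels.items()
                      if not re.fullmatch(labeldrop_regex, k)}
        kept.append(labels)
    return kept


def duplicates(series):
    """Return label sets that occur more than once after relabeling."""
    counts = Counter(tuple(sorted(labels.items())) for labels in series)
    return [dict(key) for key, n in counts.items() if n > 1]


# Made-up sample series; only the metric and label names come from our output.
sample = [
    {"__name__": "kubelet_runtime_operations_duration_seconds_bucket",
     "instance": "node-a", "le": "0.5"},
    {"__name__": "container_tasks_state", "instance": "node-a",
     "id": "/kubepods/pod1/c1", "state": "running"},
    {"__name__": "container_tasks_state", "instance": "node-a",
     "id": "/kubepods/pod1/c2", "state": "running"},
]

after = apply_metric_relabel(
    sample,
    drop_name_regex="kubelet_runtime_operations_duration_seconds_bucket",
    labeldrop_regex="id",
)
print(len(after), "series kept")
print(len(duplicates(after)), "label set(s) now collide")
```

In this toy input the two container_tasks_state series differ only by id, so
they collide once id is removed. Whether that happens in our real data
depends on which other labels (pod, container, name, ...) are present, so we
would need to check that before using labeldrop.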


I also ran tsdb analyze on the Prometheus server itself; here are the results.

Block ID: 01FF1WDW4PH937C3XT2E9R621K
Duration: 2h0m0s
Series: 4434558
Label names: 311
Postings (unique label pairs): 122598
Postings entries (total label pairs): 47468088

*Label pairs most involved in churning:*
59339 service=rancher-monitoring-kubelet
59339 job=kubelet
59339 endpoint=https-metrics
52002 metrics_path=/metrics/cadvisor
51475 namespace=cluster2
32853 job=kube-state-metrics
32849 service=rancher-monitoring-kube-state-metrics
32848 endpoint=http
24840 container=POD
17944 namespace=cattle-monitoring-system
15974 container=kube-state-metrics
15249 container=node-exporter
14683 job=node-exporter
14683 endpoint=metrics
14683 service=rancher-monitoring-prometheus-node-exporter
13879 namespace=kube-system

*Label names most involved in churning:*
110756 __name__
109700 instance
109670 service
109670 endpoint
109670 job
107602 namespace
100450 pod
87636 container
64686 node
59339 metrics_path
51953 id
38376 image
37733 name
21466 device
10720 interface
9706 reason
6072 job_name
5418 le
4746 fstype
4746 mountpoint


*Label names with highest cumulative label value length:*
2690572 id
1727227 name
812271 container_id
333072 uid
298590 pod
162072 pod_uid
101609 address
67431 pod_ip
63985 replicaset
63634 device
58812 interface
57241 owner_name
54539 image
50714 __name__
45383 node
45334 nodename
45334 label_kubernetes_io_hostname
41997 image_id
41844 created_by_name
39312 provider_id

*Highest cardinality labels:*
26763 id
17988 name
11127 container_id
9252 uid
9249 pod
5977 address
5022 pod_ip
4502 pod_uid
4203 interface
4135 device
1987 instance
1773 owner_name
1741 replicaset
1741 label_pod_template_hash
1422 __name__
1164 created_by_name
937 node
937 host_ip
936 label_kubernetes_io_hostname
936 nodename

*Highest cardinality metric names:*
178836 kubelet_runtime_operations_duration_seconds_bucket
161805 container_tasks_state
142212 storage_operation_duration_seconds_bucket
129444 container_memory_failures_total
121212 kubelet_docker_operations_duration_seconds_bucket
67739 kube_pod_container_status_waiting_reason
59724 kubelet_http_requests_duration_seconds_bucket
58062 kube_pod_container_status_terminated_reason
58062 kube_pod_container_status_last_terminated_reason
50292 rest_client_request_duration_seconds_bucket
46260 kube_pod_status_phase
44709 kubelet_runtime_operations_latency_microseconds
39645 container_network_receive_packets_dropped_total
39645 container_network_transmit_bytes_total
39645 container_network_transmit_errors_total
39645 container_network_transmit_packets_total
39645 container_network_receive_packets_total
39645 container_network_transmit_packets_dropped_total
39645 container_network_receive_bytes_total
39645 container_network_receive_errors_total
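
To put these numbers in proportion, here is a small Python sketch that
computes what share of the block's series each of the top metric names
accounts for. It assumes the counts under "Highest cardinality metric names"
are series counts for this 2h block and that "Series" is the block total;
the head block may look different. The share() helper is just a name I
picked.

```python
TOTAL_SERIES = 4434558  # "Series" line of the block above

# Top five entries copied from "Highest cardinality metric names".
top_metrics = {
    "kubelet_runtime_operations_duration_seconds_bucket": 178836,
    "container_tasks_state": 161805,
    "storage_operation_duration_seconds_bucket": 142212,
    "container_memory_failures_total": 129444,
    "kubelet_docker_operations_duration_seconds_bucket": 121212,
}


def share(count, total=TOTAL_SERIES):
    """Percentage of all series in the block."""
    return 100.0 * count / total


for name, count in top_metrics.items():
    print(f"{share(count):5.2f}%  {name}")

top5 = sum(top_metrics.values())
print(f"{share(top5):5.2f}%  top 5 combined")
```

By this estimate the top metric alone is only about 4% of the series and the
top five together about 16.5%, so dropping a single metric name is unlikely
to change memory usage much on its own.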


On Tue, Sep 7, 2021 at 9:51 PM Brian Candler <[email protected]> wrote:

> Such a short retention is unlikely to help at all; WAL blocks have a 2
> hour duration I think.
>
> Across some systems I have here, the average number of metrics per node is
> 2366: this is the (expensive) query which gives it:
> avg(count by (instance) ({job="node"}))
>
> So with 1300 nodes that would be about 3 million metrics.  Quite a lot,
> but not extraordinarily so.  I've seen recommendations to start splitting
> Prometheus servers when you reach 2m.  There is a RAM calculation tool here:
>
> https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
> With 3m series and 1m unique label pairs, it still only comes out to 8GB.
> If you're needing much more than that, then you need to read and understand
> the stats from the TSDB status page.  You can post them here if you want
> help interpreting them.  And you need to understand what queries (if any)
> are taking place against your database, since those use RAM too.
>
> Looking at "Top 10 series count by metric names" in the Prometheus Status
> page, in my case it's node_cpu_seconds_total{}.  If you don't require the
> usage of each core individually, then you might be inclined to drop it.
>
> You could also see if victoriametrics + vmagent works better for your use
> case.
>
> On Tuesday, 7 September 2021 at 13:57:48 UTC+1 [email protected] wrote:
>
>> Thank you Brian for the reply. Yes I mean host (nodes).
>> What we have done in the meantime is set the retentionTime of
>> Prometheus to 5 minutes (which I am not comfortable with), as advised by
>> our seniors, just so that we can continue.
>> Thanks for the information above; I'll check it out and try it on our
>> cluster environment.
>>
>>
>>
>>
>>
>>
>> On Tue, Sep 7, 2021 at 4:50 PM Brian Candler <[email protected]> wrote:
>>
>>> It's not clear what you mean by "No. of Nodes" - whether you mean hosts
>>> (e.g. which you're scraping using node_exporter), or pods, or something
>>> else.  But what matters is the total number of metrics, and the amount of
>>> metric churn,  i.e. the rate at which new timeseries are being created
>>> dynamically; and also how much querying is going on.
>>>
>>> If you go to the Prometheus web interface, Status > TSDB Status, you'll get
>>> some statistics which may help you.  Consider:
>>>
>>> - collecting fewer metrics (by changing what you scrape, and/or using
>>> metric_relabel_configs to drop some timeseries which are not of interest)
>>>
>>> - see if it's possible to reduce timeseries churn.  For example, if you
>>> have one application which is generating large numbers of short-lived pods
>>> then you may wish to reduce or suppress the metrics collected for those
>>> pods.
>>>
>>> - have a look at the PromQL queries being executed, and whether any of
>>> these are using excessive amounts of RAM.  The query log
>>> <https://prometheus.io/docs/guides/query-log/> may help.  You can also
>>> apply limits to how much memory is used by individual queries using
>>>       --query.max-concurrency=20  # default
>>>       --query.max-samples=50000000  # default
>>> (although that may cause the offending queries to fail)
>>>
>>> There are also blog posts out there which you can turn up with a search,
>>> e.g.
>>> https://source.coveo.com/2021/03/03/prometheus-memory/
>>>
>>> On Tuesday, 7 September 2021 at 07:34:51 UTC+1 [email protected] wrote:
>>>
>>>> Hi everyone, I am new here.
>>>>
>>>> I would like to seek some advice on the design approach we should take.
>>>> Given the problem below, how can we set up Prometheus cost-effectively
>>>> for a large cluster?
>>>>
>>>> *Variables:*
>>>> *Installation: *kube-prometheus-stack helm chart.
>>>> *Autoscale*: yes
>>>> *No. of Nodes*: 1000 up to 1300
>>>> *Mesh*: Istio
>>>> *Memory Usage:* 50GB (Still gets OOM)
>>>> *Installed: *1 Prometheus, 1 Kiali, 1 Grafana and 1 Jaeger
>>>>
>>>> *Issue:*
>>>> 1. We cannot move Prometheus to a larger node, as 60GB of memory is
>>>> already expensive (cost not approved by management).
>>>> 2. Removing unnecessary metrics is not yet advised because we do not
>>>> know which metrics Istio, Jaeger and Kiali need.
>>>>
>>>> *Tried solution:*
>>>> We have federated the single Prometheus instance with Thanos
>>>> Receivers; however, the issue is still there because Kiali queries its
>>>> data directly from Prometheus, which eventually gets OOM-killed.
>>>>
>>>> *Question:*
>>>> We are thinking of spinning up multiple Prometheus instances for each
>>>> namespace and adding thanos-sidecar with the same scrape config, since
>>>> Thanos will deduplicate all duplicated metrics. This approach would solve
>>>> the issue for Grafana queries but not for Kiali.
>>>>
>>>> How can we set up multiple (low-cost) Prometheus instances while still
>>>> giving Kiali a single Prometheus instance for the whole cluster?
>>>>
>>>> Appreciate any help. Thank you.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Prometheus Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/prometheus-users/24a15533-094e-4a4c-9644-5d4375b6aaa2n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/prometheus-users/24a15533-094e-4a4c-9644-5d4375b6aaa2n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>

