Hi,

We are running multiple Prometheus instances in Kubernetes (deployed using the Prometheus Operator) and hope that someone can help us understand why the RAM usage of a few of our instances is unexpectedly high (we think it's cardinality, but we are not sure where to look).
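So far the main places we know to look are promtool tsdb analyze and Prometheus' own TSDB stats; for reference, this is roughly how we have been pulling the "Number of Series" style stats quoted below (localhost:9090 is just a placeholder, this assumes a reasonably recent Prometheus, and we may well be looking in the wrong place):

$ curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats, .data.seriesCountByMetricName'

And a similar breakdown as a PromQL query (top 10 metric names by series count):

topk(10, count by (__name__) ({__name__=~".+"}))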
In Prometheus A, we have the following stats:

Number of Series: 56486
Number of Chunks: 56684
Number of Label Pairs: 678

promtool tsdb analyze has the following result:

/bin $ ./promtool tsdb analyze /prometheus/
Block ID: 01GQGMKZAF548DPE2DFZTF1TRW
Duration: 1h59m59.368s
Series: 56470
Label names: 26
Postings (unique label pairs): 678
Postings entries (total label pairs): 338705

This instance uses roughly 4-5GB of RAM (as measured by Kubernetes). From our reading, each time series should use around 8KB of RAM, so 56k series should be using a mere ~500MB.

On a different Prometheus instance (let's call it Prometheus Central) we have 1.1M series and it's using 9-10GB, which is roughly what we would expect.

We're curious about Prometheus A, and we believe cardinality is the cause; we have a lot more targets in Prometheus A. I also note that the "Postings entries (total label pairs)" figure is 338k, but I'm not sure where to look to break this down further. The top entries from tsdb analyze are right at the bottom of this post. The "Most common label pairs" entries have alarmingly high counts; I wonder if this contributes to the high "total label pairs" and, consequently, the higher-than-expected RAM usage.

When calculating the expected RAM usage, is "total label pairs" the number we need to use rather than the total number of series?
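To spell out the arithmetic behind that last question (the ~8KB-per-series figure is just a rule of thumb we picked up, so it may well be off):

56,470 series            x ~8KB = ~450MB  (what we expected)
338,705 postings entries x ~8KB = ~2.7GB  (much closer to the 4-5GB we actually see)

We are not sure whether multiplying the postings entries like that is meaningful at all, hence the question.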
Thanks,
Victor

Label pairs most involved in churning:
296 activity_type=none
258 workflow_type=PodUpdateWorkflow
163 __name__=temporal_request_latency_bucket
104 workflow_type=GenerateSPVarsWorkflow
95 operation=RespondActivityTaskCompleted
89 __name__=temporal_activity_execution_latency_bucket
89 __name__=temporal_activity_schedule_to_start_latency_bucket
65 workflow_type=PodInitWorkflow
53 operation=RespondWorkflowTaskCompleted
49 __name__=temporal_workflow_endtoend_latency_bucket
49 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
49 __name__=temporal_workflow_task_execution_latency_bucket
49 __name__=temporal_workflow_task_replay_latency_bucket
39 activity_type=UpdatePodConnectionsActivity
38 le=+Inf
38 le=0.02
38 le=0.1
38 le=0.001
38 activity_type=GenerateSPVarsActivity
38 le=5

Label names most involved in churning:
734 __name__
734 job
724 instance
577 activity_type
577 workflow_type
541 le
177 operation
95 datname
53 datid
31 mode
29 namespace
21 state
12 quantile
11 container
11 service
11 pod
11 endpoint
10 scrape_job
4 alertname
4 severity

Most common label pairs:
23012 activity_type=none
20060 workflow_type=PodUpdateWorkflow
12712 __name__=temporal_request_latency_bucket
8092 workflow_type=GenerateSPVarsWorkflow
7440 operation=RespondActivityTaskCompleted
6944 __name__=temporal_activity_execution_latency_bucket
6944 __name__=temporal_activity_schedule_to_start_latency_bucket
5100 workflow_type=PodInitWorkflow
4140 operation=RespondWorkflowTaskCompleted
3864 __name__=temporal_workflow_task_replay_latency_bucket
3864 __name__=temporal_workflow_endtoend_latency_bucket
3864 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
3864 __name__=temporal_workflow_task_execution_latency_bucket
3080 activity_type=UpdatePodConnectionsActivity
3004 le=0.5
3004 le=0.01
3004 le=0.1
3004 le=1
3004 le=0.001
3004 le=0.002

Label names with highest cumulative label value length:
8312 scrape_job
4279 workflow_type
3994 rule_group
2614 __name__
2478 instance
1564 job
434 datname
248 activity_type
139 mode
128 operation
109 version
97 pod
88 state
68 service
45 le
44 namespace
43 slice
31 container
28 quantile
18 alertname

Highest cardinality labels:
138 instance
138 scrape_job
84 __name__
75 workflow_type
71 datname
70 job
19 rule_group
14 le
10 activity_type
9 mode
9 quantile
6 state
6 operation
5 datid
4 slice
2 container
2 pod
2 alertname
2 version
2 service

Highest cardinality metric names:
12712 temporal_request_latency_bucket
6944 temporal_activity_execution_latency_bucket
6944 temporal_activity_schedule_to_start_latency_bucket
3864 temporal_workflow_task_schedule_to_start_latency_bucket
3864 temporal_workflow_task_replay_latency_bucket
3864 temporal_workflow_task_execution_latency_bucket
3864 temporal_workflow_endtoend_latency_bucket
2448 pg_locks_count
1632 pg_stat_activity_count
908 temporal_request
690 prometheus_target_sync_length_seconds
496 temporal_activity_execution_latency_count
350 go_gc_duration_seconds
340 pg_stat_database_tup_inserted
340 pg_stat_database_temp_bytes
340 pg_stat_database_xact_commit
340 pg_stat_database_xact_rollback
340 pg_stat_database_tup_updated
340 pg_stat_database_deadlocks
340 pg_stat_database_tup_returned