Hi,

We are running multiple Prometheus instances in Kubernetes (deployed using the Prometheus Operator) and hope that someone can help us understand why the RAM usage of a few of our instances is unexpectedly high (we think it's cardinality, but we are not sure where to look).
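So far the main places we know to look are promtool tsdb analyze and Prometheus' own TSDB stats; for reference, this is roughly how we have been pulling the "Number of Series" style stats quoted below (localhost:9090 is just a placeholder, this assumes a reasonably recent Prometheus, and we may well be looking in the wrong place):

$ curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats, .data.seriesCountByMetricName'

And a similar breakdown as a PromQL query (top 10 metric names by series count):

topk(10, count by (__name__) ({__name__=~".+"}))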
In Prometheus A, we have the following stats:

Number of Series: 56486
Number of Chunks: 56684
Number of Label Pairs: 678

promtool tsdb analyze has the following result:

/bin $ ./promtool tsdb analyze /prometheus/
Block ID: 01GQGMKZAF548DPE2DFZTF1TRW
Duration: 1h59m59.368s
Series: 56470
Label names: 26
Postings (unique label pairs): 678
Postings entries (total label pairs): 338705

This instance uses roughly 4-5GB of RAM (as measured by Kubernetes). From our reading, each time series should use around 8KB of RAM, so 56k series should be using a mere ~500MB.

On a different Prometheus instance (let's call it Prometheus Central) we have 1.1M series and it's using 9-10GB, which is roughly what we would expect.

We're curious about Prometheus A, and we believe cardinality is the cause; we have a lot more targets in Prometheus A. I also note that the "Postings entries (total label pairs)" figure is 338k, but I'm not sure where to look to break this down further. The top entries from tsdb analyze are right at the bottom of this post. The "Most common label pairs" entries have alarmingly high counts; I wonder if this contributes to the high "total label pairs" and, consequently, the higher-than-expected RAM usage.

When calculating the expected RAM usage, is "total label pairs" the number we need to use rather than the total number of series?
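To spell out the arithmetic behind that last question (the ~8KB-per-series figure is just a rule of thumb we picked up, so it may well be off):

56,470 series            x ~8KB = ~450MB  (what we expected)
338,705 postings entries x ~8KB = ~2.7GB  (much closer to the 4-5GB we actually see)

We are not sure whether multiplying the postings entries like that is meaningful at all, hence the question.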
Thanks,
Victor

Label pairs most involved in churning:
296 activity_type=none
258 workflow_type=PodUpdateWorkflow
163 __name__=temporal_request_latency_bucket
104 workflow_type=GenerateSPVarsWorkflow
95 operation=RespondActivityTaskCompleted
89 __name__=temporal_activity_execution_latency_bucket
89 __name__=temporal_activity_schedule_to_start_latency_bucket
65 workflow_type=PodInitWorkflow
53 operation=RespondWorkflowTaskCompleted
49 __name__=temporal_workflow_endtoend_latency_bucket
49 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
49 __name__=temporal_workflow_task_execution_latency_bucket
49 __name__=temporal_workflow_task_replay_latency_bucket
39 activity_type=UpdatePodConnectionsActivity
38 le=+Inf
38 le=0.02
38 le=0.1
38 le=0.001
38 activity_type=GenerateSPVarsActivity
38 le=5

Label names most involved in churning:
734 __name__
734 job
724 instance
577 activity_type
577 workflow_type
541 le
177 operation
95 datname
53 datid
31 mode
29 namespace
21 state
12 quantile
11 container
11 service
11 pod
11 endpoint
10 scrape_job
4 alertname
4 severity

Most common label pairs:
23012 activity_type=none
20060 workflow_type=PodUpdateWorkflow
12712 __name__=temporal_request_latency_bucket
8092 workflow_type=GenerateSPVarsWorkflow
7440 operation=RespondActivityTaskCompleted
6944 __name__=temporal_activity_execution_latency_bucket
6944 __name__=temporal_activity_schedule_to_start_latency_bucket
5100 workflow_type=PodInitWorkflow
4140 operation=RespondWorkflowTaskCompleted
3864 __name__=temporal_workflow_task_replay_latency_bucket
3864 __name__=temporal_workflow_endtoend_latency_bucket
3864 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
3864 __name__=temporal_workflow_task_execution_latency_bucket
3080 activity_type=UpdatePodConnectionsActivity
3004 le=0.5
3004 le=0.01
3004 le=0.1
3004 le=1
3004 le=0.001
3004 le=0.002

Label names with highest cumulative label value length:
8312 scrape_job
4279 workflow_type
3994 rule_group
2614 __name__
2478 instance
1564 job
434 datname
248 activity_type
139 mode
128 operation
109 version
97 pod
88 state
68 service
45 le
44 namespace
43 slice
31 container
28 quantile
18 alertname

Highest cardinality labels:
138 instance
138 scrape_job
84 __name__
75 workflow_type
71 datname
70 job
19 rule_group
14 le
10 activity_type
9 mode
9 quantile
6 state
6 operation
5 datid
4 slice
2 container
2 pod
2 alertname
2 version
2 service

Highest cardinality metric names:
12712 temporal_request_latency_bucket
6944 temporal_activity_execution_latency_bucket
6944 temporal_activity_schedule_to_start_latency_bucket
3864 temporal_workflow_task_schedule_to_start_latency_bucket
3864 temporal_workflow_task_replay_latency_bucket
3864 temporal_workflow_task_execution_latency_bucket
3864 temporal_workflow_endtoend_latency_bucket
2448 pg_locks_count
1632 pg_stat_activity_count
908 temporal_request
690 prometheus_target_sync_length_seconds
496 temporal_activity_execution_latency_count
350 go_gc_duration_seconds
340 pg_stat_database_tup_inserted
340 pg_stat_database_temp_bytes
340 pg_stat_database_xact_commit
340 pg_stat_database_xact_rollback
340 pg_stat_database_tup_updated
340 pg_stat_database_deadlocks
340 pg_stat_database_tup_returned