That makes sense. Hopefully the LTS support for 2.37 can be extended in
the meantime.

On Wednesday, 1 February 2023 at 10:45:34 UTC Julien Pivotto wrote:

> On 01 Feb 02:00, Brian Candler wrote:
> > Aside: is 2.42.0 going to be an LTS version?
>
> Hello,
>
> I have not updated the website yet, but 2.42 will not be an LTS version.
>
> My feeling is that we still need a few releases so that native
> histograms and OOO ingestion "stabilize". It is not about waiting for
> them to be stable, but more about making sure that any bugs
> introduced into the codebase by those two major features are noticed and
> fixed.
>
>
> > 
> > On Wednesday, 1 February 2023 at 09:35:00 UTC sup...@gmail.com wrote:
> > 
> > > Or upgrade to 2.42.0. :)
> > >
> > > On Wed, Feb 1, 2023 at 9:48 AM Julien Pivotto <roidel...@prometheus.io> wrote:
> > >
> > >> On 24 Jan 21:43, Victor Hadianto wrote:
> > >> > > Also, what version(s) of prometheus are these two instances?
> > >> > 
> > >> > They are both the same:
> > >> > prometheus, version 2.37.0 (branch: HEAD, revision:
> > >> > b41e0750abf5cc18d8233161560731de05199330)
> > >>
> > >> Please update to 2.37.5. A memory leak was fixed in 2.37.3.
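> > >>
> > >> If it helps, a quick way to check which version each running instance is
> > >> on is to query its build info metric over the HTTP API (the host/port and
> > >> the use of jq here are just assumptions, adjust to your setup):
> > >>
> > >> curl -s 'http://localhost:9090/api/v1/query?query=prometheus_build_info' \
> > >>   | jq -r '.data.result[].metric.version'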
> > >>
> > >>
> > >>
> > >> > 
> > >> > > The RAM usage of Prometheus depends on a number of factors. There's a
> > >> > > calculator embedded in this article, but it's pretty old now:
> > >> > > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
> > >> > 
> > >> > Thanks for this, I'll read & play around with that calculator for our
> > >> > Prometheus instances (we have 9 in various clusters now).
> > >> > 
> > >> > Regards,
> > >> > Victor
> > >> > 
> > >> > 
> > >> > On Tue, 24 Jan 2023 at 21:03, Brian Candler <b.ca...@pobox.com> wrote:
> > >> > 
> > >> > > Also, what version(s) of prometheus are these two instances? Different
> > >> > > versions of Prometheus are compiled using different versions of Go, which
> > >> > > in turn have different degrees of aggressiveness in returning unused RAM to
> > >> > > the operating system. Also remember Go is a garbage-collected language.
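> > >> > >
> > >> > > A rough way to see how much of the reported RSS is live heap versus memory
> > >> > > Go has already released is to compare Prometheus's own runtime metrics,
> > >> > > for example (this assumes the instance scrapes itself under job="prometheus"):
> > >> > >
> > >> > > # resident memory as seen by the OS
> > >> > > process_resident_memory_bytes{job="prometheus"}
> > >> > > # Go heap currently in use vs. heap released back to the OS
> > >> > > go_memstats_heap_inuse_bytes{job="prometheus"}
> > >> > > go_memstats_heap_released_bytes{job="prometheus"}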
> > >> > >
> > >> > > The RAM usage of Prometheus depends on a number of factors. There's a
> > >> > > calculator embedded in this article, but it's pretty old now:
> > >> > >
> > >> > > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
> > >> > >
> > >> > > On Tuesday, 24 January 2023 at 09:29:47 UTC sup...@gmail.com wrote:
> > >> > >
> > >> > >> When you say "measured by Kubernetes", what metric specifically?
> > >> > >>
> > >> > >> There are several misleading metrics. What matters is
> > >> > >> `container_memory_rss` or `container_memory_working_set_bytes`. The
> > >> > >> `container_memory_usage_bytes` metric is misleading because it includes
> > >> > >> page cache values.
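> > >> > >>
> > >> > >> For example, comparing the three side by side for the Prometheus container
> > >> > >> makes the difference visible (the label values here are only a guess,
> > >> > >> adjust namespace/container to your deployment):
> > >> > >>
> > >> > >> container_memory_usage_bytes{namespace="monitoring", container="prometheus"}
> > >> > >> container_memory_working_set_bytes{namespace="monitoring", container="prometheus"}
> > >> > >> container_memory_rss{namespace="monitoring", container="prometheus"}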
> > >> > >>
> > >> > >> On Tue, Jan 24, 2023 at 10:20 AM Victor H <vhad...@gmail.com> wrote:
> > >> > >>
> > >> > >>> Hi,
> > >> > >>>
> > >> > >>> We are running multiple Prometheus instances in Kubernetes (deployed
> > >> > >>> using Prometheus Operator) and hope that someone can help us understand
> > >> > >>> why the RAM usage of a few of our instances is unexpectedly high (we
> > >> > >>> think it's cardinality, but we're not sure where to look).
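> > >> > >>>
> > >> > >>> Would something like the query below be the right way to see which metric
> > >> > >>> names contribute the most series? (Sketched from the docs, so it may well
> > >> > >>> be the wrong approach.)
> > >> > >>>
> > >> > >>> topk(10, count by (__name__) ({__name__=~".+"}))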
> > >> > >>>
> > >> > >>> In Prometheus A, we have the following stat:
> > >> > >>>
> > >> > >>> Number of Series: 56486
> > >> > >>> Number of Chunks: 56684
> > >> > >>> Number of Label Pairs: 678
> > >> > >>>
> > >> > >>> tsdb analyze has the following result:
> > >> > >>>
> > >> > >>> /bin $ ./promtool tsdb analyze /prometheus/
> > >> > >>> Block ID: 01GQGMKZAF548DPE2DFZTF1TRW
> > >> > >>> Duration: 1h59m59.368s
> > >> > >>> Series: 56470
> > >> > >>> Label names: 26
> > >> > >>> Postings (unique label pairs): 678
> > >> > >>> Postings entries (total label pairs): 338705
> > >> > >>>
> > >> > >>> This instance uses roughly between 4GB and 5GB of RAM (measured by
> > >> > >>> Kubernetes).
> > >> > >>>
> > >> > >>> From our reading, each time series should use around 8KB of RAM, so
> > >> > >>> 56k series should be using a mere 500MB.
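> > >> > >>>
> > >> > >>> Is dividing the process RSS by the head series count, e.g. the expression
> > >> > >>> below, a reasonable sanity check of the per-series cost? (This assumes
> > >> > >>> the instance scrapes itself under job="prometheus".)
> > >> > >>>
> > >> > >>> process_resident_memory_bytes{job="prometheus"}
> > >> > >>>   / prometheus_tsdb_head_series{job="prometheus"}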
> > >> > >>>
> > >> > >>> On a different Prometheus instance (let's call it Prometheus Central) we
> > >> > >>> have 1.1m series and it's using 9GB - 10GB, which is roughly what is
> > >> > >>> expected.
> > >> > >>>
> > >> > >>> We're curious about this instance and we believe it's cardinality. We
> > >> > >>> have a lot more targets in Prometheus A. I also note that the Postings
> > >> > >>> entries (total label pairs) figure is 338k, but I'm not sure where to
> > >> > >>> look for this.
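> > >> > >>>
> > >> > >>> Is the TSDB status endpoint the right place to check this on a live
> > >> > >>> instance? Something like the below (port and jq field names are taken
> > >> > >>> from the API docs as I remember them):
> > >> > >>>
> > >> > >>> curl -s 'http://localhost:9090/api/v1/status/tsdb' \
> > >> > >>>   | jq '.data.headStats, .data.seriesCountByLabelValuePair'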
> > >> > >>>
> > >> > >>> The top entries from tsdb analyze are right at the bottom of this post.
> > >> > >>> The "most common label pairs" entries have alarmingly high counts; I
> > >> > >>> wonder if this contributes to the high "total label pairs" and
> > >> > >>> consequently the higher than expected RAM usage.
> > >> > >>>
> > >> > >>> When calculating the expected RAM usage, is the "total label pairs"
> > >> > >>> the number we need to use rather than the "total series"?
> > >> > >>>
> > >> > >>> Thanks,
> > >> > >>> Victor
> > >> > >>>
> > >> > >>>
> > >> > >>> Label pairs most involved in churning:
> > >> > >>> 296 activity_type=none
> > >> > >>> 258 workflow_type=PodUpdateWorkflow
> > >> > >>> 163 __name__=temporal_request_latency_bucket
> > >> > >>> 104 workflow_type=GenerateSPVarsWorkflow
> > >> > >>> 95 operation=RespondActivityTaskCompleted
> > >> > >>> 89 __name__=temporal_activity_execution_latency_bucket
> > >> > >>> 89 __name__=temporal_activity_schedule_to_start_latency_bucket
> > >> > >>> 65 workflow_type=PodInitWorkflow
> > >> > >>> 53 operation=RespondWorkflowTaskCompleted
> > >> > >>> 49 __name__=temporal_workflow_endtoend_latency_bucket
> > >> > >>> 49 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
> > >> > >>> 49 __name__=temporal_workflow_task_execution_latency_bucket
> > >> > >>> 49 __name__=temporal_workflow_task_replay_latency_bucket
> > >> > >>> 39 activity_type=UpdatePodConnectionsActivity
> > >> > >>> 38 le=+Inf
> > >> > >>> 38 le=0.02
> > >> > >>> 38 le=0.1
> > >> > >>> 38 le=0.001
> > >> > >>> 38 activity_type=GenerateSPVarsActivity
> > >> > >>> 38 le=5
> > >> > >>>
> > >> > >>> Label names most involved in churning:
> > >> > >>> 734 __name__
> > >> > >>> 734 job
> > >> > >>> 724 instance
> > >> > >>> 577 activity_type
> > >> > >>> 577 workflow_type
> > >> > >>> 541 le
> > >> > >>> 177 operation
> > >> > >>> 95 datname
> > >> > >>> 53 datid
> > >> > >>> 31 mode
> > >> > >>> 29 namespace
> > >> > >>> 21 state
> > >> > >>> 12 quantile
> > >> > >>> 11 container
> > >> > >>> 11 service
> > >> > >>> 11 pod
> > >> > >>> 11 endpoint
> > >> > >>> 10 scrape_job
> > >> > >>> 4 alertname
> > >> > >>> 4 severity
> > >> > >>>
> > >> > >>> Most common label pairs:
> > >> > >>> 23012 activity_type=none
> > >> > >>> 20060 workflow_type=PodUpdateWorkflow
> > >> > >>> 12712 __name__=temporal_request_latency_bucket
> > >> > >>> 8092 workflow_type=GenerateSPVarsWorkflow
> > >> > >>> 7440 operation=RespondActivityTaskCompleted
> > >> > >>> 6944 __name__=temporal_activity_execution_latency_bucket
> > >> > >>> 6944 __name__=temporal_activity_schedule_to_start_latency_bucket
> > >> > >>> 5100 workflow_type=PodInitWorkflow
> > >> > >>> 4140 operation=RespondWorkflowTaskCompleted
> > >> > >>> 3864 __name__=temporal_workflow_task_replay_latency_bucket
> > >> > >>> 3864 __name__=temporal_workflow_endtoend_latency_bucket
> > >> > >>> 3864 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
> > >> > >>> 3864 __name__=temporal_workflow_task_execution_latency_bucket
> > >> > >>> 3080 activity_type=UpdatePodConnectionsActivity
> > >> > >>> 3004 le=0.5
> > >> > >>> 3004 le=0.01
> > >> > >>> 3004 le=0.1
> > >> > >>> 3004 le=1
> > >> > >>> 3004 le=0.001
> > >> > >>> 3004 le=0.002
> > >> > >>>
> > >> > >>> Label names with highest cumulative label value length:
> > >> > >>> 8312 scrape_job
> > >> > >>> 4279 workflow_type
> > >> > >>> 3994 rule_group
> > >> > >>> 2614 __name__
> > >> > >>> 2478 instance
> > >> > >>> 1564 job
> > >> > >>> 434 datname
> > >> > >>> 248 activity_type
> > >> > >>> 139 mode
> > >> > >>> 128 operation
> > >> > >>> 109 version
> > >> > >>> 97 pod
> > >> > >>> 88 state
> > >> > >>> 68 service
> > >> > >>> 45 le
> > >> > >>> 44 namespace
> > >> > >>> 43 slice
> > >> > >>> 31 container
> > >> > >>> 28 quantile
> > >> > >>> 18 alertname
> > >> > >>>
> > >> > >>> Highest cardinality labels:
> > >> > >>> 138 instance
> > >> > >>> 138 scrape_job
> > >> > >>> 84 __name__
> > >> > >>> 75 workflow_type
> > >> > >>> 71 datname
> > >> > >>> 70 job
> > >> > >>> 19 rule_group
> > >> > >>> 14 le
> > >> > >>> 10 activity_type
> > >> > >>> 9 mode
> > >> > >>> 9 quantile
> > >> > >>> 6 state
> > >> > >>> 6 operation
> > >> > >>> 5 datid
> > >> > >>> 4 slice
> > >> > >>> 2 container
> > >> > >>> 2 pod
> > >> > >>> 2 alertname
> > >> > >>> 2 version
> > >> > >>> 2 service
> > >> > >>>
> > >> > >>> Highest cardinality metric names:
> > >> > >>> 12712 temporal_request_latency_bucket
> > >> > >>> 6944 temporal_activity_execution_latency_bucket
> > >> > >>> 6944 temporal_activity_schedule_to_start_latency_bucket
> > >> > >>> 3864 temporal_workflow_task_schedule_to_start_latency_bucket
> > >> > >>> 3864 temporal_workflow_task_replay_latency_bucket
> > >> > >>> 3864 temporal_workflow_task_execution_latency_bucket
> > >> > >>> 3864 temporal_workflow_endtoend_latency_bucket
> > >> > >>> 2448 pg_locks_count
> > >> > >>> 1632 pg_stat_activity_count
> > >> > >>> 908 temporal_request
> > >> > >>> 690 prometheus_target_sync_length_seconds
> > >> > >>> 496 temporal_activity_execution_latency_count
> > >> > >>> 350 go_gc_duration_seconds
> > >> > >>> 340 pg_stat_database_tup_inserted
> > >> > >>> 340 pg_stat_database_temp_bytes
> > >> > >>> 340 pg_stat_database_xact_commit
> > >> > >>> 340 pg_stat_database_xact_rollback
> > >> > >>> 340 pg_stat_database_tup_updated
> > >> > >>> 340 pg_stat_database_deadlocks
> > >> > >>> 340 pg_stat_database_tup_returned
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >>
> > >> -- 
> > >> Julien Pivotto
> > >> @roidelapluie
> > >>
> > >
> > 
>
>
> -- 
> Julien Pivotto
> @roidelapluie
>
