>
> So the question is, is the extra multiple of 1000 incorrect in the 'OSD
> Overview' dashboard? Or am I not understanding things correctly?

latency_count is an integer and returns the number of samples; latency_sum is
the sum of the latencies of those _count samples, in seconds (so you multiply
it by 1000 to get ms)
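
For example (a minimal sketch, assuming the standard metric names and a
5-minute window), per-OSD average write latency in ms would be:

rate(ceph_osd_op_w_latency_sum[5m]) / on (ceph_daemon)
rate(ceph_osd_op_w_latency_count[5m]) * 1000

i.e. the "* 1000" converts the seconds in _sum into the ms shown on the
graph; it is not an extra factor applied to values that are already in ms.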

On Thu, 24 Jul 2025 at 20:44, Christopher Durham <caduceu...@aol.com> wrote:

> In my 19.2.2/Squid cluster (Rocky 9 Linux), I am trying to determine if I
> am having
> issues with OSD latency. The following URL:
>
> https://sysdig.com/blog/monitoring-ceph-prometheus/
>
> states the following about prometheus metrics:
>
> * ceph_osd_op_r_latency_count: Returns the number of reading operations
> running.
> * ceph_osd_op_r_latency_sum: Returns the time, in milliseconds, taken by
> the reading operations. This metric includes the queue time.
> * ceph_osd_op_w_latency_count: Returns the number of writing operations
> running.
> * ceph_osd_op_w_latency_sum: Returns the time, in milliseconds, taken by
> the writing operations. This metric includes the queue time.
>
> and
>
> * ceph_osd_commit_latency_ms: Returns the time it takes OSD to read or
> write to the journal.
> * ceph_osd_apply_latency_ms: Returns the time it takes to write the
> journal to the physical disk.
>
>
> The first set states 'includes the queue time'. What exactly does this
> mean? Does this mean that this is the time waiting before writing to the
> journal while in the memory of the ceph-osd daemon? If so, do the latter
> two metrics mean that, once the writes start,
> this is the time it takes to write to the journal or the disk?
>
> Does the first set of metrics *include* the latter? In other words, are
> the apply/commit latencies included in the *[r,w]_latency_sum?
>
> The URL above suggests that to calculate the write latency for a given
> OSD, you do the following:
>
> (rate(ceph_osd_op_w_latency_sum[5m]) /
> rate(ceph_osd_op_w_latency_count[5m]) >= 0)
>
> However, the grafana dashboard 'OSD Overview' (in ceph-dashboards-19.2.2
> rpm), does something very similar for max, avg, and quantile:
>
> max(rate(ceph_osd_op_w_max_latency_sum{cluster="$cluster",}[$__rate_interval])
> / on (ceph_daemon)
> rate(ceph_osd_op_w_max_latency_count{cluster="$cluster",}[$__rate_interval])
> * 1000)
>
> The extra multiple of 1000 seems extraneous, given that *latency_sum is
> (per the URL above) already in milliseconds and the graph itself shows 'ms'.
> This led me to think I have disk latency issues, as these numbers are high
> (maybe I do, and maybe I misunderstand something).
>
> Other dashboards, such as 'Ceph Cluster - Advanced', use similar PromQL
> expressions in the 'OSD Commit Latency Distribution' panel but without the
> extra multiple of 1000, which looks a lot better for evaluating my latencies.
>
> So the question is, is the extra multiple of 1000 incorrect in the 'OSD
> Overview' dashboard? Or am I not understanding things correctly?
>
> Also, does 'ceph osd perf' just show the apply/commit sum/count metrics
> from above?
>
> Thanks for any assistance.
> -Chris
>


-- 
Ɓukasz Borek
luk...@borek.org.pl
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
