> > So the question is, is the extra multiple of 1000 incorrect in the 'OSD
> > Overview' dashboard? Or am I not understanding things correctly?
latency_count is an integer; it returns the number of samples. latency_sum
is the sum of the latencies of those _count samples, in seconds (so you
multiply it by 1000 to get ms).

On Thu, 24 Jul 2025 at 20:44, Christopher Durham <caduceu...@aol.com> wrote:

> In my 19.2.2/squid cluster (Rocky 9 Linux), I am trying to determine if I
> am having issues with OSD latency. The following URL:
>
> https://sysdig.com/blog/monitoring-ceph-prometheus/
>
> states the following about prometheus metrics:
>
> * ceph_osd_op_r_latency_count: Returns the number of read operations
>   running.
> * ceph_osd_op_r_latency_sum: Returns the time, in milliseconds, taken by
>   the read operations. This metric includes the queue time.
> * ceph_osd_op_w_latency_count: Returns the number of write operations
>   running.
> * ceph_osd_op_w_latency_sum: Returns the time, in milliseconds, taken by
>   the write operations. This metric includes the queue time.
>
> and
>
> * ceph_osd_commit_latency_ms: Returns the time it takes the OSD to read
>   or write to the journal.
> * ceph_osd_apply_latency_ms: Returns the time it takes to write the
>   journal to the physical disk.
>
> The first set states 'includes the queue time'. What exactly does this
> mean? Does this mean that this is the time spent waiting in the memory of
> the ceph-osd daemon before writing to the journal? If so, do the latter
> two metrics mean that once the writes start, this is the time it takes to
> write to the journal or the disk?
>
> Does the first set of metrics *include* the latter? In other words, are
> the apply/commit latencies included in the *[r,w]_latency_sum?
> The URL above suggests that to calculate the write latency for a given
> OSD, you do the following:
>
> (rate(ceph_osd_op_w_latency_sum[5m]) /
>   rate(ceph_osd_op_w_latency_count[5m]) >= 0)
>
> However, the grafana dashboard 'OSD Overview' (in the
> ceph-dashboards-19.2.2 rpm) does something very similar for
> max/avg/quantile:
>
> max(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[$__rate_interval]) /
>   on (ceph_daemon)
>   rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[$__rate_interval])
>   * 1000)
>
> The extra multiple of 1000 seems extraneous given that *_latency_sum is
> already in milliseconds, and the graph itself shows 'ms'. It led me to
> think I have disk latency issues, as these numbers are high (maybe I do,
> and maybe I misunderstand something).
>
> Other dashboards, such as 'Ceph Cluster - Advanced', use similar promQL
> expressions in the 'OSD Commit Latency Distribution' panel, but without
> the extra multiple of 1000, which looks a lot better for evaluation of my
> latencies.
>
> So the question is, is the extra multiple of 1000 incorrect in the 'OSD
> Overview' dashboard? Or am I not understanding things correctly?
>
> Also, does 'ceph osd perf' just show the apply/commit sum/count metrics
> from above?
>
> Thanks for any assistance.
> -Chris
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Łukasz Borek
luk...@borek.org.pl
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
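
[Editor's note] Taking the answer in this thread at face value (the
*_latency_sum counters accumulate seconds, the *_latency_count counters
accumulate sample counts), the dashboard's extra `* 1000` is just a
seconds-to-milliseconds conversion, not a bug. A minimal sketch of that
arithmetic, with hypothetical counter values from two successive scrapes:

```python
# Sketch of the avg-latency calculation, assuming (per the reply above):
#   *_latency_sum   = cumulative op latency in SECONDS
#   *_latency_count = cumulative number of ops sampled
# The scrape values below are made up for illustration.

def avg_latency_ms(sum_prev, sum_now, count_prev, count_now):
    """Average op latency in ms over the window between two scrapes."""
    dsum = sum_now - sum_prev        # seconds spent on ops in the window
    dcount = count_now - count_prev  # ops completed in the window
    if dcount == 0:
        return 0.0                   # no ops: avoid division by zero
    return dsum / dcount * 1000.0    # seconds -> milliseconds

# 50 writes took 0.5 s in total -> 10 ms average
print(avg_latency_ms(12.0, 12.5, 1000, 1050))  # -> 10.0
```

This mirrors what `rate(..._sum) / rate(..._count) * 1000` computes in
PromQL, since the two `rate()` time denominators cancel; under that reading
it is the sysdig blog's claim that *_latency_sum is in milliseconds that is
misleading, not the 'OSD Overview' query.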