In my Ceph 19.2.2 (Squid) cluster on Rocky Linux 9, I am trying to determine 
whether I am having issues with OSD latency. The following URL:

https://sysdig.com/blog/monitoring-ceph-prometheus/

states the following about Prometheus metrics:

* ceph_osd_op_r_latency_count: Returns the number of reading operations running.
* ceph_osd_op_r_latency_sum: Returns the time, in milliseconds, taken by the 
reading operations. This metric includes the queue time.
* ceph_osd_op_w_latency_count: Returns the number of writing operations running.
* ceph_osd_op_w_latency_sum: Returns the time, in milliseconds, taken by the 
writing operations. This metric includes the queue time.

and

* ceph_osd_commit_latency_ms: Returns the time it takes OSD to read or write to 
the journal.
* ceph_osd_apply_latency_ms: Returns the time it takes to write the journal to 
the physical disk.
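
For what it's worth, these last two appear to be gauges already expressed in 
milliseconds, so I assume they can be graphed directly, without rate() -- 
e.g. something like (my own ad-hoc queries, not from any shipped dashboard):

ceph_osd_commit_latency_ms        # per-OSD commit latency, in ms
avg(ceph_osd_apply_latency_ms)    # cluster-wide average apply latency, in ms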


The first set states 'includes the queue time'. What exactly does this mean? 
Is it the time an operation spends waiting in the memory of the ceph-osd 
daemon before being written to the journal? If so, do the latter two metrics 
measure the time it takes, once the writes start, to write to the journal or 
to the physical disk?

Does the first set of metrics *include* the latter? In other words, are the 
apply/commit latencies included in the *[r,w]_latency_sum?

The URL above suggests that to calculate the write latency for a given OSD, you 
do the following:

(rate(ceph_osd_op_w_latency_sum[5m]) /
 rate(ceph_osd_op_w_latency_count[5m]) >= 0)
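
and presumably the analogous expression for reads:

(rate(ceph_osd_op_r_latency_sum[5m]) /
 rate(ceph_osd_op_r_latency_count[5m]) >= 0)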

However, the Grafana dashboard 'OSD Overview' (in the ceph-dashboards-19.2.2 
rpm) does something very similar for the max, avg, and quantile panels:

max(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[$__rate_interval]) /
  on (ceph_daemon)
  rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[$__rate_interval])
  * 1000)
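
To make the units question concrete (numbers made up): if, over the interval, 
rate(ceph_osd_op_w_latency_sum) = 0.5 and rate(ceph_osd_op_w_latency_count) = 
100 ops/s, that panel plots 0.5 / 100 * 1000 = 5. That is 5 ms per op if the 
sum is counted in seconds, but if the sum were already in milliseconds the 
true average would be 0.005 ms and the panel would overstate it 1000-fold.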

The extra factor of 1000 seems extraneous, given that *latency_sum is 
already in milliseconds (per the blog above) and the graph itself is labeled 
'ms'. It led me to think I have disk latency issues, as these numbers are 
high. (Maybe I do, and maybe I misunderstand something.)

Other dashboards, such as 'Ceph Cluster - Advanced', use similar promQL 
expressions in the 'OSD Commit Latency Distribution' panel, but without the 
extra factor of 1000, which looks a lot more reasonable when evaluating my 
latencies.
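
From memory, the shape of that panel's expression is roughly the following 
(my paraphrase, not the exact JSON; the metric and labels may differ):

avg(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[$__rate_interval]) /
  on (ceph_daemon)
  rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[$__rate_interval]))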

So the question is: is the extra factor of 1000 incorrect in the 'OSD 
Overview' dashboard? Or am I not understanding things correctly?

Also, does 'ceph osd perf' just show the apply/commit latency metrics from 
above?
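
For reference, the output I am looking at has this shape (values made up):

$ ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  2                  14                 14
  1                   9                  9
  0                  11                 11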

Thanks for any assistance.
-Chris
