In my 19.2.2 (Squid) cluster on Rocky Linux 9, I am trying to determine whether I am having issues with OSD latency.

The following URL, https://sysdig.com/blog/monitoring-ceph-prometheus/, states the following about the Prometheus metrics:

* ceph_osd_op_r_latency_count: Returns the number of reading operations running.
* ceph_osd_op_r_latency_sum: Returns the time, in milliseconds, taken by the reading operations. This metric includes the queue time.
* ceph_osd_op_w_latency_count: Returns the number of writing operations running.
* ceph_osd_op_w_latency_sum: Returns the time, in milliseconds, taken by the writing operations. This metric includes the queue time.

and

* ceph_osd_commit_latency_ms: Returns the time it takes the OSD to read or write to the journal.
* ceph_osd_apply_latency_ms: Returns the time it takes to write the journal to the physical disk.

The first set says the metric 'includes the queue time'. What exactly does this mean? Is this the time spent waiting in the memory of the ceph-osd daemon before the write to the journal begins? If so, do the latter two metrics mean that, once the writes start, this is the time it takes to write to the journal or the disk? Does the first set of metrics *include* the latter? In other words, are the apply/commit latencies included in the *_[r,w]_latency_sum?

The URL above suggests that to calculate the write latency for a given OSD, you do the following:

  (rate(ceph_osd_op_w_latency_sum[5m]) / rate(ceph_osd_op_w_latency_count[5m]) >= 0)

However, the Grafana dashboard 'OSD Overview' (in the ceph-dashboards-19.2.2 rpm) does something very similar for max/avg/quantile:

  max(rate(ceph_osd_op_w_max_latency_sum{cluster="$cluster"}[$__rate_interval])
      / on (ceph_daemon)
      rate(ceph_osd_op_w_max_latency_count{cluster="$cluster"}[$__rate_interval]) * 1000)

The extra multiple of 1000 seems extraneous, given that *_latency_sum is supposedly already in milliseconds and the graph itself is labeled 'ms'. It led me to think I have disk latency issues, as these numbers are high.
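To make the comparison concrete, here is a sketch of the two variants side by side, using the generic ceph_osd_op_w_latency_* metric names; the label selector and windows are illustrative, not copied from any particular dashboard:

```promql
# Per-OSD average write latency as the blog computes it:
# sum and count rates over the same window, so the units of the
# result are whatever units *_latency_sum is recorded in.
rate(ceph_osd_op_w_latency_sum[5m])
  / rate(ceph_osd_op_w_latency_count[5m]) >= 0

# The dashboard form, which additionally multiplies by 1000.
# That factor is a seconds-to-milliseconds conversion, so it is
# only appropriate if *_latency_sum is recorded in seconds; if
# the sum is already in milliseconds, it inflates the result 1000x.
max(
  rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[$__rate_interval])
    / on (ceph_daemon)
  rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[$__rate_interval])
  * 1000
)
```

Whichever unit is correct, the two expressions cannot both be right for a panel labeled 'ms', which is the crux of my question.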
(Maybe I do, and maybe I misunderstand something.) Other dashboards, such as 'Ceph Cluster - Advanced', use similar PromQL expressions in the 'OSD Commit Latency Distribution' panel, but without the extra multiple of 1000, which looks a lot more reasonable when evaluating my latencies.

So the question is: is the extra multiple of 1000 incorrect in the 'OSD Overview' dashboard? Or am I not understanding things correctly?

Also, does 'ceph osd perf' just show the apply/commit sum/count metrics from above?

Thanks for any assistance.

-Chris
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io