I encountered persistent SLOW_OPS warnings just a few days ago on a recently
upgraded 13.2.8 (Mimic) cluster, which has an SSD pool and an HDD pool. All
OSDs are BlueStore, and we're not using separate WAL/DB volumes. The HDD pool
is used more or less for cold storage, so performance there is not critical.
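
In case it matters for reproducing this, the layout can be confirmed per OSD
along these lines (osd.12 is just a placeholder; bluefs_dedicated_db/wal come
back as 0 when there is no separate device):

    # Confirm the object store type and whether a dedicated DB/WAL exists
    ceph osd metadata 12 | grep -E '"osd_objectstore"|"bluefs_dedicated_(db|wal)"'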

One OSD in particular (an HDD) was reporting the SLOW_OPS. I suspected the
drive was on its way out, but its SMART stats looked fine, and there were no
I/O errors in the kernel log. Restarting that OSD helped initially, but
eventually the SLOW_OPS started to pile up again.
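
For reference, the triage went roughly like the sketch below; osd.42 and
/dev/sdX are placeholders for the suspect OSD and its backing drive, not our
actual names:

    # Which OSDs are reporting slow ops right now
    ceph health detail | grep -i slow

    # On the OSD's host: inspect recently completed ops and their latencies
    ceph daemon osd.42 dump_historic_ops

    # Check the backing drive and kernel log for hardware trouble
    smartctl -a /dev/sdX
    dmesg -T | grep -iE 'sdX|I/O error'

    # Restart just that one OSD
    systemctl restart ceph-osd@42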

We have a fair number of VMs running from RBD images, most of them on the SSD
pool but a few on HDD. Most of the VMs are configured with a weekly fstrim
cron job, and QEMU is configured to pass discard commands down to Ceph. One
VM, however, which holds a bunch of 50 GB files as part of a Bareos setup (a
fork of Bacula), has its filesystem mounted with the discard option, so it
trims immediately when files are deleted. I tracked the SLOW_OPS to a time
period during which that VM was recycling (i.e., deleting and trimming) some
of those large 50 GB files; with the default 4 MiB RBD object size, a single
50 GB file maps to on the order of 12,000 RADOS objects. In other words, it
seems there might be a performance regression in deleting large numbers of
RADOS objects at once.
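
For completeness, the discard plumbing on our side looks roughly like this;
the device names, mount point, and libvirt fragment are illustrative rather
than copied from our configs:

    # libvirt disk driver: pass the guest's discards through to RBD
    <driver name='qemu' type='raw' discard='unmap'/>

    # Most VMs: batch trim once a week via cron instead of inline discard
    # /etc/cron.weekly/fstrim
    #!/bin/sh
    fstrim -av

    # The Bareos VM: inline discard, so deletes trim immediately
    # /etc/fstab entry
    /dev/vdb1  /srv/bareos  xfs  defaults,discard  0 0

The practical difference is that the weekly fstrim spreads the deletes out
over time, while the discard mount option hands Ceph all of the object
deletions for a 50 GB file in one burst.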