[ceph-users] Very slow snaptrim operations blocking client I/O

Victor Rodriguez Fri, 27 Jan 2023 05:52:03 -0800

Hello,

Asking for help with an issue. Maybe someone has a clue about what's going on.

Using Ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed it. A bit later, nearly half of the PGs of the pool entered the snaptrim and snaptrim_wait states, as expected. The problem is that these operations ran extremely slowly and client I/O dropped to nearly nothing, so all VMs in the cluster got stuck because they could not do I/O to the storage. Taking and removing big snapshots is a normal operation that we do often, and this is the first time I have seen this issue in any of my clusters.

Disks are all Samsung PM1733 and the network is 25G. This gives us plenty of performance for the use case, and we have never had an issue with the hardware.

Both disk I/O and network I/O were very low. Still, client I/O seemed to get queued forever. Disabling snaptrim (ceph osd set nosnaptrim) stops any active snaptrim operation and client I/O returns to normal. Enabling snaptrim again makes client I/O almost halt again.
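
A minimal sketch of how the flag can be toggled while counting PGs per snaptrim state. It assumes that `ceph pg dump pgs_brief` prints the PG state (for example active+clean+snaptrim_wait) in the second column; `count_snaptrim` is a hypothetical helper name, not a Ceph command.

```shell
# Count PGs whose state includes snaptrim or snaptrim_wait.
# Reads `ceph pg dump pgs_brief`-style lines on stdin and assumes the
# "+"-separated state string is the second whitespace-separated column.
count_snaptrim() {
    awk '{
             n = split($2, s, "+")
             for (i = 1; i <= n; i++) {
                 if (s[i] == "snaptrim") trim++
                 else if (s[i] == "snaptrim_wait") wait++
             }
         }
         END { printf "snaptrim=%d snaptrim_wait=%d\n", trim, wait }'
}

# Usage on a live cluster (left commented out so the helper can be sourced alone):
# ceph osd set nosnaptrim                              # pause snapshot trimming
# ceph pg dump pgs_brief 2>/dev/null | count_snaptrim  # see how many PGs are queued
# ceph osd unset nosnaptrim                            # resume snapshot trimming
```

Running the count before and after unsetting the flag gives a rough idea of whether trimming makes any progress at all.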


I've been playing with some settings:

ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'
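
Values set with injectargs are runtime-only and are lost when an OSD restarts, so it may be worth checking what a given OSD is actually running with. A rough sketch, assuming it runs on the host of that OSD (`ceph daemon` talks to the local admin socket); `show_snaptrim_opts` is a hypothetical helper name:

```shell
# The four options changed above, in their config-name form.
SNAPTRIM_OPTS="osd_max_trimming_pgs osd_snap_trim_sleep osd_snap_trim_sleep_ssd osd_pg_max_concurrent_snap_trims"

# Print the running value of each option for one OSD id (e.g. 0).
show_snaptrim_opts() {
    osd_id="$1"
    for opt in $SNAPTRIM_OPTS; do
        printf '%s: ' "$opt"
        ceph daemon "osd.$osd_id" config get "$opt"
    done
}

# Usage (commented out so the helper can be sourced alone):
# show_snaptrim_opts 0
```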

None really seemed to help. Also tried restarting OSD services.

This cluster was upgraded from 14.2.x to 15.2.17 a couple of months ago. Is there any setting that must be changed and may be causing this problem?

I have scheduled a maintenance window. What should I look for to diagnose this problem?


Any help is much appreciated. Thanks in advance.

Victor


