Hi all,

New Quincy cluster here that I'm currently running some benchmarks against:

ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
11 nodes, each with 24x 18TB HDD OSDs and 2x 2.9TB SSD OSDs

I'm seeing a delay of almost exactly 10 minutes between removing an OSD/node 
from the cluster and actual recovery IO beginning. This is very different 
behaviour from what I was used to on Nautilus, where recovery IO would 
commence within seconds. Downed OSDs are reflected in ceph health within a few 
seconds (as expected), and the affected PGs show as undersized a few seconds 
later (as expected). I suppose this 10-minute delay may even be a feature: 
accidentally rebooting a node before setting recovery flags wouldn't 
immediately trigger rebalancing, for example. Just thought it was worth asking 
in case it's a bug or something to look deeper into.
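
For clarity, by "recovery flags" I mean the usual flags we'd set before 
planned maintenance, e.g. something like:

[ceph: root@ /]# ceph osd set noout
[ceph: root@ /]# ceph osd set norebalance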

I've read through the OSD config reference 
(https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/) and all 
of my recovery tunables look OK, for example:

[ceph: root@ /]# ceph config get osd osd_recovery_delay_start
0.000000
[ceph: root@ /]# ceph config get osd osd_recovery_sleep
0.000000
[ceph: root@ /]# ceph config get osd osd_recovery_sleep_hdd
0.100000
[ceph: root@ /]# ceph config get osd osd_recovery_sleep_ssd
0.000000
[ceph: root@ /]# ceph config get osd osd_recovery_sleep_hybrid
0.025000
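
One other knob I haven't double-checked yet is on the mon side rather than the 
OSDs: the down/out interval (mon_osd_down_out_interval), which I believe 
defaults to 600 seconds and would line up with the ~10 minutes I'm seeing. 
I'll check it with something like:

[ceph: root@ /]# ceph config get mon mon_osd_down_out_interval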

Thanks in advance.

Ngā mihi,

Sean Matheny
HPC Cloud Platform DevOps Lead
New Zealand eScience Infrastructure (NeSI)

e: sean.math...@nesi.org.nz



