I set up an all-SSD Luminous 12.2.11 cluster and realized after data had been
added that pg_num was not set properly on the default.rgw.buckets.data pool
(where all the data goes). I adjusted the setting upward, but recovery is
going really slowly (around 56-110 MiB/s, ticking down at about 0.002 per
`ceph -w` log entry). These are all SSDs on Luminous 12.2.11 (no journal
drives) with two 10Gb twinax fiber links in a bonded LACP config. There are
six servers and 60 OSDs, each OSD 2TB. About 4TB of data (3 million objects)
had been added to the cluster before I noticed the red blinking lights.
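For context, the pg_num bump was a standard pool resize along these lines
(the target PG count here is illustrative, not the exact value I used):

```
ceph osd pool set default.rgw.buckets.data pg_num 1024
ceph osd pool set default.rgw.buckets.data pgp_num 1024
```

Note that pgp_num has to be raised to match, since the new PGs don't actually
start rebalancing until the placement count follows.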
I tried adjusting the recovery settings:

```
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 30'
```
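To confirm the injected values actually took effect, you can read them back
from an OSD's admin socket (osd.0 here is just an example daemon):

```
ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_max_active
```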
Those changes helped a little, but didn't have the impact I was looking for.
I have used these settings on HDD clusters before to speed things up (though
with 8 backfills and 4 recovery-max-active). Did I miss something, or is this
just part of the PG expansion process? Should I be doing something else on
SSD clusters?
Regards,
-Brent
Existing Clusters:
Test: Luminous 12.2.11 with 3 OSD servers, 1 mon/mgr, 1 gateway (all virtual
on SSD)
US Production (HDD): Jewel 10.2.11 with 5 OSD servers, 3 mons, 3 gateways
behind an haproxy LB
UK Production (HDD): Luminous 12.2.11 with 15 OSD servers, 3 mons/mgrs, 3
gateways behind an haproxy LB
US Production (SSD): Luminous 12.2.11 with 6 OSD servers, 3 mons/mgrs, 3
gateways behind an haproxy LB
Try lowering the `osd_recovery_sleep*` options.

You can get your current values from the Ceph admin socket like this:

```
ceph daemon osd.0 config show | jq 'to_entries[] |
  if (.key | test("^(osd_recovery_sleep)(.*)")) then (.) else empty end'
```
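If any of them come back nonzero, they can be injected at runtime the same
way as the backfill settings; for example (the value here is illustrative,
and on Luminous `osd_recovery_sleep_ssd` already defaults to 0):

```
ceph tell 'osd.*' injectargs '--osd_recovery_sleep_ssd 0'
```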
k