[ceph-users] Re: Scrubbing behind after Quincy to Reef upgrade

Eugen Block via ceph-users Thu, 11 Jun 2026 14:00:21 -0700

I feel like this has been discussed multiple times on this list, I just don't have any links at hand. I suspect mclock settings; most likely some things changed between Quincy and Reef. You could fall back to wpq instead of mclock, it's still a general recommendation at the moment, and then see if anything improves.
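For reference, a minimal sketch of the wpq fallback mentioned above. It assumes the scheduler is selected through the osd_op_queue option and that OSDs only pick up the change after a restart. The helper names are hypothetical and only wrap the ceph CLI; the restart itself is left out because it depends on how the OSDs are deployed.

```shell
# Hypothetical helpers, not part of Ceph; they only wrap the ceph CLI.

# Show which op queue scheduler is currently configured for OSDs.
show_op_queue() {
    ceph config get osd osd_op_queue
}

# Switch OSDs to wpq. Assumption: OSDs must be restarted afterwards
# before the new scheduler is actually in use.
set_wpq() {
    ceph config set osd osd_op_queue wpq
}

# Switch back to mclock later (again followed by an OSD restart).
set_mclock() {
    ceph config set osd osd_op_queue mclock_scheduler
}
```

Restarting OSDs one failure domain at a time and watching the scrub warnings for a few days afterwards would show whether the scheduler is the cause.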


Quoting Reed Dier via ceph-users <[email protected]>:

Hello all,
TL;DR is that the number of concurrent PGs scrubbing, deep or otherwise, has appeared to increase by about 5-10x, while the number of PGs complaining that they haven't been scrubbed, deep or otherwise, has continued to tick higher.
HEALTH_WARN 217 pgs not deep-scrubbed in time; 187 pgs not scrubbed in time
Hoping there may be something that I must have missed in release notes and mailing list that explains why my scrubs both exploded in concurrency, as well as fell behind after upgrading from quincy (17.2.9) to reef (18.2.8).
Non-cephadm, U22.04, rather heterogeneous OSD hardware.
Mix of 8T and 2T HDD, as well as 2T SSD.
HDDs have NVMe WAL/DB of various sizes depending on when they were deployed.
Mix of replicated and EC pools, as well as some replicated pools across different device classes.
The vast majority of the PGs that are behind on scrubbing are on EC pools, and the vast majority of those are on our EC82 cephfs pool (40), which holds the bulk of our stored data; the other largest pool is an older EC73 cephfs pool (37).
My quick and dirty approximation based on PGs last scrubbed last month.
ceph pg dump | grep 2026-05 | awk '{print $1" "$27}' | grep -v periodic | cut -d '.' -f1 | sort | uniq -c
dumped all
      1 17
      1 20
    116 37
    224 40
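A variant of the count above that does not depend on grepping for a specific month could look like the sketch below. It assumes the input has already been reduced to two columns, a PG ID and an ISO 8601 scrub timestamp (for example by the same awk '{print $1" "$27}' step, where the column position is taken from the command above and may differ between releases). The function name is hypothetical.

```shell
# Hypothetical helper: reads "PGID TIMESTAMP" lines on stdin and prints,
# per pool, how many PGs have a timestamp older than the given cutoff.
# ISO 8601 timestamps sort lexicographically, so a string compare suffices.
count_stale_pgs_by_pool() {
    cutoff="$1"
    awk -v cutoff="$cutoff" '
        # Only rows whose first field looks like a PG ID (e.g. 40.1f);
        # appending "" forces a string comparison of the timestamps.
        $1 ~ /^[0-9]+\.[0-9a-f]+$/ && ($2 "") < (cutoff "") {
            split($1, parts, ".")
            count[parts[1]]++
        }
        END {
            for (pool in count) print pool, count[pool]
        }
    ' | sort -n
}
```

Usage would be along the lines of: ceph pg dump pgs | awk '{print $1" "$27}' | count_stale_pgs_by_pool 2026-06-01, which prints one "pool count" pair per line.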
I didn't make any changes to scrub intervals or mclock profiles before/during/after the upgrade.
ceph config dump | grep mclock_profile | awk '{print $4}' | uniq -c ;
    313 balanced
ceph config dump | grep scrub_interval
global  class:ssd  advanced  osd_deep_scrub_interval  604800.000000
mon                advanced  osd_deep_scrub_interval  604800.000000
mon.*              advanced  osd_deep_scrub_interval  604800.000000
mgr.*              advanced  osd_deep_scrub_interval  604800.000000
osd     class:hdd  advanced  osd_deep_scrub_interval  604800.000000
osd     class:ssd  advanced  osd_deep_scrub_interval  604800.000000
osd                advanced  osd_deep_scrub_interval  604800.000000
osd.*              advanced  osd_deep_scrub_interval  604800.000000
I've tried ceph tell osd.$osd osd_max_scrubs $more, which seems to somewhat momentarily drive the count of active+clean+scrubbing[+deep] PGs, but doesn't seem to make a demonstrative difference in terms of getting ahead in the number of PGs behind (number continues to grow).
I also looked at load15 across OSD hosts, and they don't appear to be anywhere near the 50% threshold of osd_scrub_load_threshold either, so I think I can rule that one out for now.
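As a rough cross-check of the load argument: as far as the Ceph documentation describes it, osd_scrub_load_threshold is compared against the load average divided by the number of online CPUs, not against the raw load average. The sketch below encodes that assumption; the function name is hypothetical, and the default of 0.5 mirrors the 50% figure mentioned above.

```shell
# Hypothetical helper: prints "below" if load / cpus is under the
# threshold, otherwise "above".
# Arguments: load average, CPU count, optional threshold (default 0.5).
scrub_load_check() {
    load="$1"; cpus="$2"; threshold="${3:-0.5}"
    awk -v l="$load" -v c="$cpus" -v t="$threshold" \
        'BEGIN { if (c > 0 && l / c < t) print "below"; else print "above" }'
}
```

On a Linux OSD host this could be called as scrub_load_check "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)", with the configured threshold read from ceph config get osd osd_scrub_load_threshold.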
I'm mostly curious why the change in behavior of concurrent scrubs ballooning, and yet the number of PGs behind on scrubbing ballooning as well, without anything actually changing.
And I'm also curious what tunables I can turn to get things back under control for scrubbing both short and long term as I look towards getting to squid and 24.04.
Is there an internal mechanism that triggers a deeper scrub during first deep scrub after upgrading a major release, reef or otherwise?
Included some graphs of scrub load over the last 60 and 365 day period to show prior scrub load that only exceedingly rarely ever generated a PG_NOT_[DEEP_]SCRUBBED warning, as well as raw load average (smallest cpu count is 16, and it doesn't even autoscale to 8, so nothing should be complaining there.)
https://imgur.com/a/rixNrCe

Appreciate any pointers anyone can steer me towards.
Reed



_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
