Hi,

One week ago we had problems with a dying OSD, resulting in OSD_TOO_MANY_REPAIRS, which unfortunately went unnoticed (it seems our monitoring system does not report these errors properly). When we noticed the error, we removed the problematic OSD (ceph orch osd rm --replace) despite the scrub errors: the resulting backfills completed but did not fix the scrub errors. The colleague who took care of this problem then launched a `ceph pg repair` on the 3 PGs with reported inconsistencies, but it doesn't seem to converge (I sketch the commands further down). `ceph -s` still reports:

             3    active+clean+scrubbing+deep+inconsistent+repair

after a few hours, and for at least one of the PGs the following message appears every 3 seconds:

Sep 18 12:30:55 ceph-76212 ceph-mon[2506]: osd.72 pg 11.e2d Deep scrub errors, upgrading scrub to deep-scrub

I'm not sure whether this is a sign of a problem or just means the operation is still ongoing. I'm looking for advice on how to move forward. Users have not reported any impact yet, but that doesn't mean there is none... The affected pool stores RBD volumes (from OpenStack Cinder).
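
For reference, this is roughly the sequence of commands that was run (the OSD ID is a placeholder, and 11.e2d is the only one of the three PG IDs I have at hand):

    # replace the failing OSD, keeping its ID for the replacement disk
    ceph orch osd rm <osd_id> --replace

    # list the PGs reported as inconsistent
    ceph health detail | grep inconsistent

    # ask for a repair on each of the 3 PGs, e.g.:
    ceph pg repair 11.e2d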

A side issue: we have `osd_scrub_auto_repair=true`, so I'd expect the repair to start automatically, but that doesn't seem to have been the case...
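
To double-check that side of things, I'm reading the settings back like this (just the queries, not our actual output; if I understand the docs correctly, auto repair is skipped when a scrub finds more errors than osd_scrub_auto_repair_num_errors, which might be what happened here):

    ceph config get osd osd_scrub_auto_repair
    # threshold above which an automatic repair is not attempted
    ceph config get osd osd_scrub_auto_repair_num_errors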

Thanks in advance for any advice. Best regards,

Michel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
