> One week ago we experienced problems with a dying OSD resulting in
> OSD_TOO_MANY_REPAIRS, which unfortunately went unnoticed (it seems our
> monitoring system is not notifying us of these errors properly). When
> we realized the error, we removed the problematic OSD (ceph orch osd rm
> --replace) despite the scrubbing errors: the resulting backfills
> succeeded but did not fix the scrub errors. The colleague who took care
> of this problem decided to launch a `ceph pg repair` on the 3 PGs with
> reported inconsistencies, but it doesn't seem to converge. 'ceph -s'
> still reports:
>
>               3    active+clean+scrubbing+deep+inconsistent+repair
>
> after a few hours, and for at least one of the PGs the following
> message appears every 3 seconds:
>
> Sep 18 12:30:55 ceph-76212 ceph-mon[2506]: osd.72 pg 11.e2d Deep scrub
> errors, upgrading scrub to deep-scrub
>
> Not sure if this is the sign of a problem or just means the operation
> is still ongoing. I'm looking for advice on how to move forward. Users
> have not reported any impact yet, but that doesn't mean there is
> none... The affected pool stores RBD volumes (from OpenStack Cinder).
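
Before anything else, it may be worth looking at what exactly the scrub
flagged for those 3 PGs, so you know which objects and OSDs are involved
and what kind of error it is (e.g. a read error or a digest mismatch).
A rough sketch, reusing the PG id from your log and a placeholder for
the Cinder pool name:

    # scrub error summary as the cluster sees it
    ceph health detail
    # list the inconsistent PGs in the pool
    rados list-inconsistent-pg <cinder-pool-name>
    # per-object detail for one inconsistent PG
    rados list-inconsistent-obj 11.e2d --format=json-pretty

The JSON output should show on which OSD(s) the bad copies live, which
helps judge whether the surviving replicas are healthy.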

Can't say how long it should take, but repairs can take a while. For
me, it usually takes a long while just for the
"active+clean+scrubbing+deep+inconsistent+repair" status to appear;
after that I guess the duration depends on disk performance (and
possibly wpq -vs- mclock?). I would stay calm for a while and let the
cluster try to get itself right.
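
If you want to rule out the scheduler, you can check which op queue the
OSDs are running and keep an eye on the scrubber state of the affected
PG while the repair runs. A rough sketch (PG id taken from your log,
adjust as needed; recent releases report a scrubber section in the pg
query output):

    # "wpq" or "mclock_scheduler"
    ceph config get osd osd_op_queue
    # current scrub/repair state of the PG
    ceph pg 11.e2d query | grep -A 10 scrubber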


-- 
May the most significant bit of your life be positive.