> One week ago we ran into problems with a dying OSD resulting in
> OSD_TOO_MANY_REPAIRS, which unfortunately went unnoticed (it seems our
> monitoring system is not reporting these errors properly). When we
> realized the error, we removed the problematic OSD (ceph orch osd rm
> --replace) despite the scrubbing errors: the resulting backfills
> succeeded but did not fix the scrub errors. The colleague who took care
> of this problem decided to launch a `ceph pg repair` on the 3 PGs with
> reported inconsistencies, but it doesn't seem to converge. 'ceph -s'
> still reports:
>
> 3 active+clean+scrubbing+deep+inconsistent+repair
>
> after a few hours, and for at least one of the PGs the following message
> appears every 3s:
>
> Sep 18 12:30:55 ceph-76212 ceph-mon[2506]: osd.72 pg 11.e2d Deep scrub
> errors, upgrading scrub to deep-scrub
>
> I'm not sure whether this is the sign of a problem or just means the
> operation is still ongoing. I'm looking for advice on how to move
> forward. Users have not yet reported any impact, but that doesn't mean
> there is none... The affected pool stores RBD volumes (from OpenStack
> Cinder).
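In the meantime it is worth looking at exactly what the scrub found, so you know which objects and OSDs are involved. Nothing cluster-specific below, just the standard inspection commands, something like (PG 11.e2d is taken from your log line; repeat for the other two inconsistent PGs):

    # list the PGs that are currently flagged inconsistent
    ceph health detail | grep -i inconsistent

    # show which objects/shards failed the deep scrub in that PG
    # (may report no scrub information if a new scrub has started since)
    rados list-inconsistent-obj 11.e2d --format=json-pretty

    # check the PG's current scrub/repair state and its acting set
    ceph pg 11.e2d query | grep '"state"'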
Can't say how long it should take, but repairs can take a while. For me, it usually takes a long while until the "active+clean+scrubbing+deep+inconsistent+repair" status even appears; after that I guess it is dependent on disk performance (and possibly wpq vs. mclock?). I would stay calm for a while and let the cluster try to get itself right.
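That said, if it really does not seem to converge, it might be worth checking which op queue scheduler those OSDs are running, since mclock throttles background work like scrub and repair differently from wpq. A quick check, using osd.72 from the log line above (option names assume a reasonably recent release):

    # which scheduler the OSD uses: wpq or mclock_scheduler
    ceph config show osd.72 osd_op_queue

    # if it is mclock_scheduler, see which profile is active
    ceph config show osd.72 osd_mclock_profile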