Hi,

One week ago we had problems with a dying OSD, resulting in OSD_TOO_MANY_REPAIRS, which unfortunately went unnoticed (it seems our monitoring system does not report these errors properly). When we noticed the error, we removed the problematic OSD (ceph orch osd rm --replace) despite the scrub errors: the resulting backfills completed but did not fix the scrub errors. The colleague who took care of this problem then launched a `ceph pg repair` on the 3 PGs with reported inconsistencies, but it doesn't seem to converge (I sketch the commands further down). `ceph -s` still reports:

             3    active+clean+scrubbing+deep+inconsistent+repair

after a few hours, and for at least one of the PGs the following message appears every 3 seconds:

Sep 18 12:30:55 ceph-76212 ceph-mon[2506]: osd.72 pg 11.e2d Deep scrub errors, upgrading scrub to deep-scrub

I'm not sure whether this is a sign of a problem or just means the operation is still ongoing. I'm looking for advice on how to move forward. Users have not reported any impact yet, but that doesn't mean there is none... The affected pool stores RBD volumes (from OpenStack Cinder).
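
For reference, this is roughly the sequence of commands that was run (the OSD ID is a placeholder, and 11.e2d is the only one of the three PG IDs I have at hand):

    # replace the failing OSD, keeping its ID for the replacement disk
    ceph orch osd rm <osd_id> --replace

    # list the PGs reported as inconsistent
    ceph health detail | grep inconsistent

    # ask for a repair on each of the 3 PGs, e.g.:
    ceph pg repair 11.e2d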

A side issue: we have `osd_scrub_auto_repair=true`, so I'd expect the repair to start automatically, but that doesn't seem to have been the case...
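
To double-check that side of things, I'm reading the settings back like this (just the queries, not our actual output; if I understand the docs correctly, auto repair is skipped when a scrub finds more errors than osd_scrub_auto_repair_num_errors, which might be what happened here):

    ceph config get osd osd_scrub_auto_repair
    # threshold above which an automatic repair is not attempted
    ceph config get osd osd_scrub_auto_repair_num_errors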

Thanks in advance for any advice. Best regards,

Michel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
