Hi,
One week ago we had problems with a dying OSD that resulted in
OSD_TOO_MANY_REPAIRS, which unfortunately went unnoticed (it seems our
monitoring system does not report these errors properly). When we
realized the error, we removed the problematic OSD (ceph orch osd rm
--replace) despite the scrub errors: the resulting backfills succeeded
but did not fix the scrub errors. The colleague who took care of this
problem then launched a `ceph pg repair` on the 3 PGs with reported
inconsistencies, but it doesn't seem to converge. 'ceph -s' still reports:
3 active+clean+scrubbing+deep+inconsistent+repair
after a few hours, and for at least one of the PGs there is the
following message every 3s:
Sep 18 12:30:55 ceph-76212 ceph-mon[2506]: osd.72 pg 11.e2d Deep scrub
errors, upgrading scrub to deep-scrub
I'm not sure whether this is the sign of a problem or just a
consequence of the ongoing operation. I'm looking for advice on how to
move forward. Users have not reported any impact so far, but that
doesn't mean there is none... The affected pool stores RBD volumes
(from OpenStack Cinder).
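For reference, the steps taken so far were roughly the following (the
osd id and pg ids below are placeholders, not the real ones):

    # remove the failing OSD and mark it for replacement
    ceph orch osd rm <osd-id> --replace
    # then, for each of the 3 PGs listed as inconsistent by 'ceph health detail':
    ceph pg repair <pg-id>
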
A side issue is that we have 'osd_scrub_auto_repair=true', so I'd
expect the repair to start automatically, but that doesn't seem to have
been the case...
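In case it matters, we checked the value with something like the
following (though I may be missing a per-daemon override):

    # cluster-wide value from the config database
    ceph config get osd osd_scrub_auto_repair
    # or the value a specific daemon is actually using, including overrides
    ceph config show osd.<id> osd_scrub_auto_repair
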
Thanks in advance for any advice. Best regards,
Michel