Hi,

Good news: the 3 repairs completed successfully a few hours after my message. I was not worrying too much, but I preferred to check that we were not missing something in this unusual situation. Generally we catch disk problems before the disks get too ill or die suddenly; the OSD_TOO_MANY_REPAIRS warning was the worrying part...

That said, I have 3 related questions to improve our understanding and handling of similar events in the future:

- It seems the auto-repair was not done during the scrubs. Reading https://docs.ceph.com/en/pacific/rados/operations/pg-repair again, I saw that auto-repair is skipped if the number of (scrub?) errors is greater than osd_scrub_auto_repair_num_errors. We had between 20 and 30 scrub errors yesterday, after the broken OSD was removed from Ceph: could that be the reason auto-repair did not happen? Is this parameter better kept at its default (5), or would it make sense to increase it to 20 or so? (I sketched the relevant commands after this list.)

- I'm wondering if removing the broken OSD with 'ceph orch osd rm' was the right decision in our situation, where we knew where the errors were coming from and were confident that the other replicas (it is a replica-3 pool) were correct. If I'm right, 'ceph orch osd rm' implies a drain of the OSD. Before doing it, we set the primary affinity of this OSD to 0, but was that enough to prevent it from being used as a source in the resulting backfills? Wouldn't it have been better to simply stop the broken OSD to ensure this (also sketched below)?

  - If we ensure that the broken OSD is not used as a source for the backfills, can we hope that the backfills clear the problems (since they copy from a good replica), even though the PG is still marked as possibly inconsistent, and that the status will become good again after the next scrub+repair?
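
For reference, a minimal sketch of the commands I have in mind (the OSD id 72 is only an example taken from the log line further down, the value 20 is just what I would try, and I assume a cephadm-managed cluster for the orch command):

    # Check the current auto-repair threshold (default 5) and, if we decide so,
    # raise it so that 20-30 scrub errors still trigger auto-repair
    ceph config get osd osd_scrub_auto_repair_num_errors
    ceph config set osd osd_scrub_auto_repair_num_errors 20

    # Id of the broken OSD (example value, to be replaced)
    OSD_ID=72

    # Make sure this OSD is no longer selected as primary for its PGs
    ceph osd primary-affinity osd.${OSD_ID} 0

    # Or take it out of service entirely so its data cannot be read at all:
    ceph osd out ${OSD_ID}
    ceph orch daemon stop osd.${OSD_ID}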

Best regards,

Michel

On 18/09/2025 at 14:26, Enrico Bocchi wrote:
Hello Michel,

I would not get worried unless pg repair completes (it can take hours) but the pg is still marked as inconsistent afterwards. Naive question: have you actually checked that the failing drive is the one you removed? There should be a line in the ceph log like "cluster [ERR] <pg_number> shard <osd_id>".
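
For example, something like this should show which shard carries the errors (the pool name is a placeholder; 11.e2d is the PG id from the log further down):

    # PGs with inconsistencies in a given pool
    rados list-inconsistent-pg <pool-name>

    # Per-object detail, including which shard/OSD reports the errors
    rados list-inconsistent-obj 11.e2d --format=json-pretty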

Cheers,
Enrico


On 9/18/25 12:39, Janne Johansson wrote:
One week ago we experienced problems with a dying OSD resulting in
OSD_TOO_MANY_REPAIRS, which unfortunately went unnoticed (it seems our
monitoring system is not notifying these errors properly). When we
realized the error, we removed the problematic OSD (ceph orch osd rm
--replace) despite the scrubbing errors: the resulting backfills
succeeded but did not fix the scrub errors. The colleague who took care
of this problem decided to launch a `ceph pg repair` on the 3 PGs with
reported inconsistencies, but it doesn't seem to converge. 'ceph -s'
still reports:

               3 active+clean+scrubbing+deep+inconsistent+repair

after a few hours, and for at least one of the PGs there is the
following message every 3s:

Sep 18 12:30:55 ceph-76212 ceph-mon[2506]: osd.72 pg 11.e2d Deep scrub
errors, upgrading scrub to deep-scrub

Not sure if this is the sign of a problem or just because the operation
is ongoing. I'm looking for advice on what to do to move forward. Users
have not reported any impact yet, but that doesn't mean there is
none... The affected pool is storing RBD volumes (from OpenStack
Cinder).
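
For completeness, this is roughly the sequence used for the repair (a sketch only; 11.e2d is one of the 3 PGs, as an example):

    # List the PGs flagged as inconsistent
    ceph health detail | grep -i inconsistent

    # Ask for a repair (repeated for the 3 PGs)
    ceph pg repair 11.e2d

    # Watch the cluster log while the repair runs
    ceph -w
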
Can't say how long it should take, but repairs can take a while. For
me, it usually takes a long while until the
"active+clean+scrubbing+deep+inconsistent+repair" status appears, then
I guess it is dependent on disk (and possibly wpq -vs- mclock?)
performance. I would stay calm for a while and let the cluster try to
get itself right.

