Hi,
Good news: the 3 repairs completed successfully a few hours after my
message. I was not too worried, but I preferred to check that we were
not missing something in this unusual situation. Generally we catch disk
problems before the disks get too ill or die suddenly.
The OSD_TOO_MANY_REPAIRS was the worrying part...
That said, I have 3 related questions to improve our understanding and
handling of similar events in the future:
- It seems the auto-repair was not done during the scrubs. Reading
https://docs.ceph.com/en/pacific/rados/operations/pg-repair again, I saw
that if the number of (scrub?) errors is greater than
osd_scrub_auto_repair_num_errors, the auto-repair is not performed. We
had between 20 and 30 scrub errors yesterday, after the broken OSD was
removed from Ceph: could that explain the absence of auto-repair? Is
this parameter better kept at its default (5), or would it make sense to
increase it to 20 or so? (See the commands sketched after these
questions.)
- I'm wondering if removing the broken OSD with 'ceph orch osd rm' was
the right decision in our situation, where we knew where the errors were
coming from and were confident that the other replicas (it is a
replica-3 pool) were correct. If I'm right, 'ceph orch osd rm' implies a
drain of the OSD. Before doing it, we set the primary affinity for this
OSD to 0, but was that enough to prevent it from being used as a source
in the resulting backfills? Wouldn't it have been better to just stop
the broken OSD to ensure this? (Also covered in the sketch below.)
- If we ensure that the broken OSD is not used as a source for the
backfills, can we hope that the backfills clear the problems (as they
replicate a good replica) even if the PG is still marked as possibly
inconsistent, and that the status will be good again after the next
scrub+repair?
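For the first two questions, here is roughly what I have in mind (just a
sketch of the standard commands; osd.12 / 12 is a made-up id standing
for the broken OSD, and 20 is only an example value):

   # Check whether auto-repair during scrub is enabled and what the
   # current error threshold is (default 5):
   ceph config get osd osd_scrub_auto_repair
   ceph config get osd osd_scrub_auto_repair_num_errors

   # Raise the threshold if that turns out to be appropriate:
   ceph config set osd osd_scrub_auto_repair_num_errors 20

   # What we did before removing the OSD:
   ceph osd primary-affinity osd.12 0

   # The alternative I am wondering about: stop the daemon first so it
   # cannot serve reads or act as a backfill source, then remove it:
   ceph orch daemon stop osd.12
   ceph orch osd rm 12 --replace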
Best regards,
Michel
On 18/09/2025 at 14:26, Enrico Bocchi wrote:
Hello Michel,
I would not get worried unless pg repair completes (it can take hours)
but the PG is still marked as inconsistent afterwards.
Naive question: have you actually checked that the failing drive is the
one you removed? There should be a line in the ceph cluster log that
reads "cluster [ERR] <pg_number> shard <osd_id>".
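Something like the following should find it (assuming the cluster log is
written to /var/log/ceph/ceph.log on a monitor host; with cephadm it may
live under /var/log/ceph/<fsid>/ or in journald instead):

   grep '\[ERR\]' /var/log/ceph/ceph.log | grep shard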
Cheers,
Enrico
On 9/18/25 12:39, Janne Johansson wrote:
One week ago we experienced problems with a dying OSD resulting in
OSD_TOO_MANY_REPAIRS, unfortunately unnoticed (it seems our monitoring
system is not reporting these errors properly). When we realized the
error, we removed the problematic OSD (ceph orch osd rm --replace),
despite the scrubbing errors: the resulting backfills succeeded but did
not fix the scrub errors. The colleague who took care of this problem
decided to launch a `ceph pg repair` on the 3 PGs with reported
inconsistencies, but it doesn't seem to converge. 'ceph -s' still
reports:
3 active+clean+scrubbing+deep+inconsistent+repair
after a few hours, and for at least one of the PGs there is the
following message every 3s:
Sep 18 12:30:55 ceph-76212 ceph-mon[2506]: osd.72 pg 11.e2d Deep scrub
errors, upgrading scrub to deep-scrub
Not sure if this is the sign of a problem or just because the operation
is ongoing. I'm looking for advice on what to do to move forward. Users
have not reported any impact so far, but that doesn't mean there is
none... The affected pool is storing RBD volumes (from OpenStack
Cinder).
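For reference, these are the usual commands to inspect the reported
inconsistencies (a sketch only; 11.e2d is the PG from the log above, and
<pool-name> is a placeholder):

   ceph health detail                        # lists the inconsistent PGs
   rados list-inconsistent-pg <pool-name>    # same information, per pool
   rados list-inconsistent-obj 11.e2d --format=json-pretty
   ceph pg repair 11.e2d                     # what we already launched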
Can't say how long it should take, but repairs can take a while. For me,
it usually takes a long while until the
"active+clean+scrubbing+deep+inconsistent+repair" status appears, then I
guess it is dependent on disk (and possibly wpq -vs- mclock?)
performance. I would stay calm for a while and let the cluster try to
get itself right.
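If you want to check which scheduler your OSDs are running, the standard
config query should do (just a sketch):

   ceph config get osd osd_op_queue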
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io