Hi everyone.
Help me settle a debate.

My coworker is seeing
OBJECT_UNFOUND and PG_DAMAGED (recovery_unfound) errors.
We both agree they are caused by bad drives.
The fix is to mark the drive as out, replace it and add it back in.
Whenever we see this error on Ceph we see corresponding read errors on
the physical drive.
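
For anyone who wants to check the same correlation on their own hosts, here is a minimal sketch, not a definitive tool. It only assumes the kernel message fragment quoted further down in this post ("critical medium error, dev sd..., sector 12345..."); the function name `count_medium_errors` and the regular expression are my own illustration, and the rest of the dmesg line may look different on your kernel.

```python
import re
from collections import Counter

# Assumption: the kernel log contains the fragment quoted in this post,
# e.g. "critical medium error, dev sdb, sector 12345".
MEDIUM_ERROR = re.compile(r"critical medium error, dev (\w+), sector (\d+)")


def count_medium_errors(dmesg_text):
    """Return a Counter mapping device name to the number of medium-error lines."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = MEDIUM_ERROR.search(line)
        if match:
            # group(1) is the device name (e.g. "sdb"); group(2) is the sector.
            counts[match.group(1)] += 1
    return counts
```

Feed it the text of `dmesg` from the host that carries the suspect OSD. Mapping a device name back to an OSD id is left out here, since that depends on how the OSDs were deployed.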

My position is this: the drive is bad, but there are two more copies.
Only 1 of the 3 drives has bad sectors preventing the data from being read,
which is what dmesg shows
(i.e. critical medium error, dev sd..., sector 12345...).
So there should not be an OBJECT_UNFOUND: Ceph can compare the remaining
two copies and, assuming the data matches,
recover on its own and move the data to another PG.
Or maybe OBJECT_UNFOUND and PG_DAMAGED are warnings, not errors.

My coworker says that because the OSD that failed was the primary OSD
responsible for coordinating the PG, and is therefore the "source of truth",
the cluster goes into an error state.

His argument doesn't make sense to me since there should be no single
point of failure,
but I'm also not sure about my argument since I don't know enough
about how Ceph works under the hood.

Thanks.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
