Hi everyone. Help me settle a debate. My coworker is seeing OBJECT_UNFOUND and PG_DAMAGED (recovery_unfound) errors. We both agree they are caused by bad drives. The fix is to mark the drive as out, replace it and add it back in. Whenever we see this error on Ceph we see corresponding read errors on the physical drive.
My argument: even though the drive is bad, there are two more copies. Only 1 of 3 drives has bad sectors preventing the data from being read, which is what dmesg shows, i.e. critical medium error, dev sd..., sector 12345... So there should be no OBJECT_UNFOUND: Ceph can compare the remaining two copies and, assuming the data matches, recover on its own and move the data to another PG. Or maybe OBJECT_UNFOUND and PG_DAMAGED are warnings, not errors.

My coworker's argument: because the primary OSD responsible for coordinating the PG was the one that failed, and it is the "source of truth", the cluster goes into an error state. That doesn't make sense to me, since there should be no single point of failure, but I'm also not sure about my own argument, since I don't know enough about how Ceph works under the hood.

Thanks.