Hello,

We have a small cluster of 44 OSDs across 4 servers.

A few times a week, ceph health reports an inconsistent pg. The relevant
OSD's log always says "head candidate had a read error" and nothing more;
i.e. it's not a digest mismatch, just a plain I/O error. It's usually a
different OSD each time, so it doesn't point to a specific disk,
controller, or server.
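
For reference, this is roughly how I track it down each time (the pg id
2.1a7 below is a placeholder, and rados list-inconsistent-obj needs Jewel
or later):

    ceph health detail              # names the pg, e.g. "pg 2.1a7 is active+clean+inconsistent"
    rados list-inconsistent-obj 2.1a7 --format=json-pretty
                                    # shows which object/shard hit the read error
    grep 'head candidate had a read error' /var/log/ceph/ceph-osd.*.log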

Manually running a deep scrub on the pg succeeds, and ceph health goes back to 
normal.
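
Concretely, that is just (same placeholder pg id):

    ceph pg deep-scrub 2.1a7        # kick off the scrub
    ceph -w                         # wait for "2.1a7 deep-scrub ok" in the cluster log
    # if it had failed again, ceph pg repair 2.1a7 would be the next step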

As a test today, before scrubbing the pg I found the relevant object file
under /var/lib/ceph/osd/… and cat(1)ed it. The first read failed with an
Input/output error; a second attempt immediately afterwards succeeded.
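
In case anyone wants to repeat the test, this is roughly what I did (pool
and object names are placeholders, and the on-disk path assumes FileStore):

    ceph osd map rbd some-object                      # -> pg id and acting OSDs
    cd /var/lib/ceph/osd/ceph-12/current/2.1a7_head   # the primary's copy of the pg
    find . -name '*some-object*'                      # object file names are escaped on disk
    cat ./some-object* > /dev/null                    # 1st run: Input/output error; 2nd run: clean

The kernel log is worth checking at the same moment; a genuine medium
error usually leaves a trace there:

    dmesg | grep -iE 'blk_update_request|I/O error|ata[0-9]'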

These read errors are all on Samsung 850 Pro 2TB disks (the journals are
on separate enterprise SSDs). The SMART status is similar on all of them
and shows nothing out of the ordinary.
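
For completeness, these are the attributes I looked at (device name is a
placeholder):

    smartctl -a /dev/sdc | grep -iE 'reallocated|pending|uncorrect|crc'
    # 5 Reallocated_Sector_Ct, 197 Current_Pending_Sector,
    # 198 Offline_Uncorrectable and 199 UDMA_CRC_Error_Count are the usual
    # media/link suspects; all of them look clean here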

Has anyone else seen anything similar? Is this just the curse of
non-enterprise SSDs, or could something else be going on, e.g. an XFS
issue? Any suggestions as to what to look at would be welcome.
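
If it does turn out to be worth ruling out XFS, I assume an offline check
would look something like this (OSD id and device are placeholders; the
filesystem has to be unmounted first):

    systemctl stop ceph-osd@12
    umount /var/lib/ceph/osd/ceph-12
    xfs_repair -n /dev/sdc1         # -n = no-modify mode, report problems only
    # then remount and restart the OSD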

Many thanks,

Oliver.
