I've had a few disks fail with uncorrectable read errors just recently, and in
the past my process has been that any disk with any sort of error gets
discarded and replaced, especially in a server. Some reading, though (see
previous emails about SMART vs actual disk failures), suggested that simply
writing back over those sectors is often enough to clear the error and let
them be remapped, possibly extending the life of the disk, depending on the
cause of the error.

In fact, after writing /dev/zero over the entire failed disk the other day,
all the SMART attributes show a healthy disk - no pending reallocations and
no reallocated sectors yet - so maybe the drive verified the sector was good
again on the overwrite, without requiring a remap. I'm deliberately using
some old hardware to test ceph and see how it behaves in various failure
scenarios, and it has been pretty good so far despite three failed disks over
the few weeks I've been testing.
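Incidentally, the rewrite trick shouldn't need a full zero-fill - a single dd
over the offending LBA ought to be enough. A minimal sketch, using a scratch
image file in place of the real device (the filename, sector size and LBA
here are made-up placeholders; against a real /dev/sdX this write is
destructive):

```shell
# Rewrite a single 512-byte sector at a given LBA.
# Using a scratch image here; on real hardware the target would be the
# device node, and the LBA would come from the kernel log or smartctl.
LBA=123456
truncate -s 1G disk.img          # stand-in for the real device
dd if=/dev/zero of=disk.img bs=512 seek="$LBA" count=1 conv=notrunc status=none
# After the write, Current_Pending_Sector in `smartctl -A` should drop
# if the drive remapped the sector or verified it on the rewrite.
```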

What can cause these unrecoverable read errors? Is losing power mid-write
enough to cause one? Or maybe a knock while writing? I grabbed these 1TB
disks out of a few old PCs and NASes I had lying around the place, so their
history is entirely uncertain. I certainly can't tell whether the errors were
already there before I started using ceph on the disks.

Is Linux MD smart enough to rewrite a bad sector with good data from the
redundant copy to clear this type of error (while keeping track of error
counts so it knows when to eject the disk from the array)? What about
btrfs/zfs? It seems trickier with something like ceph, where each OSD runs on
top of a filesystem that isn't itself redundant...
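As I understand it, MD does this during a scrub (writing "check" or "repair"
to /sys/block/mdX/md/sync_action): a sector that fails to read gets rewritten
from the redundant copy. A toy illustration of the idea, with plain files
standing in for the two legs of a mirror - nothing here touches MD or real
devices:

```shell
# Toy model of a mirror scrub/repair: two "legs" as files, one leg
# gets a bad (zeroed) sector, and repair copies the good leg's sector
# back over it.
dd if=/dev/urandom of=leg0.img bs=512 count=8 status=none
cp leg0.img leg1.img
# Corrupt sector 3 on leg1, simulating an unreadable/pending sector.
dd if=/dev/zero of=leg1.img bs=512 seek=3 count=1 conv=notrunc status=none
cmp -s leg0.img leg1.img || echo "mismatch detected"
# "Repair": rewrite just that sector from the redundant copy.
dd if=leg0.img of=leg1.img bs=512 skip=3 seek=3 count=1 conv=notrunc status=none
cmp -s leg0.img leg1.img && echo "legs consistent again"
```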

Thanks

James

_______________________________________________
luv-main mailing list
[email protected]
http://lists.luv.asn.au/listinfo/luv-main