On 14/05/13 16:35, James Harper wrote:
> I've had a few disks fail with uncorrectable read errors just recently, and 
> in the past my process is that any disk with any sort of error gets discarded 
> and replaced, especially in a server. I did some reading though (see previous 
> emails about SMART vs actual disk failures) and read that simply writing back 
> over those sectors is often enough to clear the error and allow them to be 
> remapped, possibly extending the life of the disk, depending on the cause of 
> the error.
>
> In actual fact after writing the entire failed disk with /dev/zero the other 
> day, all the SMART attributes are showing a healthy disk - no pending 
> reallocations and no reallocated sectors, yet, so maybe it wrote over the bad 
> sector and determined it was good again without requiring a remap. I'm 
> deliberately using some old hardware to test ceph to see how it behaves in 
> various failure scenarios, and it has been pretty good so far despite 3 failed 
> disks over the few weeks I've been testing.
>
> What can cause these unrecoverable read errors? Is losing power mid-write 
> enough to cause this to happen? Or maybe a knock while writing? I grabbed 
> these 1TB disks out of a few old PCs and NASes I had lying around the place 
> so their history is entirely uncertain. I definitely can't tell if they were 
> already present when I started using ceph on them.
>
> Is Linux MD software smart enough to rewrite a bad sector with good data to 
> clear this type of error (keeping track of error counts to know when to eject 
> the disk from the array)? What about btrfs/zfs? Trickier with something like 
> ceph where ceph runs on top of a filesystem which isn't itself redundant...

A while back, when 4096-byte sectors went native, I had a disk reporting an 
error (I think it said CRC error) on one sector.  The interesting thing was 
that when I read the sector with "dd conv=noerror" I got 4096 bytes back, 7/8 
of which was clearly valid directory info (NTFS) and the remaining 512 bytes 
were garbage.  Go figure.

Writing that sector back cleared the read error, but it left a bit of damage 
to the file system, since those 512 bytes of dud info went back with it.
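
In case it helps anyone repeat the exercise, it was essentially this (not my 
exact commands: BADLBA and /dev/sdX are placeholders, and the eight-LBAs-per-
4096-byte-sector mapping needs checking against your own dmesg/SMART output 
before writing anything back):

  # BADLBA = failing 512-byte LBA as reported in dmesg or the SMART self-test log
  BADLBA=123456789
  # read the surrounding 4096-byte physical sector, padding the unreadable
  # part with zeros instead of aborting
  dd if=/dev/sdX of=/tmp/sector.bin bs=4096 skip=$((BADLBA / 8)) count=1 conv=noerror,sync
  # write it (or zeros) back over the same spot; the drive then either
  # verifies the sector in place or remaps it
  dd if=/tmp/sector.bin of=/dev/sdX bs=4096 seek=$((BADLBA / 8)) count=1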

Now to add to the strange error messages from drives, I'm getting this one:

  [  317.144766] EXT4-fs (sdb1): error count: 1
  [  317.144777] EXT4-fs (sdb1): initial error at 1345261136: ext4_find_entry:1209: inode 2
  [  317.144785] EXT4-fs (sdb1): last error at 1345261136: ext4_find_entry:1209: inode 2

sdb1 is mounted noatime, and this message turns up at around the same time 
after every boot.
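
For what it's worth, I believe those lines are ext4 reprinting counters stored 
in the superblock rather than reporting a new failure: 1345261136 looks like a 
Unix timestamp (some time in August 2012), which would explain why the same 
message turns up at the same point after every boot.  Something along these 
lines should show the stored fields (dumpe2fs is part of e2fsprogs; the device 
name is just the one from the log above):

  # dump the superblock header and pick out the persistent error fields
  # (error count, first/last error time, function, line, inode)
  dumpe2fs -h /dev/sdb1 | grep -i error

As far as I know a successful e2fsck run is supposed to reset those counters, 
though I haven't checked which e2fsprogs versions actually do that.
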
SMART tests and file system checks pass, so I guess I'll just have to dump the 
entire 1TB+ disk to /dev/null to see if that trips anything useful.
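
Probably something along these lines (read-only, so it should be safe, just 
slow; badblocks here is the stock one from e2fsprogs):

  # read every sector and throw the data away; unreadable sectors show up
  # as I/O errors here and in dmesg (conv=noerror keeps going past them)
  dd if=/dev/sdb of=/dev/null bs=1M conv=noerror
  # or the same idea with progress output and per-block reporting
  badblocks -sv /dev/sdb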
