On 14/05/13 16:35, James Harper wrote:
> I've had a few disks fail with uncorrectable read errors just recently, and
> in the past my process has been that any disk with any sort of error gets
> discarded and replaced, especially in a server. I did some reading though
> (see previous emails about SMART vs actual disk failures) and read that
> simply writing back over those sectors is often enough to clear the error
> and allow them to be remapped, possibly extending the life of the disk,
> depending on the cause of the error.
>
> In actual fact, after writing over the entire failed disk with /dev/zero
> the other day, all the SMART attributes are showing a healthy disk - no
> pending reallocations and no reallocated sectors yet - so maybe it wrote
> over the bad sector and determined it was good again without requiring a
> remap. I'm deliberately using some old hardware to test ceph to see how it
> behaves in various failure scenarios, and it has been pretty good so far
> despite 3 failed disks over the few weeks I've been testing.
>
> What can cause these unrecoverable read errors? Is losing power mid-write
> enough to cause this to happen? Or maybe a knock while writing? I grabbed
> these 1TB disks out of a few old PCs and NASes I had lying around the
> place, so their history is entirely uncertain. I definitely can't tell
> whether the errors were already present when I started using ceph on them.
>
> Is Linux MD software smart enough to rewrite a bad sector with good data
> to clear this type of error (keeping track of error counts to know when to
> eject the disk from the array)? What about btrfs/zfs? It's trickier with
> something like ceph, where ceph runs on top of a filesystem which isn't
> itself redundant...
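For anyone who wants to repeat the exercise, something like this is all it
takes (a sketch only - /dev/sdX is a placeholder, and the dd pass destroys
everything on the disk):

    # Rewrite every sector so the drive firmware can verify each one
    # and remap any that are genuinely bad. DESTROYS ALL DATA on /dev/sdX.
    dd if=/dev/zero of=/dev/sdX bs=1M oflag=direct

    # Recheck the counters afterwards: Current_Pending_Sector should be
    # back to zero, and Reallocated_Sector_Ct shows whether a spare
    # sector was actually used.
    smartctl -A /dev/sdX | grep -Ei 'pending|realloc'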
A while back, when 4096-byte sectors went native, I had a disk with what I
think was a CRC error on one sector. The interesting thing was that when I
read the sector with "dd conv=noerror" I got 4096 bytes, 7/8 of which were
clearly valid directory info (NTFS) and 512 bytes of which were garbage. Go
figure. Writing the sector back cleared the read error, but there was a bit
of damage to the file system from the 512 bytes of dud info.

Now, to add to the strange error messages from drives, I'm getting this one:

[ 317.144766] EXT4-fs (sdb1): error count: 1
[ 317.144777] EXT4-fs (sdb1): initial error at 1345261136: ext4_find_entry:1209: inode 2
[ 317.144785] EXT4-fs (sdb1): last error at 1345261136: ext4_find_entry:1209: inode 2

sdb1 is mounted noatime, and this message turns up at about the same time
after every boot. SMART tests and file system checks pass, so I guess I'll
just have to dump the entire 1TB+ to /dev/null to see if that trips anything
useful.
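Going back to the single-sector trick above, for the archives it looks
something like this (SECTOR and /dev/sdX are placeholders - get the numbers
wrong and you'll overwrite good data):

    # Read the suspect 4096-byte sector, carrying on past the error;
    # conv=noerror,sync pads whatever couldn't be read with zeros, and
    # iflag=direct bypasses the page cache so the disk is really read.
    dd if=/dev/sdX of=sector.bin bs=4096 skip=SECTOR count=1 \
        iflag=direct conv=noerror,sync

    # Write it back; the drive rewrites the sector and either clears
    # its pending flag or remaps it to a spare.
    dd if=sector.bin of=/dev/sdX bs=4096 seek=SECTOR count=1 oflag=direct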
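On the ext4 messages: as far as I can tell those aren't a new error at all -
ext4 records the first/last error details in the superblock and replays them
shortly after each mount (which would explain the fixed time since boot),
and 1345261136 looks like a Unix timestamp. Something along these lines
should show, and then clear, the stored record:

    # Dump the error record ext4 keeps in the superblock
    # (needs a reasonably recent e2fsprogs).
    dumpe2fs -h /dev/sdb1 | grep -i error

    # The "initial error at" figure is seconds since the epoch:
    date -d @1345261136
    # -> a date in August 2012, so the error is months old

    # A forced check of the unmounted filesystem should reset the record:
    e2fsck -f /dev/sdb1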
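And the whole-disk read test is just (read-only, but expect it to take a few
hours on 1TB):

    # Read every sector and throw the data away; an unreadable sector
    # shows up as an I/O error from dd plus details in dmesg.
    dd if=/dev/sdb of=/dev/null bs=1M iflag=direct

    # Or badblocks, which reports the offending block numbers directly:
    badblocks -sv /dev/sdb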
