On Tue, May 14, 2013 at 06:35:26AM +0000, James Harper wrote:
>What can cause these unrecoverable read errors? Is losing power mid-write 
>enough to cause this to happen? Or maybe a knock while writing? I grabbed 
>these 1TB disks out of a few old PC's and NAS's I had lying around the place 
>so their history is entirely uncertain. I definitely can't tell if they were 
>already present when I started using ceph on them.

it's best to think of disks as analogue devices pretending to be
digital. often they can't read a marginal sector one day and then it's
fine again the next day. some sectors come and go like this
indefinitely, while others are bad enough that they're remapped and you
never have an issue with them again. if the disk as a whole is bad
enough then you run out of spare sectors to do remapping with, and the
disk is dead. in my experience disks usually become unusable (slow,
erratic, hanging drivers, etc.) before they run out of spare sectors.
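to make that lifecycle concrete, here's a toy python model of it — all the numbers (20 flaky sectors, 50% failure rate, remap after 3 consecutive failures) are made up for illustration, not real drive firmware behaviour:

```python
import random

class ToyDisk:
    """toy model of a disk with marginal sectors and a spare pool.
    made-up numbers for illustration, not real drive behaviour."""

    def __init__(self, sectors=1000, spares=20, seed=0):
        self.rng = random.Random(seed)
        self.marginal = set(self.rng.sample(range(sectors), 20))  # flaky sectors
        self.remapped = set()       # sectors moved to the spare pool
        self.fail_counts = {}       # consecutive failures per sector
        self.spares_left = spares

    def read(self, sector):
        """return True on success, False on an unrecoverable read error."""
        if sector in self.remapped:
            return True                         # remapped sectors read fine
        if sector in self.marginal and self.rng.random() < 0.5:
            n = self.fail_counts.get(sector, 0) + 1
            self.fail_counts[sector] = n
            if n >= 3:                          # firmware gives up and remaps
                if self.spares_left == 0:
                    raise IOError("disk dead: spare sectors exhausted")
                self.spares_left -= 1
                self.remapped.add(sector)
            return False                        # this read still failed
        self.fail_counts[sector] = 0            # a good read resets the count
        return True

disk = ToyDisk()
# scrub the whole disk five times: some marginal sectors fail a read or
# two and then succeed, the worst ones get remapped and stay fixed.
errors = sum(not disk.read(s) for s in range(1000) for _ in range(5))
```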

with today's disk capacities this is just what you have to expect, and
software needs to be able to deal with it.

silent data corruption is a much, much rarer and nastier problem, and is
the motivation behind the checksums in zfs, btrfs, xfs metadata, etc.

>Is Linux MD software smart enough to rewrite a bad sector with good data to 
>clear this type of error (keeping track of error counts to know when to eject 
>the disk from the array)?

yes. on a redundant array md reconstructs the bad block from the other
disks, writes it back in place (giving the drive a chance to remap the
sector), and keeps a per-disk count of corrected errors; a disk that
keeps erroring gets kicked from the array.
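a rough sketch of that policy in python, over a 2-way mirror — simplified and hypothetical (real md works at the block layer, and the threshold name and value here are mine, not md's), but it shows the reconstruct-rewrite-count-eject logic:

```python
# sketch of md-style recover-in-place over a 2-way mirror.
MAX_READ_ERRORS = 20          # made-up threshold for illustration

class Mirror:
    def __init__(self, disk_a, disk_b):
        self.disks = [disk_a, disk_b]   # dicts: sector -> data, None = read error
        self.errors = [0, 0]            # corrected-error count per disk
        self.failed = [False, False]

    def read(self, sector):
        for i, disk in enumerate(self.disks):
            if self.failed[i]:
                continue
            data = disk.get(sector)
            if data is not None:
                return data
            # read error: reconstruct from the other half of the mirror,
            # then rewrite in place so the drive can remap the sector.
            other = self.disks[1 - i].get(sector)
            if other is not None:
                disk[sector] = other          # the in-situ rewrite
                self.errors[i] += 1
                if self.errors[i] > MAX_READ_ERRORS:
                    self.failed[i] = True     # too flaky: eject from array
                return other
        raise IOError("sector %d lost on both disks" % sector)

a = {0: b"x", 1: None, 2: b"z"}   # disk a has an unreadable sector 1
b = {0: b"x", 1: b"y", 2: b"z"}
m = Mirror(a, b)
assert m.read(1) == b"y"          # served from the good disk
assert a[1] == b"y"               # and rewritten back onto the bad one
```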

>What about btrfs/zfs? Trickier with something like ceph where ceph runs on top 
>of a filesystem which isn't itself redundant...

all raid-like things need to deal with the expected 1-10% of real disk
failures a year. depending on the implementation they can either turn
these soft, recoverable semi-failure scenarios into outright disk
failures, or (like md does) try hard to recover the disk and data
in situ with re-writes and timeouts. the problem with kicking a disk out
at the first simple error is that a full rebuild involves a lot of i/o
and so is asking for a second failure.
ideally it would be the user's call whether the raid-like layer tries
hard or just fails the disk out straight away, depending on the
seriousness of the error, the current redundancy level, disk
characteristics, how valuable the data is, whether i/o is latency
sensitive, whether the data is backed up, etc., but that does seem quite
complicated :-)

as ceph is pitched at non-raid devices I would assume it must have
'filesystem gone read-only' detection (i.e. the fs got a read error
from a disk) as well as 'disk/node hung or stopped' timeout detection.
these are coarse but probably effective techniques. hopefully there is
then something automated to dd /dev/zero over such disks (and rebuild
the fs's and re-add them to the active pool on probation), otherwise
it'll be a lot of work to track down each disk and do that manually.
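the dd step is easy to script. here's a python equivalent, run against a scratch file standing in for the real device so it's safe to try anywhere (on a real disk the path would be something like /dev/sdX, and you'd follow up with a mkfs and re-adding the osd on probation):

```python
import os

def zero_device(path, chunk=1 << 20):
    """overwrite path with zeros in chunks, like dd if=/dev/zero of=path."""
    size = os.path.getsize(path)
    zeros = bytes(chunk)
    with open(path, "r+b") as dev:
        left = size
        while left > 0:
            n = min(chunk, left)
            dev.write(zeros if n == chunk else bytes(n))
            left -= n
        dev.flush()
        os.fsync(dev.fileno())          # make sure it actually hit the media

# demo on a 1 MiB scratch image standing in for the suspect disk
with open("scratch-disk.img", "wb") as f:
    f.write(os.urandom(1 << 20))        # pretend this is the old data
zero_device("scratch-disk.img")
```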

cheers,
robin
_______________________________________________
luv-main mailing list
[email protected]
http://lists.luv.asn.au/listinfo/luv-main
