> 
> On Thu, Apr 11, 2013 at 02:10:37AM +0000, James Harper wrote:
> >
> > > with disks (and raid arrays) of that size, you also have to be concerned
> > > about data errors as well as disk failures - you're pretty much
> > > guaranteed to get some, either unrecoverable errors or, worse, silent
> > > corruption of the data.
> >
> > Guaranteed over what time period?
> 
> any time period.  it's a function of the quantity of data, not of time.
> 
> > It's easy to fault your logic as I just did a full scan of my array
> > and it came up clean.
> 
> no, it's not.  your array scan checks for DISK errors.  It does not check for
> data corruption - THAT is the huge advantage of filesystems like ZFS and
> btrfs, they can detect and correct data errors

This is the md 'check' function, which compares the two copies of the data 
against each other. If there were corruption in my RAID1, it's incredibly 
unlikely that it would have occurred identically on both disks and registered 
as a match - at least for disk-based corruption.
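To make the distinction concrete, here's a toy sketch (hypothetical Python, 
with SHA-256 standing in for whatever checksum the filesystem actually uses): 
a mirror scrub can only tell you whether the copies *agree*, while a checksum 
recorded at write time can catch corruption even when every copy carries the 
same bad bits.

```python
import hashlib

def scrub_mirror(copy_a: bytes, copy_b: bytes) -> bool:
    # md-style RAID1 check: only reports whether the two copies agree.
    # If the corruption happened before the write (controller, cabling,
    # RAM), both mirrors hold the same bad data and the scrub is "clean".
    return copy_a == copy_b

def scrub_checksummed(data: bytes, stored_digest: str) -> bool:
    # ZFS/btrfs-style check: verify the data against a hash recorded
    # at write time, so end-to-end corruption is detectable.
    return hashlib.sha256(data).hexdigest() == stored_digest

good = b"important data"
digest = hashlib.sha256(good).hexdigest()
bad = b"importAnt data"  # corruption written identically to both mirrors

print(scrub_mirror(bad, bad))          # True  - corruption missed
print(scrub_checksummed(bad, digest))  # False - corruption caught
```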

> > If you say you are "guaranteed to get some" over, say, a 10 year
> > period, then I guess that's fair enough. But as you don't specify a
> > timeframe I can't really contest the point.
> 
> you seem to be confusing data corruption with MTBF or similar, it's
> not like that at all. it's not about disk hardware faults, it's about
> the sheer size of storage arrays these days making it a mathematical
> certainty that some corruption will occur - write errors due to, e.g.,
> random bit-flips, controller brain-farts, firmware bugs, cosmic rays,
> and so on.
> 
> e.g. a typical quoted rating of 1 error per 10^14 bits is one error per
> 12 terabytes - i.e. your four x 3TB array is guaranteed to have at least
> one error in the data.

Not according to my visible history of parity checks of the underlying data 
(when it was 4 x 1.5TB - last 3TB disk still on order). I will be monitoring it 
more closely now though!
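For what it's worth, the arithmetic behind the quoted 10^14 figure is easy to 
check (a sketch only - vendor URE specs are worst-case per-bits-read rates, 
not a guarantee about data sitting on the platters):

```python
import math

BITS_PER_TB = 8 * 10**12  # decimal terabytes, as drive vendors count them

def expected_errors(terabytes, ber=1e-14):
    # Expected number of unrecoverable errors when reading the whole
    # array once, treating the spec as a per-bit-read probability.
    return terabytes * BITS_PER_TB * ber

lam = expected_errors(4 * 3)  # the 4 x 3TB array
print(lam)                    # 0.96 expected errors per full read

# Assuming errors are independent (Poisson), the chance of at least
# one error is 1 - e^-lambda - substantial, but not a "guarantee":
print(round(1 - math.exp(-lam), 2))  # ~0.62
```

So on those (simplified) assumptions a full read of a 12TB array has roughly 
a 60% chance of hitting an unrecoverable error, which supports "likely" rather 
than "guaranteed".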

> 
> > I can say though that I do monitor the SMART values which do track
> > corrected and uncorrected error rates, and by extrapolating those
> > figures I can say with confidence that there is not a guarantee of
> > unrecoverable errors.
> 
> smart values really only tell you about detected errors in the drive
> itself. they don't tell you *anything* about data corruption problems -
> for that, you actually need to check the data...and to check the data
> you need a redundant copy or copies AND a hash of what it's supposed to
> be.
> 

Not entirely true. It reports correctable errors, first-read-uncorrectable 
errors that were corrected on re-read, etc. For an undetected disk read error 
to occur (eg one that still passed ECC or whatever correction codes are used), 
there would, statistically speaking, need to be significant quantities of the 
former.

I wonder if the undetected error rates differ with the 4K sector disks? That is 
supposed to be one of the other advantages.

Of course that still doesn't detect errors that occur beyond the disk (eg pci, 
controller or cabling), so I guess your point still stands.

> with mdadm, such errors can only be corrected if the data can be
> rewritten to the same sector or if the drive can remap a spare sector to
> that spot. with zfs, because it's a COW filesystem all that needs to be
> done is to rewrite the data.

Correct. It can be detected though.
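A rough sketch of the repair loop being described (hypothetical - `self_heal` 
and the SHA-256 checksum are illustrative, not ZFS's actual code path): with a 
recorded checksum you can identify which replica is good and rewrite the bad 
ones from it, whereas a bare mirror mismatch leaves you guessing.

```python
import hashlib

def self_heal(copies, stored_digest):
    # Find a replica matching the checksum recorded at write time,
    # then rewrite the corrupt replicas from it. Returns the good
    # data, or None if no replica verifies (data lost).
    for copy in copies:
        if hashlib.sha256(copy).hexdigest() == stored_digest:
            for j in range(len(copies)):
                copies[j] = copy  # repair the bad replicas in place
            return copy
    return None

digest = hashlib.sha256(b"good data").hexdigest()
mirrors = [b"gXod data", b"good data"]  # one replica corrupted
self_heal(mirrors, digest)
print(mirrors)  # both replicas now hold b"good data"
```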

> ...

Thanks for taking the time to write out that stuff about ZFS. I'm somewhat 
wiser about it all now :)

James
_______________________________________________
luv-main mailing list
[email protected]
http://lists.luv.asn.au/listinfo/luv-main