FYI

---------- Forwarded message ----------
From: Zenaan Harkness
Date: Wed, 10 Jun 2015 11:50:48 +1000
Subject: 8TiB HDD, 10^14 bit error rate, approaching certainty of error for each "drive of data" read
To: d-community-offto...@lists.alioth.debian.org
Seems ZFS' and BTRFS' time has come. ZFS on Linux (ZFSoL) seems more stable to me, and has 10 years of deployment under its belt too. Any news on Debian GNU/Linux distributing ZFSoL? We see ZFS on Debian GNU/kFreeBSD being distributed by Debian...

FYI
Zenaan

---------- Forwarded message ----------
From: Zenaan Harkness
Date: Tue, 26 May 2015 20:31:41 +1000
Subject: Re: Thank Ramen for ddrescue!!!

On 5/25/15, Michael wrote:
> The LVM volumes on the external drives are ok.

Reminds me, also, that I've been reading heaps about zfs over the last couple of days. HDD error rates are close to biting us with current-gen filesystems (like ext4). Armour plate your arse with some ZFS- or possibly the less battle-tested BTRFS- armour.

At one URE (UnRecoverable Errors) rate in 10^14 bits read from a drive (most consumer drives are 10^14 - one advertises 2^15, and enterprise drives are usually 2^16), we're talking 1 bit flip, on average, in 10^14 bits read, whilst:

8TiB drive = 8 * 1024^4 * 8 bits = 70368744177664 bits

So if we read each bit once - say in a mirror recovery/disk rebuild situation, where the mirror disk has failed and a new one has been connected and refilled with the data of the sole surviving disk - we expect (8 * 1024^4 * 8) / 10^14 ≈ 0.7 unrecoverable errors, i.e. roughly even odds that that "whole disk read" (of the "good" disk) will itself produce an unrecoverable bit-flip error. And if you're using RAID hardware, you're now officially rooted - you can't rebuild your mirror (RAID1) disk array.

Now think about a 4-disk (8TiB disks) RAID5 array (one parity disk), and it's as good as an absolute certainty that when (not if) one disk fails in that array, you will simply never recover/rebuild the array, due to one of the remaining disks producing its own error - and at the point the first drive fails, the remaining drives are quite likely closer to failure anyway...

Concerning stuff for data junkies like myself.
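To put numbers on those two rebuild scenarios, here's a quick Python sketch. It assumes each bit read is an independent trial at the advertised URE rate (a simplification - real errors tend to cluster, and vendor spec sheets are rates, not promises):

```python
# Back-of-envelope check of the rebuild arithmetic: treat each bit read
# as an independent trial at the quoted URE rate.
import math

URE_RATE = 1e-14                # 1 unrecoverable error per 10^14 bits read
DRIVE_BITS = 8 * 1024**4 * 8    # 8 TiB drive in bits = 70368744177664

def rebuild_risk(bits_read, ure=URE_RATE):
    """Expected URE count, and probability of at least one URE."""
    expected = bits_read * ure
    p_at_least_one = 1 - math.exp(-expected)  # Poisson approximation
    return expected, p_at_least_one

# RAID1 rebuild: read the one surviving 8 TiB disk end to end.
e1, p1 = rebuild_risk(DRIVE_BITS)
print(f"mirror rebuild: expect {e1:.2f} UREs, P(>=1) = {p1:.0%}")

# 4-disk RAID5 rebuild: must read all 3 surviving 8 TiB disks.
e3, p3 = rebuild_risk(3 * DRIVE_BITS)
print(f"RAID5 rebuild:  expect {e3:.2f} UREs, P(>=1) = {p3:.0%}")
```

The mirror rebuild comes out at roughly even odds of at least one URE, and the RAID5 rebuild pushes towards ~90% - which is why "when, not if" is fair comment.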
Thus RAID6, RAID7, or better yet the ZFS solutions to this problem - RAIDZ2 and RAIDZ3 - where you have 2 or 3 parity disks respectively, and funky ZFS magic built in (disk scrubbing, hot spare disks and more, all on commodity consumer disks and dumb controllers). There, -any- 2 (or 3) disks in your "raid" set can fail and the set can still rebuild itself. Or, if it's just sectors failing (random bit flips), ZFS will automatically detect and repair those sectors, warn you in the logs that this is happening, and otherwise keep using a drive that's on the way out until you replace it.

See here to wake us all up:
http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/1/

(That second article slags ZFS with (what seems to me to be) a claim that ZFS COW (copy on write) functionality is per-file, not per-block, which AIUI is total bollocks - ZFS most certainly is a per-block COW filesystem, not per-file. But that's just a reflection of the bold assumptions and lack of fact checking of that article's author - otherwise I think the article is useful!)

Z

---------- Forwarded message ----------
From: Zenaan Harkness
Date: Tue, 26 May 2015 22:34:50 +1000
Subject: Re: Thank Ramen for ddrescue!!!

> On 26 May 2015 12:31, "Zenaan Harkness" wrote:
>> Reminds me, also that I've been reading heaps about zfs over the last
>> couple days, HDD error rates are close to biting us with current gen
>> filesystems (like ext4). Armour plate your arse with some ZFS- or
>> possibly the less battle tested BTRFS- armour.
>>
>> At one URE (UnRecoverable Errors) rate in 10^14 bits read from a drive
>> (most consumer drives are 10^14 - one advertises 2^15, and enterprise
>> drives are usually 2^16), we're talking 1 bit flip, on average, in
>> 10^14 bits read, whilst:
>>
>
> Base 10 or base 2? It's an order of magnitude of difference here, or one
> thousand more errors, so kinda a big deal...

Base 10. And the difference is much more than an order of magnitude:
2^14 = 16384
10^14 = 100000000000000

Unless I'm not understanding what you're asking...

For current HDDs: a 10^15 URE rate means an order of magnitude less likely to have a problem. 10^16, one order better again. The problem is, 10^14, with a 10T drive, is now at certainty - you are all but guaranteed a random unrecoverable read error on that drive every time you read it - or rather, every time you read a drive's worth of data off of that drive, which could be "quite a bit worse in practice" depending on your usage environment for the drive.

I believe the URE rate's been roughly the same since forever - the only "problem" is that we've gone from 10MB drives to (very soon) 10TB drives - i.e. a 6-orders-of-magnitude increase in storage capacity, with no corresponding improvement in the read error rate, or in that ballpark anyway.

Z

---------- Forwarded message ----------
From: Zenaan Harkness
Date: Wed, 27 May 2015 00:34:44 +1000
Subject: Re: Thank Ramen for ddrescue!!!

> On 05/26/2015 08:45 AM, Zenaan Harkness wrote:
>> ZFS is f*ing awesome! Even for a single drive that's large enough to
>> guarantee errors, ZFS makes the pain go away. I think BTRFS is
>> designed to have similar functionality - but it's got a ways to go yet
>> on various fronts, even though ultimately it may end up a "better"
>> filesystem than ZFS (but who knows).
>>
>> Z
>
> I guess that's Z for ZFS then eh? :)
>
> What about XFS?? It's being recommended on the Proxmox list as requiring
> less memory. I know next to nothing about this. Ric

Yesterday I read that that's a long-standing falsity about ZFS - the only situation in ZFS where RAM becomes significant (for performance) is in data deduplication - which is different again from COW and its benefits.
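A back-of-envelope sketch of why the dedup table (DDT) - and only the dedup table - eats RAM. The ~320 bytes per DDT entry and the 64K average block size below are commonly cited rules of thumb, not guarantees; real entry sizes depend on the pool:

```python
# ZFS keeps one dedup-table (DDT) entry per unique block; the whole table
# wants to live in RAM (or L2ARC/SSD) for decent write performance.
# ~320 bytes/entry is a widely quoted rule of thumb, NOT a spec.
DDT_BYTES_PER_BLOCK = 320

def ddt_ram_estimate(pool_bytes, avg_block_bytes=64 * 1024):
    """Rough RAM needed to hold the dedup table for a given pool size."""
    unique_blocks = pool_bytes // avg_block_bytes
    return unique_blocks * DDT_BYTES_PER_BLOCK

tib = 1024**4
for size_tib in (1, 4, 8):
    gib = ddt_ram_estimate(size_tib * tib) / 1024**3
    print(f"{size_tib} TiB of unique 64K blocks -> ~{gib:.0f} GiB of DDT")
```

That's where the oft-repeated "~5 GB of RAM per TB of deduped data" figure comes from - and why dedup off means the RAM pressure largely goes away.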
See here: http://en.wikipedia.org/wiki/ZFS#Deduplication

These days an SSD for storing the deduplication tables is an easy way to handle this situation if memory (and performance) is precious in your deployment [[and you want to enable deduplication]]. Either way, it appears just about everything, including memory use, is configurable - so it would make sense to get at least a little familiar with it if you made your root filesystem ZFS.

I can't speak to XFS - it may be better for a single-user workstation root drive, I don't know, sorry. I do know that for large disks (by today's standards), ZFS nails the "certainty of bitrot" problem - which, if one's data or photos or whatever is precious, is probably significant no matter how small the storage is. It's just that with a small dataset it's easy to duplicate manually - but even then, automated repair (e.g. ZFS periodic scrubbing) is less error prone than manual backups, of course [[when combined with some form of ZFS RAIDZ]].

These pages seemed quite useful yesterday:
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
https://calomel.org/zfs_raid_speed_capacity.html
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Z

---------- Forwarded message ----------
From: Zenaan Harkness
Date: Wed, 27 May 2015 00:46:29 +1000
Subject: Re: Thank Ramen for ddrescue!!!

> On 26 May 2015 14:34, "Zenaan Harkness" <z...@freedbms.net> wrote:
>> > On 26 May 2015 12:31, "Zenaan Harkness" <z...@freedbms.net> wrote:
>> >> Reminds me, also that I've been reading heaps about zfs over the last
>> >> couple days, HDD error rates are close to biting us with current gen
>> >> filesystems (like ext4). Armour plate your arse with some ZFS- or
>> >> possibly the less battle tested BTRFS- armour.
>> >>
>> >> At one URE (UnRecoverable Errors) rate in 10^14 bits read from a drive
>> >> (most consumer drives are 10^14 - one advertises 2^15, and enterprise
>> >> drives are usually 2^16), we're talking 1 bit flip, on average, in
>> >> 10^14 bits read, whilst:
>> >
>> > Base 10 or base 2? It's an order of magnitude of difference here, or
>> > one thousand more errors, so kinda a big deal...
>>
>> Base 10. And the difference is much more than an order of magnitude:
>> 2^14 = 16384
>> 10^14 = 100000000000000
>>
>> Unless I'm not understanding what you're asking...
>
> You've used both bases in your post, and it's not clear whether you meant
> that or it was a typo.

Indeed. The numbers are staggering. And the fact that we can now buy consumer 8TB drives, which essentially guarantee the buyer a bit flip on reading (and/or bit rot as stored) every drive's worth of data, is really mind blowing - as is the fact that such error guarantees are not yet widely discussed or realized. I guess the "average home user" just dumps photos, music and movies on their drives, and relatively rarely reads them back off, and so the awareness is just not there.

And up until yesterday I've been an average home user from a drive URE-rate perspective - been all but oblivious. It's sorta been like: "oh yeah, I know they include error rates if you look at the specs, but this is, like, you know, an engineered product, and products have, you know, at least one-year warranties, and it's all engineering tolerances and stuff, and those engineers know what they're doing, so I don't have to worry. Right?"

Well, turns out we need to worry, and in fact these bit flips are now all but a certainty. There's the odd web page about where a fastidious individual has kept a record over the years of corrupt files. Those error rates are actual - neither optimistic nor pessimistic, it seems.
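To see how we drifted into guarantee territory, here's a quick sketch of expected UREs per full-drive read as drives grew while the rated URE rate stood still (assuming errors arrive at exactly the rated rate - illustrative only):

```python
# Expected unrecoverable read errors when reading a whole drive once,
# across drive generations and common URE ratings (10^14..10^16).
def expected_ures(drive_bytes, ure_rate_bits):
    """Expected URE count for one full read of the drive."""
    return (drive_bytes * 8) / ure_rate_bits

sizes = {"10 MB": 10e6, "10 GB": 10e9, "1 TB": 1e12, "10 TB": 10e12}
for label, size in sizes.items():
    row = ", ".join(f"10^{exp}: {expected_ures(size, 10**exp):.1e}"
                    for exp in (14, 15, 16))
    print(f"{label:>6} drive -> expected UREs per full read ({row})")
```

A 10MB drive at 10^14 expects one URE per ~million full reads; a 10TB drive at the same rating expects 0.8 per single read. Same rate, six orders more data.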
Of course they're averages and they're rates, but from everything I've read in the last two days, they're relatively accurate engineering guarantees. It used to be a guarantee that you would get no bit flips, on average, unless you'd read/written simply enormous amounts. Now that engineering amount is equal to about one (large) drive of data!

I just keep shaking my head, having never realized the significance of all this prior to, oh idk, roughly say, yesterday. Might have been about 11pm. Although it's now tomorrow, so if my engineering calculations are right, that may have actually been the day before. I think I need sleep. :)

Z

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html