Austin S Hemmelgarn posted on Thu, 09 Jan 2014 07:52:44 -0500 as excerpted:
> On 2014-01-09 07:41, Duncan wrote:
>> Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 +0000 as excerpted:
>>
>>> If a [btrfs ]block is read and fails its checksum, then the other
>>> copy (in RAID-1) is checked and used if it's good.  The bad copy is
>>> rewritten to use the good data.
>>
>> This is why I'm so looking forward to the planned N-way-mirroring,
>> aka true-raid-1, feature, as opposed to btrfs' current 2-way-only
>> mirroring.  Having checksumming is good, and a second copy in case
>> one fails the checksum is nice, but what if they BOTH do?  I'd love
>> to have the choice of (at least) three-way-mirroring, as for me that
>> seems the best practical hassle/cost vs. risk balance I could get,
>> but it's not yet possible. =:^(
>>
> Just a thought, you might consider running btrfs on top of LVM in the
> interim, it isn't quite as efficient as btrfs by itself, but it does
> allow N-way mirroring (and the efficiency is much better now that
> they have switched to RAID1 as the default mirroring backend)

Except... AFAIK LVM is like mdraid in that regard -- no checksums,
leaving the software entirely at the mercy of the hardware's ability to
detect and properly report failure.

In fact, it's exactly as bad as that: while both lvm and mdraid offer
N-way-mirroring, on a normal read they fetch a single unchecksummed
copy from whichever mirror they happen to pick, and use whatever they
get, without even comparing it against the other copies to see that
they match, let alone taking a majority vote on which copy is valid if
they don't.  The ONLY way they know there's an error at all (unless the
hardware reports one) is if a deliberate scrub is done.

And the raid5/6 parity-checking isn't any better: while those parities
are written, they're never checked or otherwise actually used except in
recovery.  Normal read operation is just like raid0 -- only the
device(s) containing the data itself is(are) read, with no
parity/checksum checking at all, even tho the trouble was taken to
calculate and write it out.  When I had mdraid6 deployed and realized
that, I switched back to raid1 (which would have been raid10 on a
larger system), because while I considered the raid6 performance costs
worth it for parity checking, they most definitely weren't once I
realized all those calculations and writes were for nothing unless an
actual device died, and raid1 gave me THAT level of protection at far
better performance.

Which means neither lvm nor mdraid solves the problem at all.  Even
btrfs on top of them won't solve it, while adding all sorts of
complexity, because btrfs still has only the two-way check, and if one
device in the underlying mirrors gets corrupted but another actually
returns the data, btrfs will be entirely oblivious.

What one /could/ in theory do at the moment, altho it's hardly worth it
due to the complexity[1] and the fact that btrfs itself is still a
relatively immature filesystem under heavy development, and thus not
suited to being part of such extreme solutions yet, is layered raid1
btrfs on loopback over raid1 btrfs: say four devices, with separate
on-the-hardware-device raid1 btrfs on two pairs, a single huge
loopback-file on each lower-level btrfs, and raid1 btrfs layered on top
of the loopback devices too, manually creating an effective 4-device
btrfs raid11.  Or use btrfs raid10 at one or the other level and make
it an 8-device btrfs raid101 or raid110.

Tho as I said, btrfs' maturity level in general is a mismatch for such
extreme measures, at present.  But in theory...
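FWIW, here's a minimal sketch of that 4-device raid11 idea, entirely
untested here, with placeholder device names, mountpoints and sizes:

  # lower level: two independent 2-device btrfs raid1 filesystems
  mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb
  mkfs.btrfs -d raid1 -m raid1 /dev/sdc /dev/sdd
  mkdir -p /mnt/lower1 /mnt/lower2 /mnt/upper
  mount /dev/sda /mnt/lower1
  mount /dev/sdc /mnt/lower2

  # one huge file on each lower filesystem, attached to a loop device
  truncate -s 1T /mnt/lower1/upper.img
  truncate -s 1T /mnt/lower2/upper.img
  loop1=$(losetup --find --show /mnt/lower1/upper.img)
  loop2=$(losetup --find --show /mnt/lower2/upper.img)

  # upper level: btrfs raid1 across the two loop devices
  mkfs.btrfs -d raid1 -m raid1 "$loop1" "$loop2"
  mount "$loop1" /mnt/upper

Every block of the upper filesystem ends up checksummed at two layers
and stored four times, at the cost of quadruple writes, COW-on-COW
overhead, and exactly the sort of recovery complexity footnote [1]
below is talking about.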
ZFS is arguably a more practically viable solution, as it's mature and
ready for deployment today, tho there are legal/license issues with the
Linux kernel module and the usual userspace performance issues with the
fuse alternative (tho the btrfs-on-loopback-on-btrfs solution above
wouldn't be free of performance issues either).  I'm sure that's why a
lot of folks needing multi-mirror checksum-verified reliability remain
on Solaris/OpenIndiana/ZFS-on-BSD, as Linux simply doesn't /have/ a
solution for that yet.  Btrfs /will/ have it, but as I explained, it's
taking a while.

---
[1] Complexity: Complexity can be the PRIMARY failure factor when an
admin must understand enough about the layout to reliably manage
recovery while already under the extreme pressure of a
disaster-recovery situation.  If the complexity of even an otherwise
100%-reliable solution is high enough that the admin isn't confident of
their ability to manage it, then the admin themselves becomes the weak
link in the reliability chain!!  That's the reason I tried and
ultimately dropped lvm over mdraid here: I couldn't be confident in my
ability to understand both well enough to recover from disaster without
admin error.  Thus, higher complexity really *IS* a SERIOUS negative in
this sort of discussion, since it can be *THE* failure factor!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman