On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB <jonandermonl...@gmail.com> wrote:
>
> It would be interesting to know the pros and cons of this setup that
> you are suggesting vs zfs.
> +zfs detects and corrects bitrot (
> http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
> +zfs has working raid56
> -modules out of kernel for license incompatibilities (a big minus)
>
> BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
> to find any conclusive doc about it right now)

Yes. Active fixups with scrub since 3.19, and passive fixups (repair
during normal reads) since 4.12.
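
To make that concrete, here's a rough illustration (not part of
btrfs-progs, and the /mnt/pool mount point is made up) of kicking off
an active fixup with a foreground scrub and then checking the
per-device error counters for anything that was found:

#!/usr/bin/env python3
# Rough illustration only: run a foreground scrub, then print the
# per-device error counters. Assumes the filesystem is mounted at
# /mnt/pool (made-up path) and that btrfs-progs is installed.
import subprocess

MNT = "/mnt/pool"

# -B keeps the scrub in the foreground so it finishes before we look
# at the counters.
subprocess.run(["btrfs", "scrub", "start", "-B", MNT], check=True)

# Corruption that scrub found (and repaired, where a good copy or
# parity existed) shows up in the device statistics, e.g. corruption_errs.
stats = subprocess.run(["btrfs", "device", "stats", MNT],
                       capture_output=True, text=True, check=True)
print(stats.stdout)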

> I'm one of those that is waiting for the write hole bug to be fixed in
> order to use raid5 on my home setup. It's a shame it's taking so long.

For what it's worth, the write hole is considered to be rare.
https://lwn.net/Articles/665299/

Further, the write hole means a) parity is corrupt or stale relative to
the data stripe elements, caused by a crash or power loss during writes,
and b) there is subsequently a missing device or bad sector in the same
stripe as the corrupt/stale parity stripe element. The effect of b) is
that reconstruction from parity becomes necessary, and the effect of a)
is that the reconstruction is wrong, hence corruption. But Btrfs detects
this corruption, whether it's metadata or data, so the corruption isn't
propagated in any case. It does make the filesystem fragile if this
happens with metadata: stale parity there likely means a badly wrong
reconstruction that can't be worked around, and even btrfs check
probably can't fix it. If the write hole hits a data block group, the
result is EIO. The good news is that none of this results in silent
data or filesystem metadata corruption; you will know about it.
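
As a toy illustration of a) plus b), with single parity reduced to
plain XOR (nothing here is btrfs code, just the arithmetic of the
argument): stale parity plus a lost data element yields a wrong
reconstruction, and the checksum is what catches it.

# Toy model of the write hole with single parity reduced to plain XOR.
import zlib

d1, d2 = b"AAAA", b"BBBB"              # two data elements in one stripe
parity = bytes(a ^ b for a, b in zip(d1, d2))
csum_d2 = zlib.crc32(d2)               # d2's checksum, long since committed

# Partial stripe write: d1 is rewritten, then we crash before the
# parity update lands -> parity is now stale. That's a).
d1 = b"CCCC"

# Later, the device holding d2 dies -> reconstruction needed. That's b).
d2_rebuilt = bytes(a ^ p for a, p in zip(d1, parity))

assert d2_rebuilt != d2                      # reconstructed incorrectly
assert zlib.crc32(d2_rebuilt) != csum_d2     # but the csum mismatch is
# detected, so it's EIO (data) or a failed metadata read, not silent
# corruption handed back to the application.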

This is why a scrub after a crash or power loss is important with
raid56, while the array is still whole (not degraded). The two problems
with that are:

a) the scrub isn't initiated automatically, nor is it obvious to the
user that it's necessary
b) the scrub can take a long time; Btrfs has no partial scrubbing.

Whereas mdadm arrays offer a write-intent bitmap to know which blocks
to partially scrub, and to trigger that scrub automatically following a
crash or power loss.
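
For comparison, roughly how you'd check for and add that bitmap on an
md array (sketch only, /dev/md0 is a made-up device name):

# Check whether an md array has a write-intent bitmap, add an internal
# one if not. Sketch only; device name is made up.
import subprocess

MD = "/dev/md0"

detail = subprocess.run(["mdadm", "--detail", MD],
                        capture_output=True, text=True, check=True).stdout

if "Intent Bitmap" not in detail:
    # An internal bitmap lets md resync only the regions that were
    # dirty at crash time, instead of the whole array.
    subprocess.run(["mdadm", "--grow", "--bitmap=internal", MD], check=True)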

It seems Btrfs already has enough on-disk metadata to infer a
functional equivalent of the write-intent bitmap, via transid. Just
scrub the last ~50 generations the next time the filesystem is mounted:
either do this every time a Btrfs raid56 is mounted, or add a flag that
lets Btrfs know the filesystem was not cleanly shut down. It's possible
50 generations could be a lot of data, but since it's an online scrub
triggered after mount, it wouldn't add much to mount time. I'm also
picking 50 generations arbitrarily; there's no basis for that number.
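
To make the idea slightly more concrete, a purely hypothetical sketch
of the trigger side (nothing like this exists in btrfs-progs; the
device, mount point, and the generation-limited scrub itself are all
assumptions, and lacking that scrub it just falls back to a full one):

# Hypothetical sketch only. Read the current superblock generation,
# work out the cutoff for "the last ~50 generations", and scrub.
import subprocess

DEV, MNT = "/dev/sdb", "/mnt/pool"    # made-up names
GENERATIONS_BACK = 50                 # arbitrary, as noted above

out = subprocess.run(["btrfs", "inspect-internal", "dump-super", DEV],
                     capture_output=True, text=True, check=True).stdout
gen = next(int(line.split()[-1]) for line in out.splitlines()
           if line.startswith("generation"))

cutoff = gen - GENERATIONS_BACK
print(f"would scrub only extents with transid > {cutoff} (not implemented);")
print("falling back to a full online scrub")
subprocess.run(["btrfs", "scrub", "start", MNT], check=True)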

The above doesn't cover the case of a partial stripe write (which is
what leads to the write hole), plus a crash or power loss, plus one or
more device failures at the same time. In that case there's no window
for a partial scrub to fix the problem before reconstruction is needed,
so even if the corruption is detected, it's too late to fix. But at
least an automatic partial scrub, even run degraded, would flag the
uncorrectable problem to the user before they get too far along.


-- 
Chris Murphy
