On 2019-09-02 1:21 p.m., waxhead wrote:

> 5. DEVICE REPLACE: (Using_Btrfs_with_Multiple_Devices page)
> It is not clear what to do to recover from a device failure on BTRFS.
> If a device is partly working, you can run the replace functionality
> and hopefully you're good to go afterwards. Fine, but if that does not
> work, or you have a completely failed device, it is a different story.
> My understanding of it is:
> If not enough free space (or devices) is available to restore redundancy,
> you first need to add a new device, and then you need to A: run a
> metadata balance (to ensure that the filesystem structures are redundant)
> and then B: run a data balance to restore redundancy for your data.
> Are there any filters that can be applied to only restore chunks which
> have a missing mirror / stripe member?

If you are adding a new device of the same size or larger than the device
you are replacing, do not run balances; you can still just do a device
replace.  The only difference is that if the failed device is missing
entirely, you have to specify the device ID of the missing device
(rather than a /dev/sd? path).
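
For example, a minimal sketch assuming the pool is mounted at /mnt, the
missing device had devid 2, and /dev/sdd is the new disk (all of these
names are made up; adjust them to your setup):

  # if the pool refuses to mount normally with a device missing,
  # mount it degraded first
  mount -o degraded /dev/sdb /mnt

  # confirm which devid is missing
  btrfs filesystem show /mnt

  # replace the missing device (referenced by devid) with the new disk
  btrfs replace start 2 /dev/sdd /mnt

  # watch progress
  btrfs replace status /mnt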


> 
> 6. RAID56 (status page)
> RAID56 has had the write hole problem for a long time now, but it
> is not well explained what the consequence of it is for data -
> especially if you have metadata stored in raid1/10.
> If you encounter a power loss / kernel panic during a write - what will
> actually happen?
> Will a fresh file simply be missing or corrupted (as in, partly written)?
> If you overwrite/append to an existing file - what is the consequence
> then? Will you end up with... A: the old data, or B: corrupted or zeroed
> data?! This is not made clear in the comment and it would be great if
> we, the BTRFS users, could understand what the risk of hitting the write
> hole actually is.

The parity data from an interrupted write will be missing or corrupt.  This
will in turn affect old data, not just the data you were writing.  The
write hole is only of consequence if you are reading the array
degraded (i.e., a drive has failed or is missing; though unlikely, it
would also be a problem if you happen to develop a bad sector in
the same stripe as the corrupt parity).

If the corruption affects metadata, the consequences can be anything
from minor damage to a completely unreadable filesystem.

If it affects data blocks, some files will be unreadable, but they can
simply be deleted/restored from backup.
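
If you want to see which files were hit, scrub logs data checksum errors
to the kernel log along with the affected path (the exact wording varies
by kernel version), so something along these lines can give a rough list:

  dmesg | grep -i 'btrfs.*checksum error'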

As you noted, metadata can be made raid1, which will at least prevent a
complete filesystem meltdown from the write hole.  But until the patches
that increase the number of copies in mirrored raid land, there is no way
to make the pool tolerant of two device failures, so raid6 is mostly
useless.  (Arguably, raid6 data would still be much more likely to recover
from an unreadable sector while recovering from a missing device.)
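
For example, converting only the metadata of an existing pool to raid1,
leaving the data profile untouched (mount point is hypothetical):

  btrfs balance start -mconvert=raid1 /mnt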

It's also important to understand that unlike most other (all other?)
raid implementations, BTRFS will not, by itself, fix parity when it
restarts after an unclean shutdown.  It's up to the administrator to run
a scrub manually.  Otherwise, parity errors will accumulate with each
unclean shutdown and, in turn, result in unrecoverable data if the array
is later degraded.
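
In practice that means running something like this by hand after every
unclean shutdown of a raid5/6 pool (mount point is hypothetical):

  # verify checksums and rewrite any stale/corrupt parity found
  btrfs scrub start /mnt

  # check the outcome once it completes
  btrfs scrub status /mnt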

> 
> 13. NODATACOW:
> As far as I can remember there were some issues regarding NOCOW
> files/directories on the mailing list a while ago. I can't find any
> issues related to nocow on the wiki (I might not have searched enough),
> but I don't think they are fixed, so maybe someone can verify that.
> And by the way... are NOCOW files still not checksummed? If so, are
> there plans to add that? (It would be especially nice to know whether a
> nocow file is correct or not.)
>

AFAIK, checksums on nocow files are technically not possible, so no plans
exist to even try adding that functionality.  If you think about it,
any unclean filesystem stop while nocow data is being written would
result in inconsistent checksums, so it would be self-defeating.

As for the nocow problems, they have to do with mirrored raid.  Without COW
or checksums, BTRFS has no method whatsoever of keeping raid-mirrored
data consistent.  In the case of an unclean stop while data is being
written, the two copies will be different, and which copy gets read at
any time is entirely up to the fates.  Not only will BTRFS not
synchronize the mirrored copies by itself on next boot, it won't even
fix them in a scrub.
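
For reference, nocow is controlled with the 'C' file attribute (paths
here are hypothetical), and it only takes effect on files created while
the attribute is already set, which is why it is normally applied to a
directory:

  # new files created in this directory will be nocow
  chattr +C /mnt/vm-images

  # verify: the attribute listing should include 'C'
  lsattr -d /mnt/vm-images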

This behaviour, as you noted, is still undocumented after my little
outburst a few months back.  IMO, it's pretty bad.




