On Mon, Sep 19, 2016 at 06:31:22PM -0400, Zygo Blaxell wrote:
> On Mon, Sep 19, 2016 at 04:25:55PM -0600, Chris Murphy wrote:
> > >> Files are modified by creating new extents (using parameters inherited
> > >> from the inode to fill in the extent attributes) and updating the inode
> > >> to refer to the new extent instead of the old one at the modified
> > >> offset. Cloned extents are references to existing extents associated
> > >> with a different inode or at a different place within the same inode (if
> > >> the extent is not compatible with the destination inode, clone fails
> > >> with an error).  A snapshot is an efficient way to clone an entire
> > >> subvol tree at once, including all inodes and attributes.
> > >
> > > There is the caveat of chattr +C, which would need to be
> > > hard-disabled for extent-level encryption (vs block level).
> > 
> > What about raid56 partial stripe writes? Aren't these effectively nocow?
> 
> Those are a straight-up bug that should be fixed.  They mix committed
> data with uncommitted data from two different transactions, so the
> stripe temporarily contains garbage.  Combine that with an unclean
> shutdown in degraded mode and the data is gone.

A slightly more detailed answer:

nocow and raid56 partial stripe writes are entirely different classes
of problem.  A nocow write can only corrupt the data it overwrites,
while a raid56 partial stripe write can corrupt unrelated extents that
merely share a stripe with the modified data.

Even in non-degraded mode, a stripe caught by an interrupted partial
write cannot be recovered from parity until the parity is reconstructed
(e.g. by a scrub, or by a later write to the same stripe in
non-degraded mode).
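
To make that concrete, here is a minimal user-space sketch of the
write hole: one 3-disk raid5 stripe with 1-byte blocks.  This is
illustrative C with made-up values, not btrfs code:

#include <stdio.h>

int main(void)
{
    /* Committed stripe: D0 and D1 belong to two unrelated extents. */
    unsigned char d0 = 0xAA, d1 = 0xBB;
    unsigned char parity = d0 ^ d1;     /* consistent parity */

    /* Partial stripe write (read-modify-write): the new D0 block
     * reaches the disk... */
    d0 = 0x42;
    /* ...but we crash before the matching parity write, so the
     * stripe now mixes new data with stale parity. */

    /* The disk holding D1 fails.  Reconstruct D1 from parity: */
    unsigned char recovered_d1 = d0 ^ parity;

    printf("original D1:  0x%02X\n", 0xBBu);
    printf("recovered D1: 0x%02X\n", (unsigned)recovered_d1);
    return 0;
}

D1 was never written, yet reconstructing it from the half-updated
stripe returns garbage (0x53 instead of 0xBB).  A scrub would have
recomputed parity as 0x42 ^ 0xBB and made the stripe consistent again.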

If one of the disks is significantly slower or has deeper queues than
the others, this could affect many extents at once, since btrfs can
submit a lot of writes to each disk and then wait asynchronously for
all of them to finish, leaving a large time window for interruption.

If a disk fails after an unclean shutdown but before a scrub is complete,
data in all of the uncorrected stripes will be lost.  If the array enters
or is already in degraded mode during a write when an unclean shutdown
occurs, data will be lost immediately.

Users who don't scrub immediately after unclean shutdowns are sitting on
a ticking time bomb of corruption that explodes when a disk fails.

If this happens to data extents, only file data is lost.  If it happens
to metadata extents, the filesystem is severely damaged or destroyed
(more likely destroyed, since the roots of the metadata trees are
usually the most recently written blocks).

mdadm avoids this by scrubbing immediately after an unclean shutdown
to minimize the vulnerable window (or by using the new stripe
journalling feature), but it still fails, with severe filesystem
damage, when a crash happens in degraded mode.  ZFS avoids it with a
combination of dynamic stripe width (to steer writes around failed
devices) and the ZIL journal.
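
The journalling idea reduces to "log the complete new stripe before
touching the live one."  Here is a minimal in-memory sketch of that in
C; the struct and the recovery step are illustrative only, not mdadm's
actual on-disk journal format:

#include <stdio.h>

struct stripe { unsigned char d0, d1, parity; };

int main(void)
{
    struct stripe disk = { 0xAA, 0xBB, 0xAA ^ 0xBB };  /* committed */
    struct stripe journal;                             /* stable log */

    /* Read-modify-write: build the complete new stripe in memory. */
    struct stripe new_stripe = disk;
    new_stripe.d0 = 0x42;
    new_stripe.parity = new_stripe.d0 ^ new_stripe.d1;

    /* 1. Log the whole stripe; this must reach stable storage first. */
    journal = new_stripe;

    /* 2. Torn in-place write: D0 lands, the parity write is lost. */
    disk.d0 = new_stripe.d0;

    /* 3. Recovery after the crash: replay the journalled stripe. */
    disk = journal;

    printf("D1 via parity: 0x%02X (expected 0xBB)\n",
           (unsigned)(disk.d0 ^ disk.parity));
    return 0;
}

The price is that every modified stripe is written twice.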

The best thing to do is rework the raid56 layer (and probably some
higher layers in btrfs) until there are no further references to
raid56_rmw_stripe or async_rmw_stripe, then remove those functions and
never put them back.
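
Removing rmw presumably leaves only full-stripe writes: compute a
complete new stripe (all data blocks plus parity) at a fresh location,
then flip a pointer, CoW-style, so a committed stripe is never
half-rewritten.  A minimal sketch of why that closes the hole (again
illustrative C, with a made-up one-entry mapping table, not btrfs's
chunk or extent layers):

#include <stdio.h>

struct stripe { unsigned char d0, d1, parity; };

int main(void)
{
    struct stripe stripes[2] = { { 0xAA, 0xBB, 0xAA ^ 0xBB } };
    int live = 0;               /* logical -> physical stripe mapping */

    /* Modify D0 by writing a complete new stripe somewhere else. */
    struct stripe *fresh = &stripes[1];
    fresh->d0 = 0x42;
    fresh->d1 = stripes[live].d1;
    fresh->parity = fresh->d0 ^ fresh->d1;
    /* A crash anywhere up to here leaves stripe 0 fully intact. */

    live = 1;                   /* single atomic pointer flip */

    printf("D1 via parity: 0x%02X (expected 0xBB)\n",
           (unsigned)(stripes[live].d0 ^ stripes[live].parity));
    return 0;
}

A crash can then lose at most the in-flight write, never a stripe that
was already committed.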
