On Thu, Oct 13, 2016 at 12:33:31AM +0500, Roman Mamedov wrote:
> On Wed, 12 Oct 2016 15:19:16 -0400
> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
> > I'm not even sure btrfs does this--I haven't checked precisely what
> > it does in dup mode.  It could send both copies of metadata to the
> > disks with a single barrier to separate both metadata updates from
> > the superblock updates.  That would be bad in this particular case.
> It would be bad in any case, including a single physical disk and no RAID, and

No, a single disk does not have these problems.  On a single disk we don't
have to deal with temporarily corrupted metadata _outside_ the areas we
are writing, because the disk confines damaged data to individual sectors.
On RAID5, damage is confined only at the stripe level, a unit orders
of magnitude larger than a sector.
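
A toy model of the difference, assuming a three-disk RAID5 stripe with
plain XOR parity (the block names and contents are made up for
illustration; this is not mdadm code): an interrupted update of one data
block leaves parity stale, and a later rebuild then corrupts a block that
was never written.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR, as used for RAID5 parity."""
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical 3-disk stripe: two data blocks plus one parity block.
d0 = b"AAAA"
d1 = b"BBBB"
parity = xor_blocks(d0, d1)

# Update d0 in place; the system crashes after the data write lands
# but before the matching parity write does, so parity is now stale.
d0 = b"CCCC"

# Later the disk holding d1 fails, so d1 must be rebuilt from survivors.
rebuilt_d1 = xor_blocks(d0, parity)
print(rebuilt_d1)  # prints b'@@@@', not b'BBBB': d1 was never written
```

On a single disk the same crash could only damage the sector holding d0.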

> I don't think there's any basis to speculate that mdadm doesn't implement
> write barriers properly.

btrfs and mdadm have to use them properly together.  It's possible to
get it fatally wrong from the btrfs side even if mdadm does everything
perfectly.  Single disks don't have stripe consistency requirements,
so if btrfs makes single-disk assumptions about how writes behave,
it can do the wrong thing on multi-disk systems.

> > In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there
> > is an interruption (system crash, a disk times out, etc) in degraded mode,
> Moreover, in any non-COW system writes temporarily corrupt data. So again,
> writing to a (degraded or not) mdadm RAID5 is not much different than writing
> to a single physical disk. However I believe in the Btrfs case metadata is
> always COW, so this particular problem may be not as relevant here in the
> first place.

Degraded RAID5 does not behave like a single disk.  That's the point
people seem to keep missing when thinking about this.  btrfs CoW relies
on single-disk behavior, and fails badly when it doesn't get it.

btrfs CoW requires that writes to one sector don't modify or jeopardize
data integrity in any other sectors.  mdadm in degraded raid5/6 mode with
no stripe journal device cannot meet this requirement.  Writes always
temporarily disrupt data on other disks in the same RAID stripe.  Each
individual disruption lasts only milliseconds, but there may be hundreds
or thousands of failure windows per second.
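
The same kind of toy model, now degraded (assume the disk holding d1 is
gone, so d1 exists only as d0 XOR parity), shows why neither write order
helps: updating d0 means writing both d0 and parity, and between those two
device writes a reader reconstructs garbage for d1.  The function names
and values are hypothetical, and the model ignores md's stripe cache and
the journal option.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR, as used for RAID5 parity."""
    return bytes(x ^ y for x, y in zip(a, b))


def read_missing(d0: bytes, parity: bytes) -> bytes:
    """d1's disk is gone, so every read of d1 is a reconstruction."""
    return xor_blocks(d0, parity)


def degraded_write_d0(d0, parity, new_d0, parity_first):
    """Toy degraded RAID5 update of d0 (hypothetical model, not mdadm code).

    Returns what a reader would reconstruct for d1 after each of the two
    device writes.
    """
    d1 = read_missing(d0, parity)  # read-modify-write: recover d1 first
    new_parity = xor_blocks(new_d0, d1)
    states = []
    if parity_first:
        parity = new_parity
        states.append(read_missing(d0, parity))  # window between the writes
        d0 = new_d0
    else:
        d0 = new_d0
        states.append(read_missing(d0, parity))  # window between the writes
        parity = new_parity
    states.append(read_missing(d0, parity))  # both writes have landed
    return states


old_d0, old_d1 = b"AAAA", b"BBBB"
old_parity = xor_blocks(old_d0, old_d1)
for parity_first in (True, False):
    print(parity_first, degraded_write_d0(old_d0, old_parity, b"CCCC", parity_first))
# prints "True [b'@@@@', b'BBBB']" then "False [b'@@@@', b'BBBB']"
```

A crash in that window leaves d1 wrong even though nothing ever wrote to
d1, which is exactly the cross-sector damage btrfs CoW assumes can't happen.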

> -- 
> With respect,
> Roman
