On Fri, Oct 14, 2016 at 04:30:42PM -0600, Chris Murphy wrote:
> Also, is there RMW with raid0, or even raid10? 

No.  Mirroring is writing the same data in two isolated places.  Striping
is writing different data in isolated places.  No matter which sectors
you write through these layers, the write cannot affect the correctness
of data in any sector at a different logical address.  Neither uses
RMW--you read or write only complete sectors and act only on the specific
sectors requested.  Only parity RAID does RMW.

e.g. in RAID0, when you modify block 47, you may actually modify block
93 on a different disk, but there's always a 1:1 mapping between every
logical and physical address.  If there is a crash we go back to an
earlier tree that does not contain block 47/93 so we don't care if the
write was interrupted.
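
The 1:1 mapping can be sketched in a few lines of Python (a hypothetical
layout with two disks and a strip size of one block; `raid0_map` and the
disk count are illustrative, not btrfs code):

```python
# Hypothetical sketch: logical -> physical block mapping in a 2-disk
# RAID0 with a strip size of one block.  Every logical block maps to
# exactly one (disk, physical block) pair, so writing one logical block
# can never affect data at any other logical address.

NUM_DISKS = 2

def raid0_map(logical_block):
    """Map a logical block number to a (disk index, physical block) pair."""
    disk = logical_block % NUM_DISKS
    physical = logical_block // NUM_DISKS
    return disk, physical

# Writing logical block 47 touches exactly one physical location:
print(raid0_map(47))  # -> (1, 23)
```

The mapping is injective, which is the whole point: no two logical
addresses ever share a physical sector, so a torn write can only hurt
the block that was being written.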

e.g. in RAID1, when you modify block 47, you modify physical block 47 on
two separate disks.  The state of disk1-block47 may be different from
the state of disk2-block47 if the write is interrupted.  If there is a
crash we go back to an earlier tree that does not contain either copy
of block 47 so we don't care about any inconsistency there.

So raid0, single, dup, raid1, and raid10 are OK--they fall into one or
both of the above cases.  CoW works there.  None of these properties
change in degraded mode with the mirroring profiles.

Parity RAID is writing data in non-isolated places.  When you write to
some sectors, the parity update implicitly modifies what other sectors
in the same stripe would read back in degraded mode (whether you are in
degraded mode at the time of the write or not).  This is different from
the other cases because the other cases never modify any sectors that
were not explicitly requested by the upper layer.  This is OK if and
only if the CoW layer is aware of this behavior and works around it.
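
The dependency can be modelled with XOR parity (a minimal sketch of a
5-disk RAID5 stripe; `parity` and `reconstruct` are made-up helper names,
not real code from any RAID implementation):

```python
# Hypothetical sketch of the parity-RAID coupling: in degraded mode,
# every data block in a stripe is reconstructed FROM the parity, so a
# write to one data sector implicitly changes what degraded-mode reads
# of the other sectors would return unless parity is updated in lockstep.

def parity(blocks):
    """XOR parity over a list of equal-length byte blocks."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def reconstruct(surviving, par):
    """Degraded-mode read: rebuild the one missing block from the rest."""
    return parity(surviving + [par])

data = [b'A', b'B', b'C', b'D']       # 4 data blocks of a 5-disk stripe
p = parity(data)

# RMW update of data block 2: read old data and parity, write both anew.
new = b'X'
p_new = parity([p, data[2], new])     # parity delta update: p ^ old ^ new
data[2] = new

# A degraded read of block 0 only works because parity moved with the data:
assert reconstruct(data[1:], p_new) == data[0]
```

If the data write lands but the parity write does not (or vice versa),
that final reconstruction silently returns the wrong bytes.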

> Or is that always CoW
> for metadata and data, just like single and dup? 

It's always CoW at the higher levels, even for parity RAID.  The problem
is that the CoW layer is not aware of the RMW behavior buried in the
parity RAID layer, so the combination doesn't work properly.

CoW thinks it's modifying only block 47, when in fact it's modifying
an entire stripe in degraded mode.  Let's assume 5-disk RAID5 with a
strip size of one block for this example, and say blocks 45-48 are one
RAID stripe.  If there is a crash, data in blocks 45, 46, 47, and 48
may be irretrievably damaged by inconsistent modification of parity and
data blocks.  When we try to go back to an earlier tree that does not
contain block 47, we will end up with a tree that contains corruption in
one of the blocks 45, 46, or 48.  This corruption will only be visible
when something else goes wrong (parity mismatch, data csum failure,
disk failure, or scrub) so a damaged filesystem that isn't degraded
could appear to be healthy for a long time.
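
That failure mode (the classic "write hole") can be simulated directly
(a hypothetical sketch using the stripe layout from the example above;
`xor` is an illustrative helper):

```python
# Hypothetical sketch of the write hole: a 5-disk RAID5 stripe holds
# data blocks 45-48 plus parity.  An RMW of block 47 is interrupted
# after the data write but before the parity write.  Later the disk
# holding block 45 fails, and reconstruction from the now-stale parity
# returns garbage for a block nobody ever asked to modify.

def xor(*blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

b45, b46, b47, b48 = b'E', b'F', b'G', b'H'
par = xor(b45, b46, b47, b48)        # consistent stripe on disk

b47 = b'Z'                           # crash: data hit disk, parity didn't

# Disk holding block 45 dies; degraded read rebuilds it from the rest:
rebuilt = xor(b46, b47, b48, par)
assert rebuilt != b45                # block 45 is silently corrupted
```

Note that block 45 was never written by anyone; it was damaged purely by
the inconsistent state of block 47 and the parity.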

If the CoW layer is aware of this, it can arrange operations such
that no stripe is modified while it is referenced by a committed tree.
Suppose the stripe at blocks 49-52 is empty, so we write our CoW block at
block 49 instead of 47.  Since blocks 50-52 contain no data we care about,
we don't even have to bother reading them (just fill the other blocks
with zero or find some other data to write in the same commit), and we
can eliminate many slow RMW operations entirely*.  If there is a crash
we just fall back to an earlier tree that does not contain block 49.
This tree is not damaged because we left blocks 45-48 alone.
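
A stripe-aware allocator along those lines might look like this (a
hypothetical sketch with the same 4-data-block stripes as the example
above; `stripe_of` and `pick_cow_block` are invented names):

```python
# Hypothetical sketch of stripe-aware CoW allocation: instead of
# overwriting block 47 inside the committed stripe 45-48, the allocator
# places the new copy in the next fully empty stripe, so no stripe that
# is referenced by a committed tree is ever RMW-ed.

STRIPE_DATA_BLOCKS = 4   # 5-disk RAID5, strip size of one block

def stripe_of(block):
    """Which stripe a data block belongs to (blocks 45-48 -> stripe 11)."""
    return (block - 1) // STRIPE_DATA_BLOCKS

def pick_cow_block(old_block, committed_stripes):
    """Never write inside a committed stripe: find an empty one instead."""
    s = stripe_of(old_block)
    while s in committed_stripes:
        s += 1                       # skip stripes a committed tree references
    return s * STRIPE_DATA_BLOCKS + 1    # first data block of the empty stripe

committed = {stripe_of(47)}          # stripe 45-48 holds committed data
print(pick_cow_block(47, committed)) # -> 49
```

Since the chosen stripe is empty, the whole stripe (padding included) can
be written out in one pass with freshly computed parity, with no read at all.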

One way to tell this is done right: all data in each RAID stripe will
always belong to exactly zero or one transaction, not to dozens of
different transactions as stripes do now.

The other way to fix things is to make stripe RMW atomic so that CoW
works properly.  You can tell this is done right if you can find a stripe
update journal in the disk format or the code.
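
The journal approach is similar in spirit to md-raid's write journal.
A toy model of the idea (purely illustrative; the dicts stand in for
journal and array storage, and the names are made up):

```python
# Hypothetical sketch of atomic stripe RMW via a stripe update journal:
# the full new stripe is persisted to the journal first, then written in
# place, then the journal entry is retired.  After a crash, replaying
# the journal makes every stripe consistent again, so the CoW layer's
# rollback to an earlier tree is safe.

journal = {}    # stripe id -> pending full-stripe write (data + parity)
disk = {}       # stripe id -> on-disk stripe contents

def write_stripe(stripe_id, blocks):
    journal[stripe_id] = blocks      # 1. log the whole update (and flush)
    disk[stripe_id] = blocks         # 2. write data and parity in place
    del journal[stripe_id]           # 3. retire the journal entry

def replay():
    """Crash recovery: reapply any stripe updates still in the journal."""
    for sid, blocks in journal.items():
        disk[sid] = blocks
    journal.clear()

# Simulate a crash between steps 1 and 2:
disk[11] = [b'E', b'F', b'G', b'H']      # stale on-disk stripe
journal[11] = [b'E', b'F', b'Z', b'H']   # logged but not yet in place
replay()
assert disk[11] == [b'E', b'F', b'Z', b'H']
```

The cost is that every stripe update is written twice, which is why the
CoW-aware allocation approach is the more attractive fix.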

> If raid0 is always
> CoW, then I don't think it's correct to consider raid5 minus parity to
> be anything like raid0 - in a Btrfs context anyway. Outside of that
> context, I understand the argument.
> -- 
> Chris Murphy

[*] We'd still need parity RAID RMW for nodatacow and PREALLOC because
neither uses the CoW layer.  That doesn't matter for nodatacow because
nodatacow is how users tell us they don't want to read their data any
more, but it has interesting implications for PREALLOC.  Maybe a solution
for PREALLOC is to do the first write strictly in RAID-stripe-sized units?
