On Fri, Oct 14, 2016 at 04:30:42PM -0600, Chris Murphy wrote:
> Also, is there RMW with raid0, or even raid10?

No. Mirroring is writing the same data in two isolated places. Striping is writing data at different isolated places. No matter which sectors you write through these layers, it does not affect the correctness of data in any sector at a different logical address. None of these use RMW--you read or write only complete sectors and act only on the specific sectors requested. Only parity RAID does RMW.

e.g. in RAID0, when you modify block 47, you may actually modify block 93 on a different disk, but there's always a 1:1 mapping between every logical and physical address. If there is a crash we go back to an earlier tree that does not contain block 47/93 so we don't care if the write was interrupted.

e.g. in RAID1, when you modify block 47, you modify physical block 47 on two separate disks. The state of disk1-block47 may be different from the state of disk2-block47 if the write is interrupted. If there is a crash we go back to an earlier tree that does not contain either copy of block 47 so we don't care about any inconsistency there.

So raid0, single, dup, raid1, and raid10 are OK--they fall into one or both of the above cases. CoW works there. None of these properties change in degraded mode with the mirroring profiles.

Parity RAID is writing data in non-isolated places. When you write to some sectors, additional sectors are implicitly modified in degraded mode (whether you are in degraded mode at the time of the writes or not). This is different from the other cases because the other cases never modify any sectors that were not explicitly requested by the upper layer. This is OK if and only if the CoW layer is aware of this behavior and works around it.

> Or is that always CoW
> for metadata and data, just like single and dup?

It's always CoW at the higher levels, even for parity RAID. The problem is that the CoW layer is not aware of the RMW behavior buried in the parity RAID layer, so the combination doesn't work properly.
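
To make the "implicitly modified in degraded mode" point concrete, here is a minimal Python sketch of an interrupted parity update. It is a toy model, not btrfs code: the block size, the XOR helper, and the choice of a 5-disk RAID5 stripe holding blocks 45-48 are all assumptions made for illustration.

```python
from functools import reduce

BLOCK = 4  # toy block size in bytes (assumption, for readability)

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(data, parity, lost):
    """Rebuild the data block at index `lost` from the surviving blocks and parity."""
    survivors = [blk for i, blk in enumerate(data) if i != lost]
    return xor_blocks(survivors + [parity])

# Hypothetical stripe: logical blocks 45, 46, 47, 48 plus one parity block.
data = [bytes([n]) * BLOCK for n in (45, 46, 47, 48)]
parity = xor_blocks(data)
original_45 = data[0]

# While the stripe is consistent, losing the disk that holds block 45 is harmless.
assert reconstruct(data, parity, lost=0) == original_45

# RMW update of block 47 is interrupted by a crash:
# the new data block reached the disk, the new parity block did not.
data[2] = b"\xff" * BLOCK

# Later the disk holding block 45 fails. Nobody asked to write block 45,
# yet its degraded-mode contents have changed because parity is stale.
damaged_45 = reconstruct(data, parity, lost=0)
print("block 45 intact after interrupted write to 47:", damaged_45 == original_45)
```

Until a disk is lost, reading block 45 directly still returns the right bytes, which is why this kind of damage can sit unnoticed on a non-degraded array.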

CoW thinks it's modifying only block 47, when in fact it's modifying an entire stripe in degraded mode. Let's assume 5-disk RAID5 with a strip size of one block for this example, and say blocks 45-48 are one RAID stripe. If there is a crash, data in blocks 45, 46, 47, and 48 may be irretrievably damaged by inconsistent modification of parity and data blocks. When we try to go back to an earlier tree that does not contain block 47, we will end up with a tree that contains corruption in one of the blocks 45, 46, or 48. This corruption will only be visible when something else goes wrong (parity mismatch, data csum failure, disk failure, or scrub) so a damaged filesystem that isn't degraded could appear to be healthy for a long time.

If the CoW layer is aware of this, it can arrange operations such that no stripe is modified while it is referenced by a committed tree. Suppose the stripe at blocks 49-52 is empty, so we write our CoW block at block 49 instead of 47. Since blocks 50-52 contain no data we care about, we don't even have to bother reading them (just fill the other blocks with zero or find some other data to write in the same commit), and we can eliminate many slow RMW operations entirely[*]. If there is a crash we just fall back to an earlier tree that does not contain block 49. This tree is not damaged because we left blocks 45-48 alone. One way to tell this is done right is all data in each RAID stripe will always belong to exactly zero or one transaction, not dozens of different transactions as stripes have now.

The other way to fix things is to make stripe RMW atomic so that CoW works properly. You can tell this is done right if you can find a stripe update journal in the disk format or the code.

> If raid0 is always
> CoW, then I don't think it's correct to consider raid5 minus parity to
> be anything like raid0 - in a Btrfs context anyway. Outside of that
> context, I understand the argument.
>
> --
> Chris Murphy

[*] We'd still need parity RAID RMW for nodatacow and PREALLOC because neither uses the CoW layer. That doesn't matter for nodatacow because nodatacow is how users tell us they don't want to read their data any more, but it has interesting implications for PREALLOC. Maybe a solution for PREALLOC is to do the first write strictly in RAID-stripe-sized units?
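
The idea of writing only in whole-stripe units (an empty stripe for the CoW case, or stripe-sized first writes for PREALLOC) can be sketched the same way. This is a rough illustration under assumed parameters (4 data blocks per stripe, a made-up `build_full_stripe` helper), not how btrfs allocates anything today: the point is only that parity is computed from the new blocks alone, with no read of existing on-disk data.

```python
from functools import reduce

BLOCK = 4            # toy block size in bytes (assumption)
DATA_PER_STRIPE = 4  # hypothetical 5-disk RAID5: 4 data blocks + 1 parity

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def build_full_stripe(new_blocks):
    """Pad new blocks to a whole stripe with zeros and compute parity.

    Nothing is read back from disk: the stripe is assumed to be empty,
    so there is no old data or old parity to merge with (no RMW).
    """
    if not 0 < len(new_blocks) <= DATA_PER_STRIPE:
        raise ValueError("need between 1 and DATA_PER_STRIPE blocks")
    padding = [bytes(BLOCK)] * (DATA_PER_STRIPE - len(new_blocks))
    stripe = list(new_blocks) + padding
    return stripe, xor_blocks(stripe)

# One CoW block destined for the empty stripe at blocks 49-52.
stripe_data, stripe_parity = build_full_stripe([b"\x2f" * BLOCK])
```

If a crash interrupts this write, only the new stripe is inconsistent, and no committed tree references it yet.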