Stefan K posted on Tue, 11 Sep 2018 13:29:38 +0200 as excerpted:

> wow, holy shit, thanks for this extended answer!
>
>> The first thing to point out here again is that it's not
>> btrfs-specific.
>
> so that mean that every RAID implemantation (with parity) has such
> Bug? I'm looking a bit, it looks like that ZFS doesn't have a write
> hole.

Every parity-raid implementation that doesn't contain specific
write-hole workarounds, yes, but some already have workarounds
built-in, as btrfs will after the planned code is
written/tested/merged/tested-more-broadly.

https://www.google.com/search?q=parity-raid+write-hole [1]

As an example, back some years ago when I was doing raid6 on mdraid, it
had the write-hole problem and I remember reading about it at the time.
However, right on the first page of hits for the above search...

LWN: A journal for MD/RAID5: https://lwn.net/Articles/665299/

Seems md/raid5's write hole was (optionally) closed in kernel 4.4 with
an optional journal device... preferably a fast ssd or nvram, to avoid
performance issues, and mirrored, to avoid the journal itself being a
single point of failure.
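For the curious, a minimal sketch of what that looks like with mdadm.
The device names are placeholders of course, and you'd want to check
the mdadm man page for your version before relying on it:

  # raid5 across three spinning disks, with the raid5/6 write journal
  # on a fast ssd/nvme partition to close the write hole (kernel 4.4+
  # plus a reasonably recent mdadm)
  mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        /dev/sdb1 /dev/sdc1 /dev/sdd1 \
        --write-journal /dev/nvme0n1p1

And per the above, you'd want that journal device itself to be
redundant (a small md raid1, say), so it doesn't become a single point
of failure.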
For me, zfs is strictly an arm's-length thing: if Oracle wanted to,
they could easily resolve the licensing issue, as they own the code,
but they haven't, which at this point can only be deliberate, and as a
result I simply don't touch it.  That isn't to say I don't recommend it
for those comfortable with, or simply willing to overlook, the
licensing issues, however, because zfs remains the most mature Linux
option for many of the same feature points that btrfs offers at a lower
maturity level.

But while I keep zfs at personal arm's length, from what I've picked up
I /believe/ zfs gets around the write hole by doing strict
copy-on-write combined with variable-length stripes.  Unlike current
btrfs, a stripe isn't always written as widely as possible, so for
instance in a 20-device raid5-alike they're able to do a 3-device and
possibly even 2-device "stripe", which, being entirely copy-on-write,
avoids the read-modify-write cycle on modified existing data that,
unless mitigated, creates the parity-raid write hole.

Variable-length striping is actually one of the possible longer-term
solutions already discussed for btrfs as well, but the
logging/journalling solution seems to be what they've decided to
implement first, and there are other tradeoffs to it (as discussed
elsewhere).  Of course, because as I've already explained I'm
interested in the 3/4-way-mirroring option that would be used for the
journal but would also be available to expand the current 2-way-raid1
mode to additional mirroring, this is absolutely fine with me! =:^)

> And it _only_ happens when the server has a ungraceful shutdown,
> caused by poweroutage? So that mean if I running btrfs raid5/6 and
> I've no poweroutages I've no problems?

Sort-of yes?  Keep in mind that a power outage isn't the /only/ way to
have an ungraceful shutdown, just one of the most common.  Should the
kernel crash or lock up for some reason (common examples include video
and occasionally network driver bugs, due to the direct access to
hardware and memory those drivers get), that can trigger an "ungraceful
shutdown" as well, altho with care (basically always trying to ssh in
for a remote shutdown if possible, and/or using alt-sysrq-reisub
sequences on apparent lockups) it's often possible to prevent those
being /entirely/ ungraceful at the hardware level.  So it's not /quite/
as bad as an abrupt power outage, or perhaps even worse a brownout that
doesn't kill writes entirely but can at least theoretically trigger
garbage scribbling in random device blocks.

So yes, sort-of, but it's not just power outages.

>> it's possible to specify data as raid5/6 and metadata as raid1
>
> does some have this in production?

I'm sure people do.
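A minimal sketch of setting up and verifying that kind of layout, with
hypothetical device names and mountpoint:

  # three-device filesystem: raid5 for data, raid1 for metadata
  mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
  mount /dev/sdb /mnt

  # confirm which profiles are actually in use: look for the
  # "Data, RAID5" and "Metadata, RAID1" lines
  btrfs filesystem df /mnt

(Leave off the -m and, as explained below, you get raid1 metadata
anyway, since that's the multi-device default.)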
(As I said, I'm a raid1 guy here, and would even be doing 3-way
mirroring for some things were it possible, so no parity-raid at all
for me personally.)

On btrfs, it is in fact the multi-device default, and thus quite
common, to have data and metadata as different profiles.  The
multi-device default for metadata, if not specified, is raid1, with
single-profile data.  So if you just specify raid5/6 data and don't
specify metadata at all, you'll get exactly what was mentioned: raid5/6
data as specified, raid1 metadata as the unspecified multi-device
default.

So were I to guess, I'd guess that a lot of people who say they're
running raid5/6 but weren't paying attention when setting up actually
only have it for data, having not specified anything for metadata, so
they got raid1 for that.

> ZFS btw have 2 copies of metadata by default, maybe it would also be
> an option or btrfs?

It actually sounds like they do hybrid raid, then, not just pure
parity-raid, but mirroring the metadata as well.  That would be in
accord with a couple of things I'd read about zfs but hadn't quite
pursued to the logical conclusion, and it would be what btrfs, as
already available, does with raid5/6 data and raid1 metadata.

> in this case you think 'btrfs fi balance start -mconvert=raid1
> -dconvert=raid5 /path ' is safe at the moment?

Provided you have backups in accordance with the "if it's more valuable
than the time/trouble/resources for the backup, it's backed up" rule,
and you're on current kernels, yes.
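If it helps, here's a rough sketch of that conversion plus a sanity
check afterward.  It's just the generic workflow: /path is whatever
your mountpoint is, and the balance can take quite a while since it
rewrites everything:

  # convert metadata to raid1 and data to raid5, on the mounted fs
  btrfs balance start -mconvert=raid1 -dconvert=raid5 /path

  # from another shell: watch progress, then confirm the new profiles
  btrfs balance status /path
  btrfs filesystem df /path

  # optional but sensible once it's done: verify checksums everywhere
  btrfs scrub start /path

(The 'btrfs fi balance start ...' form quoted above should be the same
thing, fi balance being shorthand for the filesystem balance alias.)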
>> That means small files and modifications to existing files, the ends
>> of large files, and much of the metadata, will be written twice,
>> first to the log, then to the final location.
>
> that sounds that the performance will go down? So far as I can see
> btrfs can't beat ext4 or btrfs nor zfs and then they will made it
> even slower?

That's the effect journaling[2] partial-stripe writes will have, yes.
However, parity-raid /always/ has a write-performance tradeoff (or a
space/organization tradeoff, if it does less than full-width stripes).
Traditional parity-raid /already/ has the read-modify-write problem for
partial-stripe-width writes, and partial-width stripes give
non-traditional solutions such as zfs a space-layout-efficiency problem
instead, so lower small-write performance is already a tradeoff you're
making by choosing parity-raid in the first place; journaling only
accentuates it a bit, as the price paid for closing the write hole.

The performance issue was a big part of the reason I ended up switching
from parity-raid to raid1, back in the day on mdraid.  And it turned
out I was /much/ happier with raid1, which had much better performance
than I had thought it would (the mdraid raid1 scheduler is recognized
for its high-efficiency read-scheduling and for parallel-write
scheduling, so write latency is only about the same as writes to a
single device, while many or large reads are smart-scheduled to
parallelize across all mirrors).

(The other part of the reason I switched back to raid1 on mdraid was
that I had rather naively thought that on parity-raid the parity would
be cross-checked in the standard read path, giving me integrity
checking as well.  Turns out that's not the case; parity is only used
for rebuilds in case of device loss and isn't checked on normal reads,
a great disappointment.  That's actually why I'm so looking forward to
btrfs 3- and 4-way mirroring: btrfs already has full checksumming and
routine checking on read, for data and metadata integrity, but
currently only has two-way mirroring, so if you're down a device and
the copy on the remaining device is bad, you're just out of luck,
whereas 3-way mirroring would let a device be bad and still give me a
backup if one of the two remaining copies ended up failing checksum
verification.  4-way mirroring would obviously add yet another copy,
but 3-way is the sweet-spot for me.)

The performance issue is also why mdraid recommends putting the journal
on a faster device, ssd or nvram (or better yet a mirror of them,
avoiding the single point of failure of a single journal device),
turning a slow-down into a speedup due to the write-cache.  But btrfs
doesn't have device-purpose-specification like that built in yet, so
it's either all devices, or use something like bcache with an ssd as
the front device.  (The ssd used as the bcache front device can be
partitioned to allow a single ssd to cache multiple slower backing
devices.)

OTOH, as stated, it's only smaller, less-than-stripe-width writes that
will be affected.  As soon as you're writing more than stripe width, as
with large files for data, or for metadata when copying whole subdir
trees, most of it will be full-stripe writes and thus shouldn't have to
be logged/journalled (tho I'm not sure how it's actually going to be
implemented).

Meanwhile, at least one of the other alternatives,
less-than-full-width stripes or writing partly empty full-width stripes
as necessary, of course with COW so read-modify-write is entirely
avoided, will likely eventually be available on btrfs as well.  But
that has its own tradeoffs: faster initially than the logged/journalled
solution, but less efficient initial space utilization, with a clean-up
balance likely periodically required to rewrite all the short stripes
(either less than full width or partially empty) to full width.

So all the possibilities have their tradeoffs; none is a "magic"
solution that entirely does away with the problems inherent to
parity-raid without tradeoffs of /some/ sort.  But zfs is already
(optionally? I don't know) making these tradeoffs, mdraid has its
options, and people often aren't even aware of the tradeoffs they're
taking on those solutions, so...  I suppose when it's all said and done
the only people aware of the issues on btrfs are likely going to be the
highly technical and case-optimizer crowds, too.  Everyone else will
probably just use the defaults and not even be aware of the tradeoffs
they're making by doing so, as is already the case on mdraid and zfs.

---
[1] As I'm no longer running either mdraid or parity-raid, I've not
followed this extremely closely, but writing this actually spurred me
to google the problem and see when and how mdraid fixed it.  So the
links are from that. =:^)

[2] Journalling/journaling, one or two Ls?  The spellcheck flags both,
and last I tried googling it the answer was inconclusive.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman