Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:

> sorry to disturb this discussion,
> 
> are there any plans/dates to fix the raid5/6 issue? Is somebody working
> on this issue? This is for me one of the most important things for
> a fileserver; with a raid1 config I lose too much diskspace.

There's a more technically complete discussion of this in at least two 
earlier threads you can find on the list archive, if you're interested, 
but here's the basics (well, extended basics...) from a btrfs-using-
sysadmin perspective.

"The raid5/6 issue" can refer to at least three conceptually separate 
issues, with different states of solution maturity:

1) Now generally historic bugs in btrfs scrub, etc, that are fixed (thus 
the historic) in current kernels and tools.  Unfortunately these will 
still affect users of longer-term stale^H^Hble distros who don't update 
from other sources for some time yet: because the raid56 feature wasn't 
yet stable at the lock-in time for whatever versions they stabilized on, 
they're not likely to get the fixes, as that's considered new-feature 
material.

If you're using a current kernel and tools, however, this issue is 
fixed.  You can look on the wiki for the specific versions, but with the 
4.18 kernel the current latest stable, the two latest release series are 
4.18 and 4.17, plus the matching tools versions (the version numbers are 
synced), and the two latest release series are what's best supported and 
considered "current" on this list.

Also see...

2) General feature maturity:  While raid56 mode should be /reasonably/ 
stable now, it remains one of the newer features and simply hasn't yet 
had the testing of time that tends to flush out the smaller and corner-
case bugs that more mature features such as raid1 have now had the 
benefit of.

There's nothing to do for this but test, report any bugs you find, and 
wait for the maturity that time brings.

Of course this is one of several reasons we so strongly emphasize and 
recommend "current" on this list: even for reasonably stable and mature 
features such as raid1, btrfs itself remains new enough that latent bugs 
are still occasionally found and fixed, and while /some/ of those fixes 
get backported to LTS kernels (with even less chance of distros 
backporting tools fixes), not all of them do, and even when they do, 
current still gets the fixes first.

3) The remaining issue is the infamous parity-raid write-hole that 
affects all parity-raid implementations (not just btrfs) unless they take 
specific steps to work around the issue.

The first thing to point out here again is that it's not btrfs-specific.  
Between that and the fact that it *ONLY* affects parity-raid operating in 
degraded mode *WITH* an ungraceful-shutdown recovery situation, it could 
be argued not to be a btrfs issue at all, but rather one inherent to 
parity-raid mode, an acceptable risk to those choosing parity-raid, 
because it's only a factor if an ungraceful shutdown occurs while 
already operating degraded.
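
A minimal worked example of the hole, with simple XOR parity (symbols 
only, no real numbers needed):

  stripe:  D1 XOR D2 = P              (parity consistent)
  partial update:  write D1', crash before P is updated to D1' XOR D2
  now lose the device holding D2 and reconstruct it:
    D2(reconstructed) = D1' XOR P = D1' XOR (D1 XOR D2)  !=  D2

So a block that was never even written to (D2) comes back corrupt, which 
is exactly why it takes a crash *and* a degraded array, combined, to 
bite.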

But btrfs' COW nature, along with a couple of technical implementation 
factors (the read-modify-write cycle for incomplete stripe widths, and 
how that puts existing metadata at risk when new metadata is written), 
does amplify the risk somewhat compared to the same write-hole issue in 
the various other parity-raid implementations that haven't taken 
write-hole avoidance countermeasures.


So what can be done right now?

As it happens there is a mitigation the admin can take right now -- btrfs 
allows specifying data and metadata modes separately, so even where 
raid1 loses too much space to be used for both, it's possible to specify 
data as raid5/6 and metadata as raid1.  While btrfs raid1 only covers 
loss of a single device, it doesn't have the parity-raid write-hole, as 
it's not parity-raid.  For most use-cases, at least, raid1 metadata with 
raid5 data should strictly limit both risks: the write-hole exposure is 
limited to data, which in most cases will be full-stripe writes and thus 
not subject to the problem, and the raid1 size-doubling is limited to 
metadata.
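
For example (device names and the mountpoint are placeholders, 
naturally), at mkfs time:

  # mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd

or, converting an existing filesystem with a balance:

  # btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt

Both are standard btrfs-progs invocations; check the manpages for the 
exact options your version supports, of course.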

Meanwhile, consider the sysadmin's first rule of backups: the true value 
of data is defined not by arbitrary claims, but by the number of backups 
it is considered worth the time/trouble/resources to have of that data.  
For a sysadmin properly following that rule, this is simply a known 
parity-raid risk specifically limited to the corner-case of having an 
ungraceful shutdown *WHILE* already operating degraded, and as such, it 
can be managed along with all the other known risks to the data, 
including admin fat-fingering, the risk that more devices will fail than 
the array can tolerate, the risk of general bugs in the filesystem or 
other storage-function related code, etc.

IOW, in the context of the admin's first rule of backups, no matter the 
issue, raid56 write hole or whatever other issue of the many possible, 
loss of data can *never* be a particularly big issue, because by 
definition, in *all* cases, what was of most value was saved: either the 
data, if it was defined as valuable enough to have a backup, or the 
time/trouble/resources that would otherwise have gone into making that 
backup, if the data wasn't worth it.

(One nice thing about this rule is that it covers the loss of whatever 
number of backups along with the working copy just as well as it does 
loss of just the working copy.  No matter the number of backups, the 
value of the data is either worth having one more backup, just in case, 
or it's not.  Similarly, the rule covers the age of the backup and 
updates nicely as well, as that's just a subset of the original problem, 
with the value of the data in the delta between the last backup and the 
working copy now being the deciding factor: either the risk of losing it 
is worth updating the backup, or it's not.  Same rule, applied to a data 
subset.)

So from an admin's perspective, in practice, while not entirely stable 
and mature yet, and with the known parity-raid corner-case risk of a 
crash while already operating degraded (absent mitigation steps), btrfs 
raid56 mode should now be within the acceptable risk range already well 
covered by the risk mitigation of following an appropriate backup 
policy, optionally combined with the partial write-hole-mitigation 
strategy of doing data as raid5/6 with metadata as raid1.
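
You can verify what profiles a filesystem is actually using with (path 
is a placeholder):

  # btrfs filesystem df /mnt

which lists the data/metadata/system profiles in use (btrfs filesystem 
usage gives a more detailed per-device view on newer progs).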


OK, but what is being done to better mitigate the parity-raid write-hole 
problem for the future, and when might we be able to use that mitigation?

There are a number of possible mitigation strategies, and there's 
actually code being written using one of them right now, tho it'll be (at 
least) a few kernel cycles until it's considered complete and stable 
enough for mainline, and as mentioned in #2 above, even after that it'll 
take some time to mature to reasonable stability.

The strategy being taken is partial-stripe-write logging.  Full stripe 
writes aren't affected by the write hole and (AFAIK) won't be logged, but 
partial stripe writes are read-modify-write and thus write-hole 
susceptible, and will be logged.  That means small files and 
modifications to existing files, the ends of large files, and much of the 
metadata, will be written twice, first to the log, then to the final 
location.  In the event of a crash, on reboot and mount, anything in the 
log can be replayed, thus preventing the write hole.
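
Conceptually, assuming the design stays as described above, the sequence 
for a partial-stripe write would be something like:

  1. partial-stripe (read-modify-write) write arrives
  2. the blocks are written to the log first (mirrored, see below)
  3. they're then written to their final raid56 location
  4. once the final write completes, the log entry can be dropped
  5. after a crash, any entries still in the log are replayed at mount,
     so a stripe and its parity can never stay half-updated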

As for the log, it'll be written using a new 3/4-way-mirroring mode, 
basically raid1 but mirrored more than two ways (current btrfs raid1 is 
limited to exactly two copies, even with more than two devices in the 
filesystem), thus handling the loss of multiple devices.

That's actually what's being developed ATM, the 3/4-way-mirroring mode, 
which will be available for other uses as well.
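
If/when it surfaces as ordinary profiles (the names here are entirely 
hypothetical at this point), usage could end up as simple as:

  # mkfs.btrfs -d raid6 -m raid1c3 /dev/sd[b-e]
                           ^^^^^^^ hypothetical profile name

i.e. just another metadata profile choice, with the extra copies handled 
for you.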

Actually, that's what I'm personally excited about, as years ago, when I 
first looked into btrfs, I was running older devices in mdraid's raid1 
mode, which does N-way mirroring.  I liked the btrfs data checksumming 
and scrubbing ability, but with the older devices I didn't trust having 
just two-way-mirroring and wanted at least 3-way-mirroring, so back at 
that time I skipped btrfs and stayed with mdraid.  Later I upgraded to 
ssds and decided btrfs-raid1's two-way-mirroring was sufficient, but when 
one of the ssds went prematurely bad and needed replacing, I sure would 
have felt a bit better, before I got the replacement done, if I'd still 
had good two-way-mirroring even with the bad device.

So I'm still interested in 3-way-mirroring and would probably use it for 
some things now, were it available and "stabilish", and I'm eager to see 
that code merged, not for the parity-raid logging it'll also be used 
for, but for the reliability of 3-way-mirroring itself.  Tho I'll 
probably wait at least 2-5 kernel cycles after introduction and see how 
it stabilizes before actually considering it stable enough to use 
myself.  Even tho I do follow the backups policy above, just because I'm 
not considering the updated-data delta worth an updated backup yet 
doesn't mean I want to unnecessarily risk having to redo the work since 
the last backup, which means choosing the newer 3-way-mirroring over the 
more stable and mature existing raid1 2-way-mirroring isn't going to be 
worth it to me until the 3-way-mirroring has had at least a /few/ kernel 
cycles to stabilize.

And I'd recommend the same caution with the new raid5/6 logging mode 
built on top of that multi-way-mirroring, once it's merged as well.  
Don't just jump on it immediately after merge unless you're deliberately 
doing so to help test for bugs, get them fixed, and get the feature 
stabilized as soon as possible.  Wait a few kernel cycles, follow the 
list to see how the feature's stability is coming, and /then/ use it, 
after factoring its then-still-new additional risk into your backup risk 
profile, of course.

Time?  I'm not a dev, but following the list and obviously following the 
new 3-way-mirroring work, I'd say probably not 4.20 (5.0?) for the new 
mirroring modes, so 4.21/5.1 is more reasonably likely (if all goes 
well; could be longer), and probably another couple cycles (again if all 
goes well) after that for the parity-raid logging code built on top of 
the new mirroring modes, so perhaps a year (~5 kernel cycles) to 
introduction for it.  Then wait however many cycles until you think it 
has stabilized.  Call that another year.  So say about 10 kernel cycles 
or two years.  It could be a bit less than that, say 5-7 cycles, if 
things go well and you take it before I'd really consider it stable 
enough to recommend, but given the historically much-longer-than-
predicted development and stabilization times for raid56 already, it 
could just as easily end up double that, 4-5 years out, too.

But raid56 logging mode for write-hole mitigation is indeed actively 
being worked on right now.  That's what we know at this time.

And even before that, right now, raid56 mode should already be reasonably 
usable, especially if you do data raid5/6 and metadata raid1, as long as 
your backup policy and practice is equally reasonable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
