Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:

> sorry for disturb this discussion,
>
> are there any plans/dates to fix the raid5/6 issue? Is somebody working
> on this issue? Cause this is for me one of the most important things for
> a fileserver, with a raid1 config I loose to much diskspace.
There's a more technically complete discussion of this in at least two earlier threads you can find in the list archive, if you're interested, but here are the basics (well, extended basics...) from a btrfs-using-sysadmin perspective.

"The raid5/6 issue" can refer to at least three conceptually separate issues, with different states of solution maturity:

1) Now generally historic bugs in btrfs scrub, etc, that are fixed (thus the historic) in current kernels and tools. Unfortunately these will still affect many users of longer-term stale^H^Hble distros for some time: because the raid56 feature wasn't yet stable at the lock-in time for whatever versions they stabilized on, the fixes count as new-feature material and are unlikely to be backported.

If you're using a current kernel and tools, however, this issue is fixed. You can look on the wiki for the specific versions, but with 4.18 the current latest stable kernel, it and 4.17 (and the matching tools versions, since the version numbers are synced) are the two latest release series, which are the series best supported and considered "current" on this list. Also see...

2) General feature maturity: While raid56 mode should be /reasonably/ stable now, it remains one of the newer features and simply hasn't yet had the testing of time that tends to flush out the smaller and corner-case bugs, testing that more mature features such as raid1 have now had the benefit of. There's nothing to do for this but test, report any bugs you find, and wait for the maturity that time brings.
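As a quick sanity check before anything else, the running kernel and btrfs-progs versions are easy to confirm; a minimal sketch (what counts as "current enough" is the policy question discussed above, not something the commands decide for you):

```shell
# Show the running kernel version -- for raid56 you want a current
# release series (4.17/4.18 as of this writing).
uname -r

# Show the btrfs-progs version; version numbers are synced with the
# kernel releases, so ideally this roughly matches the kernel series.
# (Guarded so the check still completes on a box without the tools.)
btrfs --version 2>/dev/null || echo "btrfs-progs not installed"
```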
Of course this is one of several reasons we so strongly emphasize and recommend "current" on this list: even for reasonably stable and mature features such as raid1, btrfs itself remains new enough that latent bugs are still occasionally found and fixed, and while /some/ of those fixes get backported to LTS kernels (with even less chance of distros backporting tools fixes), not all of them do, and even when they do, current still gets the fixes first.

3) The remaining issue is the infamous parity-raid write hole, which affects all parity-raid implementations (not just btrfs) unless they take specific steps to work around it.

The first thing to point out here, again, is that it's not btrfs-specific. Between that and the fact that it *ONLY* bites when an ungraceful shutdown must be recovered while the array is operating degraded, it could be argued not to be a btrfs issue at all, but rather one inherent to parity-raid mode, and an acceptable risk to those choosing parity-raid, since it's only a factor when operating degraded if an ungraceful shutdown does occur. But btrfs' COW nature, along with a couple of technical implementation factors (the read-modify-write cycle for incomplete stripe widths, and the way that risks existing metadata when new metadata is written), does amplify the risk somewhat compared to other parity-raid implementations that avoid it by taking write-hole-avoidance countermeasures.

So what can be done right now? As it happens there is a mitigation the admin can take today: btrfs allows specifying data and metadata modes separately, so even where raid1 loses too much space to be used for both, it's possible to specify data as raid5/6 and metadata as raid1.
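Concretely, the split-profile setup looks something like this (device names and the /mnt mountpoint are hypothetical examples; check the mkfs.btrfs and btrfs-balance manpages before running anything, as mkfs destroys existing data):

```shell
# New filesystem: raid5 for data, raid1 for metadata.
# WARNING: mkfs wipes the listed devices (hypothetical names here).
mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd

# Or convert an existing, mounted filesystem's chunks in place
# (can take a long time on a large filesystem):
btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt

# Verify the resulting data/metadata profiles:
btrfs filesystem df /mnt
```

The balance-convert route is handy because it works on a live filesystem, tho you'd want all devices healthy before starting it.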
While btrfs raid1 only covers the loss of a single device, it doesn't have the parity-raid write hole, because it isn't parity-raid. For most use-cases at least, specifying raid1 for metadata and raid5 for data strictly limits both problems: the write-hole risk is limited to data, which in most cases means full-stripe writes that aren't subject to it, and the size-doubling of raid1 is limited to metadata.

Meanwhile, consider the sysadmin's first rule of backups: the true value of data isn't defined by arbitrary claims, but by the number of backups it is considered worth the time/trouble/resources to have. For an admin properly following that rule, the write hole is a known parity-raid risk specifically limited to the corner case of an ungraceful shutdown *WHILE* already operating degraded, and as such it can be managed along with all the other known risks to the data: admin fat-fingering, more devices failing than the array can tolerate, general bugs in the filesystem or other storage-related code, etc.

IOW, in the context of the admin's first rule of backups, no matter the issue -- raid56 write hole or whatever other issue of the many possible -- loss of data can *never* be a particularly big issue, because by definition, in *all* cases, what was of most value was saved: either the data, if it was defined as valuable enough to have a backup, or the time/trouble/resources that would otherwise have gone into making that backup, if the data wasn't worth it.

(One nice thing about this rule is that it covers the loss of any number of backups along with the working copy just as well as it covers loss of just the working copy. No matter the number of backups, the value of the data is either worth having one more backup, just in case, or it's not.
Similarly, the rule covers the age of the backup nicely as well, as that's just a subset of the original problem: the deciding factor is now the value of the delta between the last backup and the working copy -- either the risk of losing it is worth updating the backup, or it isn't. Same rule, applied to a data subset.)

So from an admin's perspective, in practice: while not entirely stable and mature yet, and with the known corner-case risk of a crash while already degraded that applies to parity-raid generally unless mitigation steps are taken, btrfs raid56 mode should now be within the acceptable risk range already well covered by following an appropriate backup policy, optionally combined with the partial write-hole mitigation of doing data as raid5/6 with metadata as raid1.

OK, but what is being done to better mitigate the parity-raid write hole for the future, and when might we be able to use that mitigation?

There are a number of possible mitigation strategies, and code is actually being written using one of them right now, tho it'll be (at least) a few kernel cycles until it's considered complete and stable enough for mainline, and as mentioned in #2 above, even after that it'll take some time to mature to reasonable stability.

The strategy being taken is partial-stripe-write logging. Full-stripe writes aren't affected by the write hole and (AFAIK) won't be logged, but partial-stripe writes are read-modify-write and thus write-hole susceptible, and will be logged. That means small files, modifications to existing files, the ends of large files, and much of the metadata will be written twice: first to the log, then to the final location. In the event of a crash, on reboot and mount anything in the log can be replayed, thus preventing the write hole.
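Until that logging lands, the main manual mitigation after an ungraceful shutdown on a raid56 array is a scrub, which uses the checksums to detect inconsistent stripes and repair them from the surviving copies/parity while all devices are still present -- a sketch, assuming the filesystem is mounted at /mnt:

```shell
# After an unclean shutdown, scrub before anything else goes wrong,
# so any checksum-detected inconsistencies get rebuilt from parity
# while the array is still non-degraded:
btrfs scrub start /mnt

# Check progress and the error counts once it completes:
btrfs scrub status /mnt
```

This doesn't close the write hole itself (a crash plus a later device loss can still bite), but it shrinks the window during which a stale stripe sits waiting to be reconstructed wrongly.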
As for the log, it'll be written using a new 3/4-way-mirroring mode: basically raid1, but mirrored more than two ways (current btrfs raid1 is limited to two copies, even with more than two devices in the filesystem), thus handling the loss of multiple devices. That mirroring mode is what's actually being developed ATM, and it will be available for other uses as well.

Actually, that's what I'm personally excited about. Years ago, when I first looked into btrfs, I was running older devices in mdraid's raid1 mode, which does N-way mirroring. I liked the btrfs data checksumming and scrubbing ability, but with the older devices I didn't trust just two-way mirroring and wanted at least three-way, so back then I skipped btrfs and stayed with mdraid. Later I upgraded to ssds and decided btrfs raid1's two-way mirroring was sufficient, but when one of the ssds went bad prematurely and needed replacing, I'd sure have felt a bit better, before I got the replacement done, if I'd still had good two-way mirroring even with the bad device.

So I'm still interested in 3-way mirroring and would probably use it for some things now, were it available and "stabilish", and I'm eager to see that code merged -- not for the parity-raid logging it'll also be used for, but for the reliability of 3-way mirroring itself. Tho I'll probably wait at least 2-5 kernel cycles after introduction and see how it stabilizes before actually considering it stable enough to use myself: even tho I do follow the backups policy above, just because I don't consider the updated-data delta worth an updated backup yet doesn't mean I want to unnecessarily risk having to redo the work since the last backup. Which means choosing the newer 3-way mirroring over the more stable and mature existing raid1 2-way mirroring isn't going to be worth it to me until the 3-way mode has had at least a /few/ kernel cycles to stabilize.
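If/when the N-way-mirroring mode lands, usage will presumably follow the existing profile-conversion syntax -- a purely hypothetical sketch (the profile name "raid1c3" below is my assumption, not anything merged; the actual name and syntax are up to the devs):

```shell
# HYPOTHETICAL: once an N-way-mirroring profile exists (name assumed
# to be "raid1c3" here), converting metadata to three copies would
# presumably look like today's balance-convert:
btrfs balance start -mconvert=raid1c3 /mnt
```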
And I'd recommend the same caution with the new raid5/6 logging mode built on top of that multi-way mirroring, once it's merged as well. Don't jump on it immediately after merge unless you're deliberately doing so to help test for bugs and get the feature stabilized as soon as possible. Wait a few kernel cycles, follow the list to see how the feature's stability is coming along, and /then/ use it -- after factoring its additional risk, still new and less mature at that point, into your backup risk profile, of course.

Time? Not a dev, but following the list and obviously following the new 3-way mirroring, I'd say probably not 4.20 (5.0?) for the new mirroring modes, so 4.21/5.1 more reasonably likely (if all goes well; could be longer), and probably another couple cycles (again, if all goes well) after that for the parity-raid logging code built on top of them, so perhaps a year (~5 kernel cycles) to introduction. Then wait however many cycles until you think it has stabilized; call that another year. So say about 10 kernel cycles, or two years. It could be a bit less than that, say 5-7 cycles, if things go well and you take it before I'd really consider it stable enough to recommend, but given the historically much-longer-than-predicted development and stabilization times for raid56 already, it could just as easily end up double that, 4-5 years out, too.

But raid56 logging mode for write-hole mitigation is indeed actively being worked on right now. That's what we know at this time. And even before that, right now, raid56 mode should already be reasonably usable, especially if you do data raid5/6 and metadata raid1, as long as your backup policy and practice is equally reasonable.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman