wow, holy shit, thanks for this extended answer!

> The first thing to point out here again is that it's not btrfs-specific.  
So that means every RAID implementation (with parity) has such a bug? I've 
looked around a bit, and it looks like ZFS doesn't have a write hole. And it 
_only_ happens when the server has an ungraceful shutdown, e.g. caused by a 
power outage? So does that mean if I run btrfs raid5/6 and have no power 
outages, I have no problems?

>  it's possible to specify data as raid5/6 and metadata as raid1
Does somebody run this in production? ZFS, by the way, keeps 2 copies of 
metadata by default; maybe that would also be an option for btrfs?
In this case, do you think 'btrfs fi balance start -mconvert=raid1 
-dconvert=raid5 /path' is safe at the moment?
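
To be concrete, I mean running something like this on the mounted filesystem 
(/path is just a placeholder for my mountpoint), then checking the result:

  btrfs balance start -mconvert=raid1 -dconvert=raid5 /path
  btrfs filesystem df /path    # Data should then show RAID5, Metadata RAID1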

> That means small files and modifications to existing files, the ends of large 
> files, and much of the 
> metadata, will be written twice, first to the log, then to the final 
> location. 
That sounds like the performance will go down? As far as I can see, btrfs 
can't beat ext4 or zfs as it is, and this would make it even slower?

Thanks in advance!

best regards
Stefan



On Saturday, September 8, 2018 8:40:50 AM CEST Duncan wrote:
> Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:
> 
> > sorry for disturb this discussion,
> > 
> > are there any plans/dates to fix the raid5/6 issue? Is somebody working
> > on this issue? Cause this is for me one of the most important things for
> > a fileserver; with a raid1 config I lose too much diskspace.
> 
> There's a more technically complete discussion of this in at least two 
> earlier threads you can find on the list archive, if you're interested, 
> but here's the basics (well, extended basics...) from a btrfs-using-
> sysadmin perspective.
> 
> "The raid5/6 issue" can refer to at least three conceptually separate 
> issues, with different states of solution maturity:
> 
> 1) Now generally historic bugs in btrfs scrub, etc., that are fixed (thus 
> the "historic") in current kernels and tools.  Unfortunately these will 
> still affect many users of longer-term stale^H^Hble distros who don't 
> update from other sources, because the raid56 feature wasn't yet stable 
> at the lock-in time for whatever versions they stabilized on, so they're 
> not likely to get the fixes, as it's new-feature material.
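> 
> For anyone on a current kernel and tools who wants to verify their own 
> array, a scrub is the usual check -- just a sketch, with /mnt standing in 
> for wherever the filesystem is mounted:
> 
>   btrfs scrub start /mnt     # start a scrub in the background
>   btrfs scrub status /mnt    # progress and error counters so far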
> 
> If you're using a current kernel and tools, however, this issue is 
> fixed.  You can look on the wiki for the specific versions, but with the 
> 4.18 kernel the current latest stable, it and 4.17, along with the 
> matching tools versions (the version numbers are synced), are the two 
> latest release series, which are the best supported and considered 
> "current" on this list.
> 
> Also see...
> 
> 2) General feature maturity:  While raid56 mode should be /reasonably/ 
> stable now, it remains one of the newer features and simply hasn't yet 
> had the testing of time that tends to flush out the smaller and corner-
> case bugs -- testing that more mature features such as raid1 have now had 
> the benefit of.
> 
> There's nothing to do for this but test, report any bugs you find, and 
> wait for the maturity that time brings.
> 
> Of course this is one of several reasons we so strongly emphasize and 
> recommend "current" on this list, because even for reasonably stable and 
> mature features such as raid1, btrfs itself remains new enough that they 
> still occasionally get latent bugs found and fixed, and while /some/ of 
> those fixes get backported to LTS kernels (with even less chance for 
> distros to backport tools fixes), not all of them do and even when they 
> do, current still gets the fixes first.
> 
> 3) The remaining issue is the infamous parity-raid write-hole that 
> affects all parity-raid implementations (not just btrfs) unless they take 
> specific steps to work around the issue.
> 
> The first thing to point out here again is that it's not btrfs-specific.  
> Between that and the fact that it *ONLY* affects parity-raid operating in 
> degraded mode *WITH* an ungraceful-shutdown recovery situation, it could 
> be argued not to be a btrfs issue at all, but rather one inherent to 
> parity-raid mode and considered an acceptable risk to those choosing 
> parity-raid because it's only a factor when operating degraded, if an 
> ungraceful shutdown does occur.
> 
> But btrfs' COW nature, along with a couple technical implementation 
> factors (the read-modify-write cycle for incomplete stripe widths and how 
> that risks existing metadata when new metadata is written), does amplify 
> the risk somewhat compared to the same write-hole issue as seen in 
> various other parity-raid implementations that don't take write-hole 
> avoidance countermeasures.
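> 
> As a deliberately toy illustration of the mechanics (plain shell 
> arithmetic, made-up block values, nothing btrfs-specific):
> 
>   D1=0xA; D2=0xC
>   P=$(( D1 ^ D2 ))       # XOR parity, consistent with D1 and D2
>   D2=0x3                 # crash mid-update: new D2 hits disk, parity doesn't
>   echo $(( P ^ D2 ))     # degraded reconstruction of D1 prints 5, not 0xA
> 
> The block that gets silently corrupted (D1) wasn't even part of the 
> interrupted write -- that's the write hole.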
> 
> 
> So what can be done right now?
> 
> As it happens there is a mitigation the admin can currently take -- btrfs 
> allows specifying data and metadata modes separately, and even where 
> raid1 loses too much space to be used for both, it's possible to specify 
> data as raid5/6 and metadata as raid1.  While btrfs raid1 only covers 
> loss of a single device, it doesn't have the parity-raid write-hole, as 
> it's not parity-raid.  For most use-cases at least, specifying raid1 for 
> metadata only, with raid5 for data, should strictly limit both the risk 
> of the parity-raid write-hole (it'll be limited to data, which in most 
> cases will be full-stripe writes and thus not subject to the problem) and 
> the size-doubling of raid1 (it'll be limited to metadata).
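> 
> As a sketch of that setup (device names and mountpoint are placeholders, 
> not a layout recommendation), a fresh filesystem with the split profiles 
> would be created and verified like this:
> 
>   mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
>   mount /dev/sdb /mnt
>   btrfs filesystem usage /mnt   # Data should show RAID5, Metadata RAID1
> 
> An existing filesystem can be converted in place with the balance 
> convert filters instead (-dconvert/-mconvert).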
> 
> Meanwhile, arguably, for a sysadmin properly following the sysadmin's 
> first rule of backups -- the true value of data isn't defined by 
> arbitrary claims, but by the number of backups it is considered worth the 
> time/trouble/resources to have of that data -- this is a known parity-
> raid risk specifically limited to the corner-case of having an ungraceful 
> shutdown *WHILE* already operating degraded.  As such, it can be managed 
> along with all the other known risks to the data, including admin fat-
> fingering, the risk that more devices will go out than the array can 
> tolerate, the risk of general bugs affecting the filesystem or other 
> storage-function related code, etc.
> 
> IOW, in the context of the admin's first rule of backups, no matter the 
> issue, raid56 write hole or whatever other issue of the many possible, 
> loss of data can *never* be a particularly big issue, because by 
> definition, in *all* cases, what was of most value was saved, either the 
> data if it was defined as valuable enough to have a backup, or the time/
> trouble/resources that would have otherwise gone into making that backup, 
> if the data wasn't worth it to have a backup.
> 
> (One nice thing about this rule is that it covers the loss of whatever 
> number of backups along with the working copy just as well as it does 
> loss of just the working copy.  No matter the number of backups, the 
> value of the data is either worth having one more backup, just in case, 
> or it's not.  Similarly, the rule covers the age of the backup and 
> updates nicely as well, as that's just a subset of the original problem, 
> with the value of the data in the delta between the last backup and the 
> working copy now being the deciding factor, either the risk of losing it 
> is worth updating the backup, or not, same rule, applied to a data 
> subset.)
> 
> So from an admin's perspective, in practice, while not entirely stable 
> and mature yet, and with the risk of the already-degraded crash-case 
> corner-case that's already known to apply to parity-raid unless 
> mitigation steps are taken, btrfs raid56 mode should now be within the 
> acceptable risk range already well covered by the risk mitigation of 
> following an appropriate backup policy, optionally combined with the 
> partial write-hole-mitigation strategy of doing data as raid5/6, with 
> metadata as raid1.
> 
> 
> OK, but what is being done to better mitigate the parity-raid write-hole 
> problem for the future, and when might we be able to use that mitigation?
> 
> There are a number of possible mitigation strategies, and there's 
> actually code being written using one of them right now, tho it'll be (at 
> least) a few kernel cycles until it's considered complete and stable 
> enough for mainline, and as mentioned in #2 above, even after that it'll 
> take some time to mature to reasonable stability.
> 
> The strategy being taken is partial-stripe-write logging.  Full stripe 
> writes aren't affected by the write hole and (AFAIK) won't be logged, but 
> partial stripe writes are read-modify-write and thus write-hole 
> susceptible, and will be logged.  That means small files and 
> modifications to existing files, the ends of large files, and much of the 
> metadata, will be written twice, first to the log, then to the final 
> location.  In the event of a crash, on reboot and mount, anything in the 
> log can be replayed, thus preventing the write hole.
> 
> As for the log, it'll be written using a new 3/4-way-mirroring mode, 
> basically raid1 but mirrored more than two ways (which current btrfs 
> raid1 is limited to, even with more than two devices in the filesystem), 
> thus handling the loss of multiple devices.
> 
> That's actually what's being developed ATM, the 3/4-way-mirroring mode, 
> which will be available for other uses as well.
> 
> Actually, that's what I'm personally excited about, as years ago, when I 
> first looked into btrfs, I was running older devices in mdraid's raid1 
> mode, which does N-way mirroring.  I liked the btrfs data checksumming 
> and scrubbing ability, but with the older devices I didn't trust having 
> just two-way-mirroring and wanted at least 3-way-mirroring, so back at 
> that time I skipped btrfs and stayed with mdraid.  Later I upgraded to 
> ssds and decided btrfs-raid1's two-way-mirroring was sufficient, but when 
> one of the ssds ended up prematurely bad and needed replacing, I sure 
> would have felt a bit better, before I got the replacement done, if I'd 
> still had good two-way-mirroring even with the bad device.
> 
> So I'm still interested in 3-way-mirroring and would probably use it for 
> some things now, were it available and "stabilish", and I'm eager to see 
> that code merged, not for the parity-raid logging it'll also be used for, 
> but for the reliability of 3-way-mirroring.  Tho I'll probably wait at 
> least 2-5 kernel cycles after introduction and see how it stabilizes 
> before actually considering it stable enough to use myself.  Even tho I 
> do follow the backups policy above, just because I'm not considering the 
> updated-data delta worth an updated backup yet doesn't mean I want to 
> unnecessarily risk having to redo the work since the last backup.  That 
> means choosing the newer 3-way-mirroring over the more stable and mature 
> existing raid1 2-way-mirroring isn't going to be worth it to me until the 
> 3-way-mirroring has had at least a /few/ kernel cycles to stabilize.
> 
> And I'd recommend the same caution with the new raid5/6 logging mode 
> built on top of that multi-way-mirroring, once it's merged as well.  
> Don't just jump on it immediately after merge unless you're deliberately 
> doing so to help test for bugs and get them fixed and the feature 
> stabilized as soon as possible.  Wait a few kernel cycles, follow the 
> list to see how the feature's stability is coming, and /then/ use it, 
> after factoring its then still new and less mature additional risk into 
> your backup risk profile, of course.
> 
> Time?  Not a dev but following the list and obviously following the new 3-
> way-mirroring, I'd say probably not 4.20 (5.0?) for the new mirroring 
> modes, so 4.21/5.1 more reasonably likely (if all goes well, could be 
> longer), probably another couple cycles (if all goes well) after that for 
> the parity-raid logging code built on top of the new mirroring modes, so 
> perhaps a year (~5 kernel cycles) to introduction for it.  Then wait 
> however many cycles until you think it has stabilized.  Call that another 
> year.  So say about 10 kernel cycles or two years.  It could be a bit 
> less than that, say 5-7 cycles, if things go well and you take it before 
> I'd really consider it stable enough to recommend, but given the 
> historically much longer than predicted development and stabilization 
> times for raid56 already, it could just as easily end up double that, 4-5 
> years out, too.
> 
> But raid56 logging mode for write-hole mitigation is indeed actively 
> being worked on right now.  That's what we know at this time.
> 
> And even before that, right now, raid56 mode should already be reasonably 
> usable, especially if you do data raid5/6 and metadata raid1, as long as 
> your backup policy and practice is equally reasonable.
> 
> 
