Gandalf Corvotempesta posted on Wed, 25 Apr 2018 14:30:42 +0200 as
excerpted:

> For me, RAID56 is mandatory. Any ETA for a stable RAID56 ?
> Is something we should expect this year, next year, next 10 years, ....
> ?

It's complicated... is the best short answer to that.  Here's my take at 
a somewhat longer, admin/user-oriented answer (as I'm not a dev, just a 
btrfs user and list regular).

AFAIK, the current status of raid56/parity-raid is "no known major bugs 
left in the current code, but one major caveat": the 'degraded-mode 
parity-raid write hole' that's common to parity-raid in general unless 
worked around in some other way.  It arguably has somewhat more 
significance on btrfs than on other parity-raid implementations, because 
the current raid56 implementation doesn't checksum the parity itself, 
thus losing some of the data integrity safeguards people normally choose 
btrfs for in the first place.  The implications are particularly 
disturbing with regard to metadata, because due to parity-raid's 
read-modify-write cycle it's not just newly written/changed data/metadata 
that's put at risk, but potentially old and otherwise stable data as well.
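
To make the write hole concrete, here's a minimal toy sketch in C of the 
read-modify-write cycle and the crash window it opens.  This is 
illustration only, not btrfs code; the strip count and sizes are made up 
for the demo:

#include <stdint.h>
#include <string.h>

#define NDATA 3         /* data strips per stripe */
#define STRIP 4         /* bytes per strip, tiny for the demo */

/* parity is the XOR of all the data strips */
static void xor_parity(uint8_t d[NDATA][STRIP], uint8_t p[STRIP])
{
        for (int b = 0; b < STRIP; b++) {
                p[b] = 0;
                for (int i = 0; i < NDATA; i++)
                        p[b] ^= d[i][b];
        }
}

int main(void)
{
        uint8_t data[NDATA][STRIP] = { "AAA", "BBB", "CCC" };
        uint8_t newdata[STRIP] = "XXX";
        uint8_t parity[STRIP];

        xor_parity(data, parity);

        /* Read-modify-write of strip 1 only: fold the old data out of
         * the parity and the new data in.  On real devices the parity
         * write and the data write are separate, non-atomic I/Os. */
        for (int b = 0; b < STRIP; b++)
                parity[b] ^= data[1][b] ^ newdata[b];

        /* A crash HERE, with parity updated but the data strip still
         * old (or the reverse ordering), leaves parity inconsistent
         * with the stripe.  If a device then dies, reconstruction XORs
         * the survivors and returns garbage -- even for strips 0 and 2,
         * which this write never touched.  That's the write hole. */
        memcpy(data[1], newdata, STRIP);
        return 0;
}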

Again, this is a known issue with parity-raid in general; it simply has 
additional implications on btrfs.  But because it's such a well known 
issue, there are well accepted mitigations available.  *If* your storage 
plans account for it with sufficient safeguards, such as a good (tested) 
backup routine whose number and frequency of backups actually reflect the 
value you place on your data...  (Data without a backup is simply being 
defined as worth less than the time/trouble/resources necessary to make 
that backup, because if it were worth more, there'd *BE* that backup.)

... Then AFAIK at this point the only thing btrfs raid56 mode lacks, 
stability-wise, is the test of time, since until recently there *were* 
severe known bugs, and altho they've now been fixed, the fixes are recent 
enough that other bugs may well remain to show themselves now that the 
older ones are out of the way.

My own suggestion for such time-testing is a year, about five kernel 
cycles, after the last known severe bug has been fixed.  If there's no 
hint of further reset-the-clock level bugs in that time, then it's 
reasonable to consider deployment beyond testing, still with some caution 
and additional safeguards.


Meanwhile, as others have mentioned, there are a number of proposals out 
there for write-hole mitigation.

The theoretically cleanest, but also the most intensive since it requires 
rewriting and retesting much of the existing raid56 code, would be 
rewriting raid56 mode to COW and checksum the parity as well.  If this 
happens, it's almost certainly at least five years out to well-tested, 
and could well be a decade out.
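
As a sketch of what checksummed parity would buy (the struct layout and 
names below are hypothetical illustration, not the actual btrfs on-disk 
format): if each parity strip carried its own csum the way data and 
metadata already do, a scrub or degraded read could tell stale parity 
from bad data instead of trusting the parity blindly:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in checksum (FNV-1a).  Btrfs already checksums data and metadata
 * (crc32c), but current raid56 parity carries no csum at all. */
static uint32_t csum(const uint8_t *buf, size_t len)
{
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++)
                h = (h ^ buf[i]) * 16777619u;
        return h;
}

#define STRIP 16                /* tiny strip for the demo */

struct parity_strip {           /* hypothetical layout */
        uint8_t  bytes[STRIP];
        uint32_t stored_csum;   /* written (and COW'd) with the parity */
};

/* Degraded read: only reconstruct from parity that verifies, so stale
 * parity left behind by a write hole is detected, not silently used. */
static bool parity_usable(const struct parity_strip *p)
{
        return csum(p->bytes, STRIP) == p->stored_csum;
}

int main(void)
{
        struct parity_strip p = { .bytes = "0123456789abcde" };

        p.stored_csum = csum(p.bytes, STRIP);
        printf("clean parity usable: %d\n", parity_usable(&p));

        p.bytes[0] ^= 0xff;     /* simulate stale/torn parity */
        printf("stale parity usable: %d\n", parity_usable(&p));
        return 0;
}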

Another possibility is borrowing a technique from zfs: doing stripes of 
varying size (a varying number of strips, fewer than the total number of 
devices) depending on how much data is being written.  Btrfs raid56 mode 
can already deal with this to some extent, and does so when some devices 
are smaller than others and thus run out of space, so stripes written 
after that point don't include them.  A similar situation occurs when 
devices are added, until a balance redoes the existing stripes to take 
the new device into account.  What btrfs raid56 mode /could/ do is extend 
this to handle small writes much as zfs does, deliberately writing less 
than full-width stripes when there's less data, thus avoiding 
read-modify-write of existing data/metadata.  A balance could then be 
scheduled periodically to restripe these "short stripes" to full width.
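
A sketch of the width decision under that scheme (the function and 
parameter names here are mine, purely illustrative): derive the number of 
data strips from the write size, so a small write becomes a complete 
narrow stripe written whole, rather than a read-modify-write of an 
existing full-width one:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: how many data strips should a new stripe get?
 * Small writes get narrow stripes (written whole, so no RMW); anything
 * big enough gets the full width.  raid5 adds one parity strip. */
static int stripe_data_width(size_t write_bytes, int ndevs, size_t strip_bytes)
{
        size_t need = (write_bytes + strip_bytes - 1) / strip_bytes;
        int max_data = ndevs - 1;       /* one strip per stripe is parity */

        if (need < 1)
                need = 1;
        return need < (size_t)max_data ? (int)need : max_data;
}

int main(void)
{
        /* 6 devices, 64 KiB strips: a 16 KiB write gets a 1+1 stripe,
         * a 1 MiB write gets the full 5+1. */
        printf("%d\n", stripe_data_width(16 << 10, 6, 64 << 10));  /* 1 */
        printf("%d\n", stripe_data_width(1 << 20, 6, 64 << 10));   /* 5 */
        return 0;
}

The periodic balance would then rewrite those narrow 1+1 stripes to the 
full 5+1 width to recover the lost space efficiency.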

A variant of the short-stripe scheme above would simply write full-width, 
but partially empty, stripes.  Both of these should be less work to code 
than the first/cleanest solution above, since to a large extent they 
simply repurpose existing code, but they're somewhat more complicated and 
thus potentially more bug-prone, and both would require periodic 
rebalancing of the short or partially empty stripes to full width for 
full efficiency.

Finally, there's the possibility of logging partial-width writes before 
actually writing them.  This would be an extension to existing code, but 
would mean small writes hit the disk twice, once to the log and then 
again to main storage at full stripe width with parity.  As a result it'd 
slow things down (tho only for less-than-full-width stripe writes; 
full-width writes would proceed as normal, since they don't involve the 
risky read-modify-write cycle), but people don't choose parity-raid for 
write speed anyway, /because/ of the read-modify-write penalty it imposes.
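
In outline it's the generic logged-RAID technique (mdraid's raid5/6 write 
journal does essentially this); the step names in this little C mock-up 
are illustrative, not a btrfs design:

#include <stdio.h>

/* Each step stands in for real device I/O plus the needed barriers. */
static void log_append(const char *what)
{
        printf("log:   append %s\n", what);
}

static void log_flush(void)
{
        printf("log:   flush/barrier\n");
}

static void stripe_write(const char *what)
{
        printf("array: %s\n", what);
}

static void log_retire(void)
{
        printf("log:   entry retired\n");
}

int main(void)
{
        /* 1. The small write, and the parity it implies, go to the log
         *    first, and the log is flushed before anything else moves. */
        log_append("new data + new parity, stripe 42");
        log_flush();

        /* 2. Only then is the stripe updated in place.  A crash anywhere
         *    in this step is now recoverable: on the next mount the log
         *    is replayed and the stripe is consistent again, so there is
         *    no write hole. */
        stripe_write("read-modify-write of stripe 42");

        /* 3. Once the in-place write is known durable, the log entry is
         *    dropped.  Net cost: every small write hits the disk twice. */
        log_retire();
        return 0;
}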

This logging solution should involve the least change to existing code, 
and thus should be the fastest to implement, with the least chance of 
introducing new bugs, so the testing and bugfixing cycle should be 
shorter as well.  But ouch, that logged-write penalty, on top of the 
read-modify-write penalty that short-stripe writes on parity-raid already 
incur, will really do a number on performance!  Still, it /should/ 
finally fix the write-hole risk, and it'd be the fastest way to do so on 
top of existing code, with the least risk of additional bugs, because 
it's the least new code to write.


What I personally suspect will happen is this last solution in the 
shorter term, tho it'll still take some years to be written and tested to 
stability, with the possibility of someone undertaking a btrfs parity-
raid-g2 project implementing the first/cleanest possibility in the longer 
term, say a decade out (which effectively means "whenever someone with 
the skills and motivation decides to try it": could be 5 years out if 
they start today and devote the time to it, could be 15 years out, or 
never, if nobody ever decides to do it).  I honestly don't see the 
intermediate possibilities as worth the trouble, as they'd take too long 
for not enough payback compared to the solutions at either end, but of 
course, someone might come along who likes that angle and actually 
implements it instead.  As always with FLOSS, the one actually doing the 
implementation is the one who decides (subject to maintainer veto, of 
course, and possibly to distros shipping, and mainline ultimately 
accepting, the de facto situation over the maintainer's objections, as 
well).


A single paragraph summary answer?

The current raid56 status quo is semi-stable and, subject to testing over 
time, is likely to remain there for some time, with the known parity-raid 
write-hole caveat as the biggest issue.  There's discussion of attempts 
to mitigate the write hole, but the final form such mitigation will take 
remains to be settled.  The shortest-to-stability alternative, logged 
partial-stripe writes, has serious performance negatives, but that might 
be acceptable given that parity-raid already has read-modify-write 
performance issues, so people don't choose it for write performance in 
any case; that'd probably be 3 years out to stability at the earliest.  
There's a cleaner alternative, but it'd be /much/ farther out, as it'd 
involve a pretty heavy rewrite along with the long testing and bugfix 
cycle that implies, so ~10 years out, if ever.  And there are a couple of 
intermediate alternatives as well, but unless something changes I don't 
really see them going anywhere.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
