On 28/06/16 22:25, Austin S. Hemmelgarn wrote:
> On 2016-06-28 08:14, Steven Haigh wrote:
>> On 28/06/16 22:05, Austin S. Hemmelgarn wrote:
>>> On 2016-06-27 17:57, Zygo Blaxell wrote:
>>>> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>>>>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>>>>> <ahferro...@gmail.com> wrote:
>>>>>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>>>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>>>>> <ahferro...@gmail.com> wrote:
>>>>>>>
>>>>>>> OK but hold on. During scrub, it should read data, compute
>>>>>>> checksums *and* parity, and compare those to what's on-disk -
>>>>>>> EXTENT_CSUM in the checksum tree, and the parity strip in the
>>>>>>> chunk tree. And if parity is wrong, then it should be replaced.
>>>>>>
>>>>>> Except that's horribly inefficient. With limited exceptions
>>>>>> involving highly situational co-processors, computing a checksum
>>>>>> of a parity block is always going to be faster than computing
>>>>>> parity for the stripe. By using that to check parity, we can
>>>>>> safely speed up the common case of near-zero errors during a
>>>>>> scrub by a pretty significant factor.
>>>>>
>>>>> OK, I'm in favor of that. Although somehow md gets away with this
>>>>> by computing and checking parity for its scrubs, and still manages
>>>>> to keep drives saturated in the process - at least HDDs; I'm not
>>>>> sure how it fares on SSDs.
>>>>
>>>> A modest desktop CPU can compute raid6 parity at 6GB/sec, a
>>>> less-modest one at more than 10GB/sec. Maybe a bottleneck is within
>>>> reach of an array of SSDs vs. a slow CPU.
>>> OK, great for people who are using modern desktop or server CPUs.
>>> Not everyone has that luxury, and even on many such CPUs, it's
>>> _still_ faster to compute CRC32c checksums.
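[As an aside, the two scrub strategies being compared above can be
sketched roughly as follows. This is an illustrative sketch only, not
btrfs code: btrfs uses CRC32c, while Python's zlib.crc32 is a plain
CRC32 standing in here as a generic checksum.]

```python
# Two ways to validate a RAID5 parity strip during scrub (sketch).
import zlib
from functools import reduce

def xor_parity(data_strips):
    """Recompute RAID5 parity as the byte-wise XOR of the data strips."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_strips))

def scrub_by_recompute(data_strips, parity_strip):
    # Reads and XORs every data strip: work grows with stripe width.
    return xor_parity(data_strips) == parity_strip

def scrub_by_checksum(parity_strip, stored_csum):
    # Touches only the parity strip itself, so far less data to process.
    return zlib.crc32(parity_strip) == stored_csum

# 3 data strips + 1 parity strip (a 4-disk RAID5 stripe)
strips = [bytes([i]) * 16 for i in (1, 2, 3)]
parity = xor_parity(strips)
csum = zlib.crc32(parity)

print(scrub_by_recompute(strips, parity))   # True
print(scrub_by_checksum(parity, csum))      # True
```

[Both checks pass on an intact stripe; the point of the thread is only
how much data each one has to read and process to get there.]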
>>> On top of that, we don't appear to be using the in-kernel
>>> parity-raid libraries (or if we are, I haven't been able to find
>>> where we are calling the functions for it), so we don't necessarily
>>> get assembly-optimized or co-processor accelerated computation of
>>> the parity itself. The other thing that I didn't mention above,
>>> though, is that computing parity checksums will always take less
>>> time than computing parity, because you have to process
>>> significantly less data. On a 4-disk RAID5 array, you're processing
>>> roughly 2/3 as much data to do the parity checksums instead of
>>> parity itself, which means that the parity computation would need to
>>> be 200% faster than the CRC32c computation to break even, and this
>>> margin gets bigger and bigger as you add more disks.
>>>
>>> On small arrays, this obviously won't have much impact. Once you
>>> start to scale past a few TB though, even a few hundred MB/s faster
>>> processing means a significant decrease in processing time. Say you
>>> have a CPU which gets about 12.0GB/s for RAID5 parity, and about
>>> 12.25GB/s for CRC32c (~2% is a conservative ratio, assuming you use
>>> the CRC32c instruction and assembly-optimized RAID5 parity
>>> computations on a modern x86_64 processor; the ratio on both the
>>> mobile Core i5 in my laptop and the Xeon E3 in my home server is
>>> closer to 5%).
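[The break-even argument quoted above follows from strip counts alone;
a quick sketch of the model (mine, not from the thread): each RAID5
stripe on n disks has n-1 data strips and 1 parity strip, so
recomputing parity touches n-1 units of data while checksumming the
parity touches 1, and the parity code must therefore run (n-1)x the
checksum speed just to break even.]

```python
# Break-even speed ratio for recomputing parity vs. checksumming it,
# per stripe, on an n-disk RAID5 array (sketch of the argument above).
def breakeven_speedup(n_disks):
    data_units = n_disks - 1   # data strips read to recompute parity
    parity_units = 1           # parity strips read to checksum it
    return data_units / parity_units

for n in (4, 6, 10):
    x = breakeven_speedup(n)
    print(f"{n} disks: parity must run {x:.0f}x ({(x - 1) * 100:.0f}% "
          f"faster than the checksum) to break even")
```

[For 4 disks this gives 3x, i.e. the 200% figure quoted above, and the
margin indeed widens with every disk added.]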
>>> Assuming those numbers, and that we're already checking checksums
>>> on non-parity blocks, processing 120TB of data in a 4-disk array
>>> (which gives 40TB of parity data, so 160TB total) gives:
>>>
>>> For computing the parity to scrub:
>>> 120TB / 12.25GB/s = 9795.9 seconds for processing CRC32c csums of
>>> all the regular data
>>> 120TB / 12GB/s = 10000 seconds for processing parity of all stripes
>>> = 19795.9 seconds total
>>> ~ 5.5 hours total
>>>
>>> For computing csums of the parity:
>>> 120TB / 12.25GB/s = 9795.9 seconds for processing CRC32c csums of
>>> all the regular data
>>> 40TB / 12.25GB/s = 3265.3 seconds for processing CRC32c csums of
>>> all the parity data
>>> = 13061.2 seconds total
>>> ~ 3.6 hours total
>>>
>>> The checksum-based computation is approximately 34% faster than the
>>> parity computation. Much of this, of course, is that you have to
>>> process the regular data twice for the parity computation method
>>> (once for csums, once for parity). You could probably do one pass
>>> computing both values, but that would need to be done carefully;
>>> and, without significant optimization, would likely not get you
>>> much benefit other than cutting the number of loads in half.
>>
>> And it all means jack shit because you don't get the data to disk
>> that quick. Who cares if it's 500% faster - if it still saturates
>> the throughput of the actual drives, what difference does it make?
> It has less impact on everything else running on the system at the
> time because it uses less CPU time and potentially less memory. This
> is the exact same reason that you want your RAID parity computation
> performance as good as possible: the less time the CPU spends on
> that, the more it can spend on other things.
> On top of that, there are high-end systems that do have SSDs capable
> of multiple GB/s of throughput, and NVDIMMs are starting to become
> popular in the server market, and those give you data transfer
> speeds equivalent to regular memory bandwidth (which can be well
> over 20GB/s on decent hardware; I've got a relatively inexpensive
> system using DDR3-1866 RAM that has roughly 22-24GB/s of memory
> bandwidth). Looking at this another way, the fact that the storage
> device is the bottleneck right now is not a good excuse to avoid
> making everything else as efficient as possible.
If it's purely about performance, then start with multi-threading as a
base - not chopping features to make better performance. I'm not aware
of any modern CPU that comes with a single core these days, so parallel
workloads are much more efficient than a single thread. Yes, it's a law
of diminishing returns - but if you're not doing a full check of the
data when one would assume you are, then is that broken by design?

Personally, during a scrub, I would want to know if either the checksum
OR the parity is wrong - as that indicates problems at a much deeper
level. As someone who just lost ~4TB of data due to BTRFS bugs,
protection of data trumps performance in most cases.

--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897