Wols Lists <[email protected]> writes:
> On 29/01/18 15:23, Johannes Thumshirn wrote:
>> Hi linux-raid, lsf-pc
>>
>> (If you've received this mail multiple times, I'm sorry, I'm having
>> trouble with the mail setup).
>
> My immediate reactions as a lay person (I edit the raid wiki) ...
>>
>> With the rise of bigger and bigger disks, array rebuilding times start
>> skyrocketing.
>
> And? Yes, your data is at risk during a rebuild, but md-raid throttles
> the i/o, so it doesn't hammer the system.
>>
>> In a paper from '92, Holland and Gibson [1] suggest a mapping
>> algorithm similar to RAID5, but instead of utilizing all disks in an
>> array for every I/O operation, they implement a per-I/O mapping
>> function that only uses a subset of the available disks.
>>
>> This has at least two advantages:
>> 1) If one disk has to be replaced, there is no need to read the data
>> from all disks to recover the one failed disk, so non-affected disks
>> can serve real user I/O instead of just recovery, and
>
> Again, that's throttling, so that's not a problem ...
And throttling in a production environment is not exactly
desired. Imagine a 500-disk array (and yes, this is something we've
seen with MD) where you have to replace disks. While the array is
rebuilding you have to throttle all I/O, because with raid-{1,5,6,10}
all data is striped across all disks.
With a parity declustered RAID (or DDP, as Dell, NetApp or Huawei call
it) you don't have to, as the I/O is distributed in parity groups
across a subset of the disks. I/O targeting disks which aren't needed
to recover the data from the failed disk isn't affected by the
throttling at all.
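To make that concrete, here's a toy sketch (not MD code; the mixer,
disk count and group size are invented for illustration) of how a
per-stripe mapping function can deterministically pick a small subset
of the disks:

#include <stdint.h>

/* Cheap 64-bit mixer, a stand-in for a real hash function. */
static uint64_t mix(uint64_t x)
{
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
}

/*
 * Fill 'out' with the group_size distinct disks serving 'stripe'.
 * Pseudo-random but deterministic: the same stripe always maps to
 * the same subset, while different stripes usually land on
 * different subsets (requires group_size <= ndisks).
 */
static void stripe_disks(uint64_t stripe, unsigned ndisks,
                         unsigned group_size, unsigned *out)
{
        unsigned i, j;

        for (i = 0; i < group_size; i++) {
                uint64_t r = stripe;
                unsigned d;

                do {                    /* re-hash until distinct */
                        r = mix(r + i + 1);
                        d = r % ndisks;
                        for (j = 0; j < i; j++)
                                if (out[j] == d)
                                        break;
                } while (j < i);
                out[i] = d;
        }
}

Any stripe whose subset doesn't contain the failed disk can then be
served without competing with the rebuild reads at all.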
>> 2) an efficient mapping function can improve parallel I/O submission, as
>> two different I/Os are not necessarily going to the same disks in the
>> array.
>>
>> For the mapping function, a hashing algorithm like Ceph's CRUSH [2]
>> would be ideal, as it provides a pseudo-random but deterministic
>> mapping of the I/O onto the drives.
>>
>> This whole declustering of course only makes sense for more than (at
>> least) 4 drives, but we do have customers with several orders of
>> magnitude more drives in an MD array.
>
> If you have four drives or more - especially if they are multi-terabyte
> drives - you should NOT be using raid-5 ...
raid-6 won't help you much in the above scenario.
>>
>> At LSF I'd like to discuss if:
>> 1) The wider MD audience is interested in de-clustered RAID with MD
>
> I haven't read the papers, so no comment, sorry.
>
>> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or
>> as a new personality
>
> Neither! If you're going to do it, it should be raid-6.
>
>> 3) CRUSH is a suitable algorithm for this (there's evidence in [3]
>> that the NetApp E-Series arrays do use CRUSH for parity declustering)
>>
>> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
>> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
>> [3]
>> https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf
>>
> Okay - I've now skimmed the crush paper [2]. Looks well interesting.
> BUT. It feels more like btrfs than it does like raid.
>
> Btrfs manages disks, and does raid; it tries to be the "everything
> between the hard drive and the file". This crush thingy reads to me
> like it wants to be the same. There's nothing wrong with that, but md
> is unix-y: "do one thing (raid) and do it well".
Well, CRUSH is (one of) the algorithms behind Ceph. It makes the
decisions about where to place a block. It is just a hash function
(well, technically a weighted decision-tree) that takes a block of I/O
and some configuration parameters and "calculates" the placement.
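To give an idea how small that core is, below is a stripped-down
sketch of a straw2-style weighted draw, the primitive CRUSH performs
at each node of that tree (not Ceph's actual code; the mixer and the
weights are placeholders):

#include <math.h>
#include <stdint.h>

/* Cheap 64-bit mixer, a stand-in for a real hash function. */
static uint64_t mix(uint64_t x)
{
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
}

/*
 * Pick one disk for 'block'. Deterministic in (block, weights),
 * pseudo-random across blocks, and stable: reweighting or adding a
 * disk only moves the blocks whose winner actually changes.
 */
static unsigned straw2_select(uint64_t block, const double *weight,
                              unsigned ndisks)
{
        unsigned i, best = 0;
        double best_straw = -INFINITY;

        for (i = 0; i < ndisks; i++) {
                /* u uniform in (0,1], derived only from (block, i) */
                double u = (mix(block * 2654435761ULL + i) + 1.0) /
                           18446744073709551616.0;   /* 2^64 */
                double straw = log(u) / weight[i];

                if (straw > best_straw) {
                        best_straw = straw;
                        best = i;
                }
        }
        return best;
}

Run a draw like this k times, excluding disks already chosen, and you
have a parity group; nothing has to be stored, the placement is simply
recomputed on every access.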
> My knee-jerk reaction is if you want to go for it, it sounds like a good
> idea. It just doesn't really feel a good fit for md.
Thanks for the input.
Johannes
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850