RE: CEPH Erasure Encoding + OSD Scalability

Andreas Joachim Peters Thu, 22 Aug 2013 14:56:30 -0700

Hi Loic, 
sorry for the late reply, I was on vacation ... you are right, I made a simple 
logical mistake: I assumed you lose only the data stripes but never the 
parity stripes, which is a very wrong assumption. Ignore the proposal!


So for testing you could probably just implement (2+1) and then use jerasure, 
or use the dual parity (4+2) algorithm where you build horizontal and diagonal 
parities (however, there is a patent on RAID-DP).
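
For the (2+1) test case, a minimal sketch in Python could look like the 
following. It assumes the single parity chunk is the bytewise XOR of the two 
data chunks, so any one lost chunk is the XOR of the two survivors. The names 
`xor_bytes`, `encode_2_1` and `recover_2_1` are made up for illustration; they 
are not part of jerasure or of any Ceph plugin interface.

```python
def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Bytewise XOR of two equally sized chunks."""
    if len(x) != len(y):
        raise ValueError("chunks must have the same length")
    return bytes(a ^ b for a, b in zip(x, y))


def encode_2_1(a: bytes, b: bytes) -> list:
    """Return the three chunks [A, B, P] with P = A ^ B (assumed layout)."""
    return [a, b, xor_bytes(a, b)]


def recover_2_1(chunks: list) -> list:
    """Rebuild at most one missing chunk; a missing chunk is marked as None."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("(2+1) tolerates only one lost chunk")
    repaired = list(chunks)
    if missing:
        # Any chunk of the stripe is the XOR of the other two.
        survivors = [c for c in chunks if c is not None]
        repaired[missing[0]] = xor_bytes(survivors[0], survivors[1])
    return repaired
```

Losing any single chunk, data or parity, is handled the same way here, which 
is exactly the property my 3+2 proposal did not have.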

Cheers Andreas.

________________________________________
From: Loic Dachary [[email protected]]
Sent: 19 August 2013 12:35
To: Andreas Joachim Peters
Cc: [email protected]
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

I am trying to write minimal code, as you suggested, for an example plugin. It 
is my first attempt at writing an erasure coding function. I don't get how you 
can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)
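
To make the question concrete, here is a small sketch that checks which double 
losses can be repaired. It assumes the layout from the mail quoted below 
(P1 = A^B, P2 = B^C), models every chunk as a bitmask over (A, B, C), and 
treats a loss as repairable when the surviving masks still have rank 3 over 
GF(2). The names `CHUNKS`, `gf2_rank` and `unrecoverable_losses` are 
hypothetical helpers, not part of any library.

```python
from itertools import combinations

# Each chunk as a bitmask over the data chunks (bit 0 = A, bit 1 = B,
# bit 2 = C). XOR-ing chunks corresponds to XOR-ing their masks.
CHUNKS = {"A": 0b001, "B": 0b010, "C": 0b100, "P1": 0b011, "P2": 0b110}


def gf2_rank(masks):
    """Rank of the bitmasks over GF(2), via XOR-based elimination."""
    pivots = {}  # highest set bit -> basis vector with that leading bit
    for m in masks:
        while m:
            top = m.bit_length() - 1
            if top not in pivots:
                pivots[top] = m
                break
            m ^= pivots[top]  # clear the leading bit and keep reducing
    return len(pivots)


def unrecoverable_losses(chunks, lost_count):
    """List the loss patterns after which A, B and C cannot all be rebuilt."""
    bad = []
    for lost in combinations(chunks, lost_count):
        survivors = [mask for name, mask in chunks.items() if name not in lost]
        if gf2_rank(survivors) < 3:  # 3 = number of data chunks
            bad.append(lost)
    return bad
```

Under these assumptions `unrecoverable_losses(CHUNKS, 2)` reports the pairs 
(A, P1) and (C, P2): A only appears in P1 and C only in P2, so losing a data 
chunk together with its only parity leaves nothing to rebuild it from.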

Cheers

On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>
> Hi Loic,
> I don't think there is a better generic implementation. I just ran a 
> benchmark: the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s 
> (Xeon 2.27 GHz) on a single core for a 4+2 encoding with w=32. Just to give a 
> feeling: if you do 10+4 it is 300 MB/s. There is a specialized 
> implementation in QFS (Hadoop in C++) for (M+3) ... out of curiosity I will 
> benchmark it to compare with Jerasure ...
>
> In any case I would do an optimized implementation for 3+2, which would 
> probably be the most performant implementation having the same reliability as 
> standard 3-fold replication in CEPH while using only about 56% of the space 
> (5/3 instead of 3 times the data size).
>
> 3+2 is trivial since you encode (A,B,C) with only two parity operations
> P1 = A^B
> P2 = B^C
> and reconstruct with one or two parity operations:
> A = P1^B
> B = P1^A
> B = P2^C
> C = P2^B
> and so on.
>
> You can write this as a simple loop using advanced vector extensions on Intel 
> (AVX). I can paste a benchmark tomorrow.
>
> Regarding the crc32c-intel code you added ... I would provide a function 
> which computes a crc32c checksum, detects whether it can use SSE4.2, and 
> otherwise falls back to the standard software algorithm, e.g. if you run in 
> a virtual machine you need this fallback ...
>
> Cheers Andreas.
> ________________________________________
> From: Loic Dachary [[email protected]]
> Sent: 06 July 2013 22:47
> To: Andreas Joachim Peters
> Cc: [email protected]
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>
> Hi Andreas,
>
> Since it looks like we're going to use jerasure-1.2, we will be able to try 
> (C)RS using
>
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>
> Do you know of a better / faster implementation ? Is there a tradeoff between 
> (C)RS and RS ?
>
> Cheers
>
> On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>> Hi Loic,
>> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure 
>> parity operations, while the standard Reed-Solomon codes need more 
>> multiplications and are slower.
>>
>> Regarding the checksumming ... for comparison, the CRC32 code from libz 
>> runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 
>> CRC32C checksum runs at ~2 GByte/s.
>>
>> Cheers Andreas.
>>
>>
>>
>>
>> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <[email protected]> wrote:
>>
>>     Hi Andreas,
>>
>>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:
>>     > Hi Loic,
>>     > thanks for the responses!
>>     >
>>     > Maybe this is useful for your erasure code discussion:
>>     >
>>     > As an example, in our RS implementation we chunk a data block of
>>     > e.g. 4M into 4 data chunks of 1M. Then we create 2 parity chunks.
>>     >
>>     > Data & parity chunks are split into 4k blocks and each of these 4k
>>     > blocks gets a CRC32C block checksum (SSE4.2 CPU extension => MIT
>>     > library or BTRFS). This creates 0.1% volume overhead (4 bytes per
>>     > 4096 bytes) - nothing compared to the parity overhead ...
>>     >
>>     > You can now easily detect data corruption using the local checksums
>>     > and avoid reading any parity information and running (C)RS decoding
>>     > if no corruption is detected. Moreover, CRC32C computation is
>>     > distributed over several (in this case 4) machines, while (C)RS
>>     > decoding would run on a single machine where you assemble a block
>>     > ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>
>>     What does (C)RS mean ? (C)Reed-Solomon ?
>>
>>     > In our case we write this checksum information separately from the
>>     > original data ... while in a block-based storage like CEPH it would
>>     > probably be inlined in the data chunk.
>>     > If an OSD detects that it runs on Btrfs or ZFS, one could
>>     > automatically disable the CRC32C code.
>>
>>     Nice. I did not know that was built-in :-)
>>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>
>>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>>
>>     I don't know the details of scrubbing but it seems CRC is already
>>     used by deep scrubbing
>>
>>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>
>>     Cheers
>>
>>     > As far as I know, with the RS codec we use, stripes may be missing
>>     > (data = 0) in the decoding process, but you cannot inject corrupted
>>     > stripes into the decoding process, so the block checksumming is
>>     > important.
>>     >
>>     > Cheers Andreas.
>>
>>     --
>>     Loïc Dachary, Artisan Logiciel Libre
>>     All that is necessary for the triumph of evil is that good people do
>>     nothing.
>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

