On 22/08/2013 23:42, Andreas-Joachim Peters wrote:
> Hi Loic, 
> sorry for the late reply, I was on vacation ... you are right, I made a 
> simple logical mistake: I assumed you lose only the data stripes but 
> never the parity stripes, which is a very wrong assumption.
> 
> So for testing you could probably just implement (2+1) and then move to 
> jerasure or dual parity (4+2), where you build horizontal and diagonal 
> parities.
> 

Hi Andreas,

That's what I did :-) It would be great if you could review the proposed 
implementation at https://github.com/ceph/ceph/pull/518/files . I'll keep 
working on 
https://github.com/dachary/ceph/commit/83845a66ae1cba63c122c0ef7658b97b474c2bd2 
tomorrow to create the jerasure plugin but it's not ready for review yet. 

Cheers

> Cheers Andreas.
> 
> On Mon, Aug 19, 2013 at 12:35 PM, Loic Dachary <[email protected] 
> <mailto:[email protected]>> wrote:
> 
>     Hi Andreas,
> 
>     I'm trying to write minimal code, as you suggested, for an example 
>     plugin. This is my first attempt at writing an erasure coding function. 
>     I don't get how you can rebuild P1 + A from P2 + B + C. I must be 
>     missing something obvious :-)
> 
>     Cheers
> 
>     On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>     >
>     > Hi Loic,
>     > I don't think there is a better generic implementation. I just made a 
>     > benchmark ... the Jerasure library with algorithm 'cauchy_good' gives 
>     > 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding with 
>     > w=32. To give a feeling: if you do 10+4 it is 300 MB/s ... there is a 
>     > specialized implementation in QFS (Hadoop in C++) for (M+3) ... out of 
>     > curiosity I will make a benchmark with this to compare with Jerasure ...
>     >
>     > In any case I would do an optimized implementation for 3+2, which 
>     > would probably be the most performant implementation, having the same 
>     > reliability as standard 3-fold replication in CEPH while using only 
>     > 53% of the space.
>     >
>     > 3+2 is trivial since you encode (A,B,C) with only two parity operations
>     > P1 = A^B
>     > P2 = B^C
>     > and reconstruct with one or two parity operations:
>     > A = P1^B
>     > B = P1^A
>     > B = P2^C
>     > C = P2^B
>     > and so on.
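In plain Python, the 3+2 scheme quoted above can be sketched as follows (an illustration with made-up helper names, not the Ceph code; note, as conceded at the top of this thread, the scheme cannot recover a data stripe once its own parity is also lost, e.g. A together with P1):

```python
def xor(a: bytes, b: bytes) -> bytes:
    # byte-wise XOR of two equal-length stripes
    return bytes(x ^ y for x, y in zip(a, b))

def encode(a: bytes, b: bytes, c: bytes):
    # the two parity stripes from the scheme above
    p1 = xor(a, b)
    p2 = xor(b, c)
    return p1, p2

a, b, c = b"AAAA", b"BBBB", b"CCCC"
p1, p2 = encode(a, b, c)

# each lost stripe is the XOR of two survivors
assert xor(p1, b) == a                 # A = P1 ^ B
assert xor(p2, c) == b                 # B = P2 ^ C
assert xor(p2, b) == c                 # C = P2 ^ B
# losing B and C: first B = P1 ^ A, then C = P2 ^ B
assert xor(p2, xor(p1, a)) == c
```

Losing A and P1 together is the unrecoverable case: P2, B and C carry no information about A.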
>     >
>     > You can write this as a simple loop using advanced vector extensions on 
> Intel (AVX). I can paste a benchmark tomorrow.
>     >
>     > Considering the crc32c-intel code you added ... I would provide a 
>     > function which computes a crc32c checksum and detects whether it can 
>     > use SSE4.2, falling back to the standard software algorithm otherwise, 
>     > e.g. if you run in a virtual machine you need this fallback ...
>     >
>     > Cheers Andreas.
>     > ________________________________________
>     > From: Loic Dachary [[email protected] <mailto:[email protected]>]
>     > Sent: 06 July 2013 22:47
>     > To: Andreas Joachim Peters
>     > Cc: [email protected] <mailto:[email protected]>
>     > Subject: Re: CEPH Erasure Encoding + OSD Scalability
>     >
>     > Hi Andreas,
>     >
>     > Since it looks like we're going to use jerasure-1.2, we will be able to 
> try (C)RS using
>     >
>     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
>     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>     >
>     > Do you know of a better / faster implementation ? Is there a tradeoff 
> between (C)RS and RS ?
>     >
>     > Cheers
>     >
>     > On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>     >> HI Loic,
>     >> (C)RS stands for the Cauchy Reed-Solomon codes, which are based on 
>     >> pure XOR parity operations, while the standard Reed-Solomon codes 
>     >> need more multiplications and are slower.
>     >>
>     >> Considering the checksumming ... for comparison, the CRC32 code from 
>     >> libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while 
>     >> the SSE4.2 CRC32C checksum runs at ~2 GB/s.
>     >>
>     >> Cheers Andreas.
>     >>
>     >> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <[email protected] 
> <mailto:[email protected]> <mailto:[email protected] 
> <mailto:[email protected]>>> wrote:
>     >>
>     >>     Hi Andreas,
>     >>
>     >>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:
>     >>     > Hi Loic,
>     >>     > thanks for the responses!
>     >>     >
>     >>     > Maybe this is useful for your erasure code discussion:
>     >>     >
>     >>     > as an example, in our RS implementation we chunk a data block 
>     >>     > of e.g. 4M into 4 data chunks of 1M. Then we create 2 parity 
>     >>     > chunks.
>     >>     >
>     >>     > Data & parity chunks are split into 4k blocks and these 4k 
> blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library 
> or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - 
> nothing compared to the parity overhead ...
>     >>     >
>     >>     > You can now easily detect data corruption using the local 
>     >>     > checksums, and avoid reading any parity information and doing 
>     >>     > (C)RS decoding if no corruption is detected. Moreover, CRC32C 
>     >>     > computation is distributed over several (in this case 4) 
>     >>     > machines, while (C)RS decoding would run on the single machine 
>     >>     > where you assemble a block ... and CRC32C is faster than (C)RS 
>     >>     > decoding (with SSE4.2) ...
>     >>
>     >>     What does (C)RS mean ? (C)Reed-Solomon ?
>     >>
>     >>     > In our case we write this checksum information separately from 
>     >>     > the original data ... while in a block-based storage like CEPH 
>     >>     > it would probably be inlined in the data chunk.
>     >>     > If an OSD detects that it runs on BTRFS or ZFS, one could 
>     >>     > automatically disable the CRC32C code.
>     >>
>     >>     Nice. I did not know that was built-in :-)
>     >>     
> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>     >>
>     >>     > (wouldn't CRC32C also be useful for normal CEPH block 
>     >>     > replication?)
>     >>
>     >>     I don't know the details of scrubbing but it seems CRC is already 
> used by deep scrubbing
>     >>
>     >>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>     >>
>     >>     Cheers
>     >>
>     >>     > As far as I know, with the RS codec we use you can have missing 
>     >>     > stripes (data = 0) in the decoding process, but you cannot 
>     >>     > inject corrupted stripes into it, so the block checksumming is 
>     >>     > important.
>     >>     >
>     >>     > Cheers Andreas.
>     >>
>     >>     --
>     >>     Loïc Dachary, Artisan Logiciel Libre
>     >>     All that is necessary for the triumph of evil is that good people 
> do nothing.
>     >>
>     >>
>     >
> 
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
