Hi Loic,
sorry for the late reply, I was on vacation ... you are right, I made a simple logical mistake: I assumed you lose only the data stripes but never the parity stripes, which is a very wrong assumption. Ignore the proposal!
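A minimal sketch, in Python, of the mistake acknowledged above. It takes the 3+2 XOR proposal quoted further down this thread (P1 = A^B, P2 = B^C) and checks every double loss by repeatedly solving any parity equation that has exactly one unknown. The function names are hypothetical, and this simple "peeling" check is only sufficient because the system is so small; a general erasure code would need a full linear solve.

```python
from itertools import combinations

# Parity layout from the quoted proposal: P1 = A ^ B, P2 = B ^ C.
PARITY = {"P1": ("A", "B"), "P2": ("B", "C")}
CHUNKS = ("A", "B", "C", "P1", "P2")


def recoverable(lost):
    """Return True if data chunks A, B and C can all be rebuilt after losing `lost`."""
    known = set(CHUNKS) - set(lost)
    progress = True
    while progress:
        progress = False
        for parity, members in PARITY.items():
            group = (parity,) + members
            missing = [c for c in group if c not in known]
            # An XOR equation with exactly one unknown can be solved for it.
            if len(missing) == 1:
                known.add(missing[0])
                progress = True
    return {"A", "B", "C"} <= known


def unrecoverable_pairs():
    """List every pair of lost chunks that leaves some data chunk unrecoverable."""
    return [pair for pair in combinations(CHUNKS, 2) if not recoverable(pair)]


if __name__ == "__main__":
    print(unrecoverable_pairs())
```

Under these assumptions the check reports the pairs (A, P1) and (C, P2): A is covered only by P1 and C only by P2, so losing a data stripe together with its single parity stripe is fatal. That is the case Loic asks about (rebuilding P1 + A from P2 + B + C).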
So for testing you could probably just implement (2+1) and then use jerasure, or use the dual parity (4+2) algorithm where you build horizontal and diagonal parities (however, there is a patent on RAID-DP).

Cheers Andreas.
________________________________________
From: Loic Dachary [[email protected]]
Sent: 19 August 2013 12:35
To: Andreas Joachim Peters
Cc: [email protected]
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

Trying to write minimal code as you suggested, for an example plugin. My first attempt at writing an erasure coding function. I don't get how you can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)

Cheers

On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>
> Hi Loic,
> I don't think there is a better generic implementation. Just made a benchmark
> ... the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s (Xeon
> 2.27 GHz) on a single core for a 4+2 encoding w=32. Just to give a feeling: if
> you do 10+4 it is 300 MB/s ... there is a specialized implementation in QFS
> (Hadoop in C++) for (M+3) ... out of curiosity I will make a benchmark with
> this to compare with Jerasure ...
>
> In any case I would do an optimized implementation for 3+2 which would
> probably be the most performant implementation having the same reliability as
> standard 3-fold replication in CEPH using only 53% of the space.
>
> 3+2 is trivial since you encode (A,B,C) with only two parity operations
> P1 = A^B
> P2 = B^C
> and reconstruct with one or two parity operations:
> A = P1^B
> B = P1^A
> B = P2^C
> C = P2^B
> and so on.
>
> You can write this as a simple loop using advanced vector extensions on Intel
> (AVX). I can paste a benchmark tomorrow.
>
> Considering the crc32c-intel code you added ... I would provide a function
> which provides a crc32c checksum and detects if it can do it using SSE4.2 or
> implements just the standard algorithm, e.g. if you run in a virtual machine
> you need this emulation ...
>
> Cheers Andreas.
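The crc32c suggestion that closes the quoted message above (use SSE4.2 when it is there, otherwise "just the standard algorithm") could look roughly like this. It is a sketch in Python, not the crc32c-intel code discussed in the thread: `crc32c_software` is a plain table-driven CRC32C (Castagnoli polynomial, reflected form 0x82F63B78), and `pick_crc32c` with its `hardware_impl` parameter is a hypothetical dispatcher. Actual SSE4.2 detection is platform-specific and is not shown.

```python
def _make_table():
    """Build the 256-entry lookup table for the reflected Castagnoli polynomial."""
    poly = 0x82F63B78
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c_software(data, crc=0):
    """Software CRC32C; pass a previous result as `crc` to continue a running checksum."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def pick_crc32c(hardware_impl=None):
    """Hypothetical dispatcher: prefer a supplied hardware routine, else fall back.

    `hardware_impl` stands for whatever accelerated function the caller found
    (for example a binding to an SSE4.2 implementation); None means no such
    routine is available, as may happen inside a virtual machine.
    """
    return hardware_impl if hardware_impl is not None else crc32c_software


if __name__ == "__main__":
    crc32c = pick_crc32c()
    print(hex(crc32c(b"123456789")))
```

The string "123456789" is the usual check input for CRC algorithms; for CRC32C the expected value is 0xE3069283, which makes a convenient self-test for both the software and any hardware path.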
> ________________________________________
> From: Loic Dachary [[email protected]]
> Sent: 06 July 2013 22:47
> To: Andreas Joachim Peters
> Cc: [email protected]
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>
> Hi Andreas,
>
> Since it looks like we're going to use jerasure-1.2, we will be able to try
> (C)RS using
>
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>
> Do you know of a better / faster implementation? Is there a tradeoff between
> (C)RS and RS?
>
> Cheers
>
> On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>> Hi Loic,
>> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure
>> parity operations, while the standard Reed-Solomon codes need more
>> multiplications and are slower.
>>
>> Considering the checksumming ... for comparison the CRC32 code from libz
>> runs on an 8-core Xeon at ~730 MB/s for small block sizes while the SSE4.2
>> CRC32C checksum runs at ~2 GByte/s.
>>
>> Cheers Andreas.
>>
>> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <[email protected]> wrote:
>>
>> Hi Andreas,
>>
>> On 04/07/2013 23:01, Andreas Joachim Peters wrote:
>> > Hi Loic,
>> > thanks for the responses!
>> >
>> > Maybe this is useful for your erasure code discussion:
>> >
>> > as an example in our RS implementation we chunk a data block of e.g. 4M
>> > into 4 data chunks of 1M. Then we create 2 parity chunks.
>> >
>> > Data & parity chunks are split into 4k blocks and these 4k blocks get a
>> > CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS).
>> > This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing
>> > compared to the parity overhead ...
>> >
>> > You can now easily detect data corruption using the local checksums and
>> > avoid reading any parity information and (C)RS decoding if there is no
>> > corruption detected.
>> > Moreover CRC32C computation is distributed over several (in this case 4)
>> > machines while (C)RS decoding would run on a single machine where you
>> > assemble a block ... and CRC32C is faster than (C)RS decoding (with
>> > SSE4.2) ...
>>
>> What does (C)RS mean? (C)Reed-Solomon?
>>
>> > In our case we write this checksum information separately from the
>> > original data ... while in a block-based storage like CEPH it would
>> > probably be inlined in the data chunk.
>> > If an OSD detects that it runs on BTRFS or ZFS, one could automatically
>> > disable the CRC32C code.
>>
>> Nice. I did not know that was built-in :-)
>>
>> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>
>> > (wouldn't CRC32C also be useful for normal CEPH block replication?)
>>
>> I don't know the details of scrubbing but it seems CRC is already used
>> by deep scrubbing
>>
>> https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>
>> Cheers
>>
>> > As far as I know, with the RS CODEC we use you can miss stripes
>> > (data = 0) in the decoding process but you cannot inject corrupted
>> > stripes into the decoding process, so the block checksumming is important.
>> >
>> > Cheers Andreas.
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do
>> nothing.
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
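The per-block checksumming described earlier in the thread (chunks split into 4k blocks, one 4-byte checksum per block, corruption detected locally before any parity is read) can be sketched as follows. This is an illustration only: the function names are hypothetical, and it uses `zlib.crc32` (plain CRC32) as a stand-in because the Python standard library has no CRC32C.

```python
import zlib

BLOCK_SIZE = 4096     # 4k blocks, as in the thread
CHECKSUM_SIZE = 4     # bytes per block checksum


def block_checksums(chunk):
    """Return one checksum per 4k block of `chunk` (CRC32 used as a stand-in for CRC32C)."""
    return [
        zlib.crc32(chunk[offset:offset + BLOCK_SIZE])
        for offset in range(0, len(chunk), BLOCK_SIZE)
    ]


def corrupted_blocks(chunk, checksums):
    """Return the indices of blocks whose current checksum differs from the stored one."""
    return [
        index
        for index, current in enumerate(block_checksums(chunk))
        if current != checksums[index]
    ]


def checksum_overhead():
    """Fraction of extra space used by the checksums (4 bytes per 4096 bytes)."""
    return CHECKSUM_SIZE / BLOCK_SIZE


if __name__ == "__main__":
    chunk = bytes(1024 * 1024)            # one 1M data chunk
    stored = block_checksums(chunk)
    damaged = bytearray(chunk)
    damaged[5000] ^= 0xFF                 # flip one byte inside the second block
    print(corrupted_blocks(bytes(damaged), stored))
    print(f"{100 * checksum_overhead():.2f}% overhead")
```

Only when `corrupted_blocks` returns a non-empty list would a reader need to fetch parity chunks and run the (C)RS decoder; the overhead works out to 4/4096, roughly 0.1%, matching the figure given in the thread.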
