On 24/08/2013 21:41, Loic Dachary wrote:
>
> On 24/08/2013 15:30, Andreas-Joachim Peters wrote:
>> Hi Loic,
>> I will start to review
>
> Cool :-)
>
>> ...maybe you can briefly explain a few things beforehand:
>>
>> 1) the buffer management .... who allocates the output buffers for the
>> encoding? Are they always malloced or does it use some generic CEPH buffer
>> recycling functionality?
>
> The output bufferlist is allocated by the plugin and it is the responsibility
> of the caller to deallocate it. I will write doxygen documentation
> https://github.com/ceph/ceph/pull/518/files#r5966727
Hi Andreas,

The documentation added today in
https://github.com/dachary/ceph/blob/wip-5878/src/osd/ErasureCodeInterface.h
will hopefully clarify things. It requires an understanding of
https://github.com/ceph/ceph/blob/master/src/include/buffer.h

Let me know if you have more questions.

>
>> 2) do you support retrieving partial blocks or only the full 4M block? Are
>> decoded blocks cached for some time?
>
> This is outside of the scope of https://github.com/ceph/ceph/pull/518/files :
> the plugin can handle encode/decode of 128 bytes or 4M in the same way.
>
>> 3) do you want to tune the 2+1 basic code for performance or is it just
>> proof of concept? If yes, then you should move over the encoding buffer with
>> *ptr++ and use the largest available vector size for the used platform to
>> perform XOR operations. I will send you an improved version of the loop if
>> you want ...
>
> The 2+1 is just a proof of concept. I completed a first implementation of the
> jerasure plugin https://github.com/ceph/ceph/pull/538/files which is meant to
> be used as a default.
>
>> 4) if you are interested I can write also code for a (3+3) plugin which
>> tolerates 2-3 lost stripes. (one has to add P3=A^B^C to my [3,2] proposal).
>> At least it reduces the overhead from 3-fold replication from 300% => 200% ...
>
> It would be great to have such a plugin :-)
>
>> 5) will you add CRC32C checksums to the blocks (4M block or 4k pages?) or
>> will this be a CEPH generic functionality for any kind of block?
>
> The idea is to have a CRC32C checksum per object / shard ( as described in
> http://ceph.com/docs/master/dev/osd_internals/erasure_coding/#glossary ) : it
> is the only way for scrubbing to figure out if a given shard is not corrupted
> and not too expensive since erasure coded pools only support full writes +
> append and not partial writes that would require to re-calculate the CRC32C
> for the whole shard each time one byte is changed.
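To illustrate why full writes + append keep the per-shard checksum cheap, here
is a small sketch. It is only an illustration under assumptions: it uses
zlib.crc32 from the Python standard library as a stand-in (that is CRC32, not
CRC32C), and the ShardChecksum class is hypothetical, not Ceph code. An append
only has to read the new bytes, whereas a partial overwrite would force a pass
over the whole shard.

```python
import zlib


class ShardChecksum:
    """Hypothetical running checksum for an append-only shard."""

    def __init__(self) -> None:
        self.crc = 0
        self.length = 0

    def append(self, data: bytes) -> int:
        # Only the new bytes are read; the previous CRC is used as the seed.
        # zlib.crc32 is CRC32, used here as a stand-in for CRC32C.
        self.crc = zlib.crc32(data, self.crc)
        self.length += len(data)
        return self.crc


shard = ShardChecksum()
shard.append(b"first write ")
shard.append(b"then an append")

# Same result as checksumming the whole shard in one pass.
assert shard.crc == zlib.crc32(b"first write then an append")
```

Changing one byte in the middle of the shard has no such shortcut in this
sketch: the checksum would have to be recomputed from the start.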
>
>> 6) do you put a kind of header or magic into the encoded blocks to verify
>> that your input blocks are actually corresponding?
>
> This has not been decided yet but I think it would be sensible to use the
> object attributes ( either xattr or leveldb ) to store meta information
> instead of creating a file format specifically designed for erasure code.
>
> Cheers
>
>> Cheers Andreas.
>>
>> On Fri, Aug 23, 2013 at 1:03 AM, Loic Dachary <[email protected]> wrote:
>>
>> On 22/08/2013 23:42, Andreas-Joachim Peters wrote:
>> > Hi Loic,
>> > sorry for the late reply, I was on vacation ... you are right, I made
>> > a simple logical mistake since I assumed you lose only the data stripes
>> > but never the parity stripes, which is a very wrong assumption.
>> >
>> > So for testing you probably could just implement (2+1) and then move
>> > to jerasure or dual parity (4+2) where you build horizontal and diagonal
>> > parities.
>> >
>>
>> Hi Andreas,
>>
>> That's what I did :-) It would be great if you could review the proposed
>> implementation at https://github.com/ceph/ceph/pull/518/files . I'll keep
>> working on
>> https://github.com/dachary/ceph/commit/83845a66ae1cba63c122c0ef7658b97b474c2bd2
>> tomorrow to create the jerasure plugin but it's not ready for review yet.
>>
>> Cheers
>>
>> > Cheers Andreas.
>> >
>> > On Mon, Aug 19, 2013 at 12:35 PM, Loic Dachary <[email protected]> wrote:
>> >
>> > Hi Andreas,
>> >
>> > Trying to write minimal code as you suggested, for an example
>> > plugin. My first attempt at writing an erasure coding function. I don't
>> > get how you can rebuild P1 + A from P2 + B + C. I must be missing
>> > something obvious :-)
>> >
>> > Cheers
>> >
>> > On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>> > >
>> > > Hi Loic,
>> > > I don't think there is a better generic implementation.
>> > > Just made a benchmark .. the Jerasure library with algorithm
>> > > 'cauchy_good' gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2
>> > > encoding w=32. Just to give a feeling if you do 10+4 it is 300 MB/s ....
>> > > there is a specialized implementation in QFS (Hadoop in C++) for (M+3)
>> > > ... for curiosity I will make a benchmark with this to compare with
>> > > Jerasure ...
>> > >
>> > > In any case I would do an optimized implementation for 3+2 which
>> > > would be probably the most performant implementation having the same
>> > > reliability like standard 3-fold replication in CEPH using only 53% of
>> > > the space.
>> > >
>> > > 3+2 is trivial since you encode (A,B,C) with only two parity
>> > > operations
>> > > P1 = A^B
>> > > P2 = B^C
>> > > and reconstruct with one or two parity operations:
>> > > A = P1^B
>> > > B = P1^A
>> > > B = P2^C
>> > > C = P2^B
>> > > aso.
>> > >
>> > > You can write this as a simple loop using advanced vector
>> > > extensions on Intel (AVX). I can paste a benchmark tomorrow.
>> > >
>> > > Considering the crc32c-intel code you added ... I would provide
>> > > a function which provides a crc32c checksum and detects if it can do it
>> > > using SSE4.2 or implements just the standard algorithm e.g if you run in
>> > > a virtual machine you need this emulation ...
>> > >
>> > > Cheers Andreas.
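For reference, a minimal Python sketch of the XOR scheme quoted just above. It
is an illustration only: the helper names xor_bytes and encode_3_2 are made up
here and belong neither to the plugin nor to Jerasure. It shows a single lost
chunk being rebuilt with one XOR, and also the problem raised further up in the
thread: A only contributes to P1, so losing A together with P1 cannot be
repaired from B, C and P2.

```python
from typing import Tuple


def xor_bytes(x: bytes, y: bytes) -> bytes:
    """XOR two equally sized chunks byte by byte."""
    if len(x) != len(y):
        raise ValueError("chunks must have the same size")
    return bytes(a ^ b for a, b in zip(x, y))


def encode_3_2(a: bytes, b: bytes, c: bytes) -> Tuple[bytes, bytes]:
    """Hypothetical helper: the two parities of the quoted scheme."""
    return xor_bytes(a, b), xor_bytes(b, c)


a, b, c = b"AAAA", b"BBBB", b"CCCC"
p1, p2 = encode_3_2(a, b, c)

# A single lost chunk is rebuilt with one XOR.
assert xor_bytes(p1, b) == a   # A = P1 ^ B
assert xor_bytes(p1, a) == b   # B = P1 ^ A
assert xor_bytes(p2, c) == b   # B = P2 ^ C
assert xor_bytes(p2, b) == c   # C = P2 ^ B

# Losing A and P1 together is not recoverable: the survivors (B, C, P2)
# are identical for a different value of A.
other_p1, other_p2 = encode_3_2(b"ZZZZ", b, c)
assert other_p2 == p2
assert other_p1 != p1
```

A real implementation would XOR wide machine words or vector registers instead
of single bytes; the byte loop is only there to keep the sketch short.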
>> > > ________________________________________
>> > > From: Loic Dachary [[email protected]]
>> > > Sent: 06 July 2013 22:47
>> > > To: Andreas Joachim Peters
>> > > Cc: [email protected]
>> > > Subject: Re: CEPH Erasure Encoding + OSD Scalability
>> > >
>> > > Hi Andreas,
>> > >
>> > > Since it looks like we're going to use jerasure-1.2, we will be
>> > > able to try (C)RS using
>> > >
>> > > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
>> > > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>> > >
>> > > Do you know of a better / faster implementation ? Is there a
>> > > tradeoff between (C)RS and RS ?
>> > >
>> > > Cheers
>> > >
>> > > On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>> > >> Hi Loic,
>> > >> (C)RS stands for the Cauchy Reed-Solomon codes which are based
>> > >> on pure parity operations, while the standard Reed-Solomon codes need
>> > >> more multiplications and are slower.
>> > >>
>> > >> Considering the checksumming ... for comparison the CRC32 code
>> > >> from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes
>> > >> while SSE4.2 CRC32C checksum runs at ~2GByte/s.
>> > >>
>> > >> Cheers Andreas.
>> > >>
>> > >> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <[email protected]> wrote:
>> > >>
>> > >> Hi Andreas,
>> > >>
>> > >> On 04/07/2013 23:01, Andreas Joachim Peters wrote:
>> > >> > Hi Loic,
>> > >> > thanks for the responses!
>> > >> >
>> > >> > Maybe this is useful for your erasure code discussion:
>> > >> >
>> > >> > as an example in our RS implementation we chunk a data
>> > >> > block of e.g. 4M into 4 data chunks of 1M.
>> > >> > Then we create 2 parity chunks.
>> > >> >
>> > >> > Data & parity chunks are split into 4k blocks and these
>> > >> > 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension =>
>> > >> > MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per
>> > >> > 4096 bytes) - nothing compared to the parity overhead ...
>> > >> >
>> > >> > You can now easily detect data corruption using the local
>> > >> > checksums and avoid reading any parity information and (C)RS decoding
>> > >> > if there is no corruption detected. Moreover CRC32C computation is
>> > >> > distributed over several (in this case 4) machines while (C)RS
>> > >> > decoding would run on a single machine where you assemble a block ...
>> > >> > and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>> > >>
>> > >> What does (C)RS mean ? (C)Reed-Solomon ?
>> > >>
>> > >> > In our case we write this checksum information separate
>> > >> > from the original data ... while in a block-based storage like CEPH
>> > >> > it would be probably inlined in the data chunk.
>> > >> > If an OSD detects that it runs on BTRFS or ZFS one could
>> > >> > automatically disable the CRC32C code.
>> > >>
>> > >> Nice. I did not know that was built-in :-)
>> > >>
>> > >> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>> > >>
>> > >> > (wouldn't CRC32C be also useful for normal CEPH block
>> > >> > replication? )
>> > >>
>> > >> I don't know the details of scrubbing but it seems CRC is
>> > >> already used by deep scrubbing
>> > >>
>> > >> https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>> > >>
>> > >> Cheers
>> > >>
>> > >> > As far as I know with the RS CODEC we use you can either
>> > >> > miss stripes (data =0) in the decoding process but you cannot inject
>> > >> > corrupted stripes into the decoding process, so the block
>> > >> > checksumming is important.
>> > >> >
>> > >> > Cheers Andreas.
>> > >>
>> > >> --
>> > >> Loïc Dachary, Artisan Logiciel Libre
>> > >> All that is necessary for the triumph of evil is that good
>> > >> people do nothing.
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
