Hi Andreas,

That sounds reasonable. Would you be so kind as to send a patch with your
changes? I'll rework it into something that fits the test infrastructure
of Ceph.
Cheers

On 22/09/2013 09:26, Andreas Joachim Peters wrote:
> Hi Loic,
> I'll run a benchmark with the changed code tomorrow ... I actually had
> to insert some of my realtime benchmark macros into your Jerasure code
> to see the different time fractions between the buffer preparation and
> encoding steps, but for your QA suite it is probably enough to get a
> total value after your fix. I will send you a program sampling the
> performance at different buffer sizes and encoding types.
>
> I changed my code to use vector operations (128-bit XORs) and it gives
> another 10% gain. I also want to try out whether it makes sense to do
> the CRC32C computation inline in the encoding step and compare it with
> the two-step procedure of first encoding all blocks, then computing
> CRC32C on all blocks.
>
> Cheers Andreas.
>
> ________________________________________
> From: Loic Dachary [[email protected]]
> Sent: 21 September 2013 17:11
> To: Andreas Joachim Peters
> Cc: [email protected]
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>
> Hi Andreas,
>
> It's probably too soon to be smart about reducing the number of copies,
> but you're right: this copy is not necessary. The following pull
> request gets rid of it:
>
> https://github.com/ceph/ceph/pull/615
>
> Cheers
>
> On 20/09/2013 18:49, Loic Dachary wrote:
>> Hi,
>>
>> This is a first attempt at avoiding an unnecessary copy:
>>
>> https://github.com/dachary/ceph/blob/03445a5926cd073c11cd8693fb110729e40f35fa/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L66
>>
>> I'm not sure how it could be made more readable / terse with
>> bufferlist iterators. Any kind of hint would be welcome :-)
>>
>> Cheers
>>
>> On 20/09/2013 17:36, Sage Weil wrote:
>>> On Fri, 20 Sep 2013, Loic Dachary wrote:
>>>> Hi Andreas,
>>>>
>>>> Great work on these benchmarks! It's definitely an incentive to
>>>> improve as much as possible. Could you push / send the scripts and
>>>> sequence of operations you've used?
>>>> I'll reproduce this locally while getting rid of the extra copy. It
>>>> would be useful to capture that into a script that can be
>>>> conveniently run from the teuthology integration tests to check
>>>> against performance regressions.
>>>>
>>>> Regarding the 3P implementation, in my opinion it would be very
>>>> valuable for people who prefer low CPU consumption. And I'm eager to
>>>> see more than one plugin in the erasure code plugin directory ;-)
>>>
>>> One way to approach this might be to make a bufferlist
>>> 'multi-iterator' that you give bufferlist::iterators and that gives
>>> you back a pair of pointer and length for each contiguous segment.
>>> This would capture the annoying iterator details and let the user
>>> focus on processing chunks that are as large as possible.
>>>
>>> sage
>>>
>>>> Cheers
>>>>
>>>> On 20/09/2013 13:35, Andreas Joachim Peters wrote:
>>>>> Hi Loic,
>>>>>
>>>>> I now have some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4
>>>>> (-O2) for ENCODING based on the CEPH Jerasure port. I measured
>>>>> objects from 128k to 512 MB with random contents (if you encode
>>>>> 1 GB objects you see slowdowns due to caching inefficiencies ...);
>>>>> otherwise results are stable for the given object sizes.
>>>>>
>>>>> I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6
>>>>> (3,2), since the other algorithms are significantly slower (2-3x),
>>>>> and for my 3P(3,2,1) implementation, which provides the same
>>>>> redundancy level as RS-RAID6[3,2] (double disk failure) but uses
>>>>> more space (100% vs 66% overhead).
>>>>>
>>>>> The effect of out.c_str() is significant (it contributes a factor 2
>>>>> slowdown for the best jerasure algorithm for [3,2]).
>>>>>
>>>>> Averaged results for object size 4 MB:
>>>>>
>>>>> 1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) -
>>>>>    2.4 ms encoding => ~780 MB/s
>>>>> 2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding
>>>>>    in the algorithm) - 0.87 ms encoding => ~4.4 GB/s
>>>>>
>>>>> I think it pays off to avoid the copy in the encoding if it does
>>>>> not matter for the buffer handling upstream, and to pad only the
>>>>> last chunk.
>>>>>
>>>>> The last thing I tested is how performance scales with the number
>>>>> of cores, running 4 tests in parallel:
>>>>>
>>>>> Jerasure (3,2) tops out at ~2.0 GB/s on a 4-core CPU (Xeon 2.27 GHz).
>>>>> 3P (3,2,1) tops out at ~8 GB/s on a 4-core CPU (Xeon 2.27 GHz).
>>>>>
>>>>> I also implemented the decoding for 3P, but haven't yet tested all
>>>>> reconstruction cases. There is probably room for improvement using
>>>>> AVX support for XOR operations in both implementations.
>>>>>
>>>>> Before I invest more time, do you think it is useful to have this
>>>>> fast 3P algorithm for double disk failures with 100% space
>>>>> overhead? I believe that people will always optimize for space and
>>>>> would rather use something like (10,2) even if performance degrades
>>>>> and CPU consumption goes up?!? Let me know, no problem in any case!
>>>>>
>>>>> Finally I tested some combinations for
>>>>> ErasureCodeJerasureReedSolomonRAID6:
>>>>>
>>>>> (3,2) (4,2) (6,2) (8,2) (10,2) all run at around 780-800 MB/s
>>>>>
>>>>> Cheers Andreas.
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> All that is necessary for the triumph of evil is that good people
>>>> do nothing.
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do
> nothing.
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
