Hi,

On Thu, Oct 29, 2015 at 3:21 PM, Michael Meeks <[email protected]> wrote:
> Hi Kohei,
>
> I'd love some input (if you have a minute) on the attached. The
> punch-line is, that if we want to do really fast arithmetic, we start to
> need to do some odd things; while I suspect that this piece of unrolling
> can be done with the iterator - the next step I'm poking at (SSE3
> assembler ;-) is not going to like that.

You don't need SSE3 assembler for that - just use SSE(3) intrinsics.
SSE uses 128-bit registers, so you can process 2 doubles at the same time.
Best is to keep the running sums in a __m128d and add its two doubles
together at the end:

    __m128d twosums = _mm_set_pd(0.0, 0.0);

then do a similar unrolled for loop to sum 8 values at a time, loading two
doubles per register (shown here for the first four values):

    __m128d first = _mm_loadu_pd(&p[i]);
    __m128d second = _mm_loadu_pd(&p[i + 2]);
    twosums = _mm_add_pd(twosums, first);
    twosums = _mm_add_pd(twosums, second);

Note that _mm_add_pd returns its result, so it has to be assigned back to
twosums. At the end just sum the two doubles in twosums and handle the
leftover elements that don't fill a whole iteration.

It would be even faster if the array is aligned to a 16-byte boundary -
then you can use _mm_load_pd instead of _mm_loadu_pd.

> ATB,
>
> Michael.

Regards, Tomaž
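
For reference, a minimal self-contained sketch of the intrinsics approach described in the reply above. The function name sumArray and its signature are hypothetical (this is not LibreOffice code), it unrolls to 4 doubles per iteration rather than 8 to stay short, and it assumes nothing about the alignment of p, so it uses the unaligned load _mm_loadu_pd. On targets without SSE2 the sketch falls back to the plain scalar loop.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: __m128d, _mm_loadu_pd, _mm_add_pd, ...
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: returns the sum of the n doubles starting at p.
double sumArray(const double* p, std::size_t n)
{
    std::size_t i = 0;
    double total = 0.0;

#ifdef SKETCH_HAVE_SSE2
    // Two running sums, one per 64-bit lane of the 128-bit register.
    __m128d twosums = _mm_setzero_pd();

    // Unrolled loop: two loads of two doubles each, i.e. 4 values per pass.
    for (; i + 4 <= n; i += 4)
    {
        __m128d first  = _mm_loadu_pd(p + i);      // p[i],   p[i+1]
        __m128d second = _mm_loadu_pd(p + i + 2);  // p[i+2], p[i+3]
        twosums = _mm_add_pd(twosums, first);      // result must be stored
        twosums = _mm_add_pd(twosums, second);
    }

    // Horizontal step: add the two lanes together.
    double lanes[2];
    _mm_storeu_pd(lanes, twosums);
    total = lanes[0] + lanes[1];
#endif

    // Leftover elements (or everything, if SSE2 is unavailable).
    for (; i < n; ++i)
        total += p[i];

    return total;
}
```

If p is known to be 16-byte aligned, _mm_load_pd could replace _mm_loadu_pd in the loop. Because the lanes are summed separately, the result can differ from a strictly left-to-right scalar sum in the last bits for general floating-point input.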
