Andrei:

>That's pretty cool!<

LLVM compiler can't unroll loops that loop a number of times known at compile 
time (I have an open LLVM bug report on this), and even when this bug gets 
fixed, then LLVM isn't able to vectorize yet, so it doesn't use SIMD 
instructions like movapd, mulpd etc that work with two doubles at the same time 
(unless you use intrinsics instructions like _mm_load_pd that are currently not 
available with DMD but can be used with LDC too). So this routine is an 
improvement even on the asm produced by LDC from D code.

But there are some other things you need to consider:

dot5() works with arrays of doubles only; but it's not too much hard to write a 
version for floats, if you want I can write it in some time (I am slow in 
writing asm). A version for reals is probably less important.

dot5() uses the movapd instruction, so it requires the doubles to be aligned to 
16 bytes. Currently it just asserts if data is not aligned, but such assert is 
kind of useless. This problem can be solved in three ways: adding a bit of code 
(in D or asm) in the beginning to use the first unaligned double if present. 
This branch slows down code a bit (such slow down can be felt only for small 
arrays, of course). Otherwise there are other solutions, none of them is 
perfect. 

Currently in D2 big dynamic arrays are not aligned to 16 bytes, but Steven 
Schveighoffer will fix it:
http://d.puremagic.com/issues/show_bug.cgi?id=4400

Even when Steven has fixed the alignment to 16 bytes, you just need to slice 
the array like [1..$] to have an unaligned array. The type system can come to 
the rescue: an AlignedArray (subtype of array) can be invented that guarantees 
alignment to 16 bytes even for its slices (so it just refuses to be sliced on 
odd indexed bounds), then inside the dotProduct() code a static if can test for 
such array and avoid the runtime test (branch) for 16 byte alignment.

Note that future CPUs can drop the requirement for 16 bytes alignments. If this 
happens then AlignedArray can become useless.

That dot5() seems to work, but it needs unittests.

The loop of dot5() is unrolled once, and each instruction works on two doubles, 
so essentially this code is unrolled four times. This means dot5() works only 
if the length of the input arrays is a multiple of 4. A bit of extra trailing 
or leading code (in asm or D, but D is probably enough) needs to be added to 
perform the mul+sum in the few pairs not managed by the main asm loop.

Maybe the D code inside dot5() can be improved a bit.

movapd is a SSE2 instruction, so a SSE2 detection branch can be added, this is 
very easy, you can test is with the sse2 function from the arraydouble module 
of the druntime:
private import core.cpuid;
bool sse2() { return cpuid == 3 && core.cpuid.sse2(); }
The small problem is of course that this is a runtime test, this can slow down 
code a bit when input arrays a small. It's not easy to move this test from 
runtime to compile-time (to compile a module with different versions of the 
function... this can require self-modifying code to change the "static" start 
address of the right function... Some compiler support can improve this 
situation.

Despite dot5() is almost three times faster than std.numeric.dotProduct for 
aligned arrays of about 1400 doubles, for small or very small arrays it's 
slower. So another runtime test (if) can be added inside the function to switch 
to a different algorithm (probably purely written in D) that works efficiently 
on very short arrays.


>Do we have your permission to replace dotProduct with yours?<

Of course :-) Improving std.numerics was my purpose from the beginning. But as 
you see dot5() isn't yet a replacement for dotProduct, some more work needs to 
be done even if dot() contains no bugs.

Bye,
bearophile

Reply via email to