bearophile wrote:
A version for floats. A version for reals can't be done with SSE* registers.
This loop is unrolled two times, and each SSE2 register holds 4 floats, so it
performs 8 mul+adds per iteration. Again, this code is slower for short arrays,
but not by much.

A version of the code with no unrolling (performing only 4 mul+adds per
iteration) is a little better for short arrays. To create it, just change
UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3, and add
only 16 to EDX each iteration.
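As a rough illustration of the loop structure (not the asm itself), here is a minimal scalar sketch in C: two accumulators mirror the two unrolled XMM register groups, each handling 4 mul+adds, for 8 per iteration, with a trailing scalar loop for the stray elements. The function name and the UNROLL_MASK constant are taken over for illustration; everything else is an assumption.

```c
#include <stddef.h>

/* Scalar sketch of the unrolled-by-two structure.  acc0/acc1 stand in
 * for the two groups of XMM accumulators; the real code does the four
 * mul+adds of each group in one SSE2 instruction pair. */
float dot_unrolled(const float *a, const float *b, size_t len)
{
    enum { UNROLL_MASK = 0x7 };                /* 8 elements per iteration */
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i;
    size_t main_len = len & ~(size_t)UNROLL_MASK;

    for (i = 0; i < main_len; i += 8) {
        /* first group of 4 mul+adds (one XMM register's worth) */
        acc0 += a[i+0]*b[i+0] + a[i+1]*b[i+1]
              + a[i+2]*b[i+2] + a[i+3]*b[i+3];
        /* second group of 4 (the unrolled copy) */
        acc1 += a[i+4]*b[i+4] + a[i+5]*b[i+5]
              + a[i+6]*b[i+6] + a[i+7]*b[i+7];
    }
    for (; i < len; i++)                       /* stray elements */
        acc0 += a[i] * b[i];
    return acc0 + acc1;
}
```

The non-unrolled variant described above corresponds to UNROLL_MASK = 0x3, a single accumulator group, and a step of 4.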

The asserts assert((cast(size_t)... can be replaced by a loop that performs
the unaligned mul+adds first and then adjusts len, a_ptr and b_ptr to point at
the remaining aligned portion.

You already have a loop at the end that takes care of the stray elements. Why not move it to the beginning to take care of the stray elements _and_ unaligned elements in one shot?
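A hedged C sketch of that suggestion: a scalar prologue peels elements until the first pointer reaches 16-byte alignment, then hands the remainder to the vector kernel, which still ends with its stray-element loop. The names (dot_head_aligned, vector_kernel, scalar_dot) are all hypothetical; the alignment test assumes 16-byte SSE2 loads.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fallback kernel, standing in for the SSE2 asm routine. */
static float scalar_dot(const float *a, const float *b, size_t len)
{
    float s = 0.0f;
    for (size_t i = 0; i < len; i++)
        s += a[i] * b[i];
    return s;
}

/* Peel unaligned head elements with a scalar loop, then run the fast
 * kernel on the aligned remainder.  Caveat: only `a` is forced into
 * alignment here; `b` stays aligned only if both arrays share the same
 * misalignment, otherwise the kernel must use unaligned loads for it. */
static float dot_head_aligned(const float *a, const float *b, size_t len,
                              float (*vector_kernel)(const float *,
                                                     const float *, size_t))
{
    float acc = 0.0f;
    while (len > 0 && ((uintptr_t)a & 0xF) != 0) {  /* not 16-byte aligned */
        acc += *a++ * *b++;
        len--;
    }
    return acc + vector_kernel(a, b, len);
}
```

Whether this actually merges the two cleanups depends on the tail: the stray elements at the end (len not a multiple of the unroll width) still need their own loop inside the kernel, so the prologue removes the alignment asserts rather than the epilogue.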

Andrei
