bearophile wrote:
A version for floats (a version for reals can't be done with SSE* registers,
since real is the 80-bit x87 type). This loop is unrolled twice, and each
SSE2 register holds 4 floats, so it performs 8 mul+adds per loop iteration.
As before, this code is a little slower for short arrays, but not by much.
A version of the code with no unrolling (performing only 4 mul+adds per
iteration) is a little better for short arrays; to create it, change
UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3, and add
only 16 to EDX each iteration.
The alignment asserts, assert((cast(size_t)..., can be replaced by a scalar
loop that performs the unaligned muls+adds and then advances len, a_ptr and
b_ptr past them to the remaining aligned portion.
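For reference, the unrolled kernel described above could be sketched with SSE2 intrinsics in C (dot_sse2_unrolled is a hypothetical name, not the original D asm; it assumes both pointers are 16-byte aligned and len is a multiple of 8, i.e. the preconditions the asserts and UNROLL_MASK 0b111 enforce):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical C sketch of the unrolled loop, not the original D asm. */
float dot_sse2_unrolled(const float *a, const float *b, size_t len)
{
    __m128 acc0 = _mm_setzero_ps();  /* first accumulator */
    __m128 acc1 = _mm_setzero_ps();  /* second accumulator (the unrolled half) */
    for (size_t i = 0; i < len; i += 8) {
        /* two independent 4-wide mul+adds = 8 floats per iteration */
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_load_ps(a + i),
                                           _mm_load_ps(b + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_load_ps(a + i + 4),
                                           _mm_load_ps(b + i + 4)));
    }
    acc0 = _mm_add_ps(acc0, acc1);
    float lanes[4];
    _mm_storeu_ps(lanes, acc0);  /* horizontal sum of the four lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Keeping two independent accumulators is what lets the two mul+add chains overlap in the pipeline; a single accumulator would serialize on the adds.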
You already have a loop at the end that takes care of the stray
elements. Why not move it to the beginning to take care of the stray
elements _and_ unaligned elements in one shot?
Andrei
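That prologue idea could be sketched like this in C intrinsics (dot_mixed is a hypothetical name; it fixes up only a's alignment and reads b with unaligned loads, so the two arrays need not share an alignment offset). One subtlety: a single contiguous scalar prefix can't in general leave a body that is both aligned and a multiple of 4, so the stray elements are still addressed at the end of the arrays; the scalar pass is merely hoisted to the front:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the original D asm: do all scalar work up front. */
float dot_mixed(const float *a, const float *b, size_t len)
{
    float sum = 0.0f;
    /* floats to skip until a is 16-byte aligned (0..3) */
    size_t head = ((16 - ((uintptr_t)a & 15)) / sizeof(float)) & 3;
    if (head > len)
        head = len;
    /* stray elements left over after the aligned body */
    size_t tail = (len - head) & 3;

    /* scalar pass, done up front: unaligned head plus stray tail */
    for (size_t i = 0; i < head; ++i)
        sum += a[i] * b[i];
    for (size_t i = len - tail; i < len; ++i)
        sum += a[i] * b[i];

    /* aligned vector body over the middle, 4 floats per iteration */
    __m128 acc = _mm_setzero_ps();
    for (size_t i = head; i < len - tail; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return sum + lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

If both arrays do share the same offset, the _mm_loadu_ps on b can become _mm_load_ps as in the original code.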