On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
movups is not good. It'll be a lot faster (and portable) if you
Process looks something like:
* do the first few from a until a's alignment interval as
* load the left of b's aligned pair
* loop for each aligned vector in a
- load a[n..n+4] aligned
- load the right of b's pair
- combine left~right and shift left to match elements
- left = right
* perform stragglers as scalar
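As a rough illustration of the steps listed above, here is a minimal C sketch using SSE2 intrinsics (C rather than D, and 16-byte vectors rather than AVX). The names dot_aligned_loads and combine are hypothetical, the code assumes float-aligned pointers, and it is not the code discussed in this thread; it makes no claim about speed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Build the 4 floats starting k elements into `left` from two adjacent
   aligned vectors of b; k is b's offset in floats (1..3). */
static inline __m128 combine(__m128 left, __m128 right, unsigned k)
{
    __m128i l = _mm_castps_si128(left);
    __m128i r = _mm_castps_si128(right);
    switch (k) {  /* byte-shift counts must be compile-time constants */
    case 1:
        return _mm_castsi128_ps(_mm_or_si128(_mm_srli_si128(l, 4), _mm_slli_si128(r, 12)));
    case 2:
        return _mm_castsi128_ps(_mm_or_si128(_mm_srli_si128(l, 8), _mm_slli_si128(r, 8)));
    default:
        return _mm_castsi128_ps(_mm_or_si128(_mm_srli_si128(l, 12), _mm_slli_si128(r, 4)));
    }
}

float dot_aligned_loads(const float *a, const float *b, size_t n)
{
    /* Assumption: both pointers are at least float-aligned (4 bytes). */
    assert((uintptr_t)a % sizeof(float) == 0);
    assert((uintptr_t)b % sizeof(float) == 0);

    float sum = 0.0f;
    size_t i = 0;

    /* Scalar prologue until a + i sits on a 16-byte boundary. */
    for (; i < n && ((uintptr_t)(a + i) & 15) != 0; ++i)
        sum += a[i] * b[i];

    __m128 acc = _mm_setzero_ps();
    /* Offset of b + i inside its 16-byte block, in floats (0..3). */
    unsigned k = (unsigned)(((uintptr_t)(b + i) & 15) / sizeof(float));

    if (k == 0) {
        /* b happens to share a's alignment: aligned loads on both sides. */
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
    } else if (i + 4 <= n) {
        const float *bb = b + i - k;      /* b rounded down to its aligned block */
        __m128 left = _mm_load_ps(bb);    /* left half of b's aligned pair */
        for (; i + 4 <= n; i += 4) {
            bb += 4;
            __m128 right = _mm_load_ps(bb);          /* right half of the pair */
            __m128 vb = combine(left, right, k);     /* the 4 floats at b + i */
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), vb));
            left = right;
        }
    }

    /* Horizontal sum of the accumulator, then the scalar stragglers. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum += (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

When b is misaligned relative to a, the aligned loads on b read a few floats before b + i and past b + n. Those reads stay inside 16-byte blocks that also hold valid data, but they are out of bounds as far as the C standard is concerned, so treat this as a sketch of the idea only.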
Your benchmark is probably misleading too, because I suspect
passing directly alloc-ed arrays into the function (which are
movups will be significantly slower if the pointers supplied
are not 16
Also, results vary significantly between chip manufacturers and
I have tried to write a fast implementation with aligned loads:
1. I have no idea how to shift (rotate) a 32-byte AVX vector
without the XOP instruction set (XOP is available only on AMD).
2. I have tried to use one vmovaps and [one vmovups]/[two
vinsertf128] with 16-byte aligned arrays (after first iterating
over a). It is slower than two vmovups (because of the loop tricks).
3. Now I have 300 lines of slow dotProduct code =)
4. The condition for small arrays works well.
I think it is better to use:
1. vmovups if it is available, with a condition for small arrays
2. a version like the one from phobos if vmovups is not available
3. a special version for small static-size arrays
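The selection order above might be sketched like this in C. Everything here is an assumption made for illustration: the function names, the SMALL_N threshold of 16, the use of the GCC/Clang builtin __builtin_cpu_supports on x86, and a plain scalar loop standing in for the phobos-style version.

```c
#include <assert.h>
#include <immintrin.h>  /* AVX intrinsics */
#include <stddef.h>

/* Plain loop: used for short inputs and as the fallback without AVX. */
static float dot_scalar(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* AVX kernel built on unaligned 32-byte loads (vmovups).
   The target attribute lets it compile without -mavx (GCC/Clang only). */
__attribute__((target("avx")))
static float dot_avx_unaligned(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));

    /* Horizontal sum of the 8 lanes, then the scalar stragglers. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; ++k)
        s += lanes[k];
    for (; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

enum { SMALL_N = 16 };  /* hypothetical threshold; would need tuning by measurement */

float dot_product(const float *a, const float *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    if (n < SMALL_N)
        return dot_scalar(a, b, n);          /* condition for small arrays */
    if (__builtin_cpu_supports("avx"))
        return dot_avx_unaligned(a, b, n);   /* vmovups path */
    return dot_scalar(a, b, n);              /* fallback path */
}
```

The run-time check is done on every call here for simplicity; a real implementation would more likely resolve it once and cache the chosen function.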
I think a version for static-size arrays can easily be done for
phobos; processors can unroll such code. A dot product
optimized for complex numbers can be done too.
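A version for static sizes could look roughly like the following C sketch. The macro DEFINE_DOT_FIXED and the generated names dot_fixed_3 and dot_fixed_4 are hypothetical. The point is only that the trip count is a compile-time constant, which gives an optimizing compiler the chance to unroll the loop completely; whether it does so depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: generates dot_fixed_<N> for a compile-time length N. */
#define DEFINE_DOT_FIXED(N)                                               \
    static_assert((N) > 0, "length must be positive");                    \
    static inline float dot_fixed_##N(const float a[N], const float b[N]) \
    {                                                                     \
        float s = 0.0f;                                                   \
        for (size_t i = 0; i < (N); ++i) /* constant trip count */        \
            s += a[i] * b[i];                                             \
        return s;                                                         \
    }

DEFINE_DOT_FIXED(3)
DEFINE_DOT_FIXED(4)
```

For example, with x = {1, 2, 3, 4} and y = {5, 6, 7, 8}, dot_fixed_4(x, y) gives 70 and dot_fixed_3(x, y) gives 38.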