Re: SIMD implementation of dot-product. Benchmarks

Ilya Yaroshenko Sat, 24 Aug 2013 08:05:48 -0700

On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:

movups is not good. It'll be a lot faster (and portable) if you use movaps.
Process looks something like:
  * do the first few from a[0] until a's alignment interval as scalar
  * load the left of b's aligned pair
  * loop for each aligned vector in a
    - load a[n..n+4] aligned
    - load the right of b's pair
    - combine left~right and shift left to match elements against a
    - left = right
  * perform stragglers as scalar
Your benchmark is probably misleading too, because I suspect you are passing directly alloc-ed arrays into the function (which are 16 byte
aligned).
movups will be significantly slower if the pointers supplied are not 16
byte aligned.
Also, results vary significantly between chip manufacturers and revisions.
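
A minimal C sketch of the head/body/tail structure from the quoted process, for illustration only. It is a simplified variant: only a is aligned (scalar head, then movaps-style loads), while b is still read with an unaligned load instead of the combine-and-shift step. The function name dot_aligned_a is hypothetical, and the SSE path is compiled only where the compiler defines __SSE__.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper, not taken from the thread. */
float dot_aligned_a(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    /* Head: scalar steps until a + i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) & 15u) != 0) {
        sum += a[i] * b[i];
        ++i;
    }

#if defined(__SSE__)
    /* Body: aligned load from a (movaps), unaligned load from b (movups). */
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    /* Horizontal sum of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum += (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif

    /* Tail: stragglers as scalar (the whole remainder when SSE is absent). */
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Replacing the unaligned load of b with two aligned loads and a shift would require a shift amount that depends on the relative offset of a and b, which this sketch does not attempt.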



I have tried to write a fast implementation with aligned loads:

1. I have no idea how to shift (rotate) a 32-byte AVX vector without the XOP instruction set (XOP is available only on AMD).
2. I have tried to use one vmovaps and [one vmovups]/[two vinsertf128] with 16-byte aligned arrays (after first iterating with a). It works slower than two vmovups (because of loop tricks).
3. Now I have 300 lines of slow dotProduct code =)
4. The condition for small arrays works well.


I think it is better to use:

1. vmovups, if it is available, with a condition for small arrays
2. a version like the one in Phobos, if vmovups is not available
3. a special version for small static-size arrays
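
A rough C sketch of how the first two points might fit together, under stated assumptions: the cutoff SMALL_ARRAY_LIMIT is a made-up value that would need benchmarking, all function names are hypothetical, the plain loop merely stands in for a Phobos-like fallback, and the choice of path is made at compile time via __AVX__ rather than by runtime CPU detection.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical cutoff; the right value depends on the target CPU. */
enum { SMALL_ARRAY_LIMIT = 16 };

/* Plain loop, standing in for a non-AVX fallback. */
static float dot_fallback(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

#if defined(__AVX__)
/* Unaligned 32-byte loads (vmovups) for both operands. */
static float dot_avx_unaligned(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) {
        sum += lanes[k];
    }
    for (; i < n; ++i) {          /* scalar stragglers */
        sum += a[i] * b[i];
    }
    return sum;
}
#endif

float dot_product(const float *a, const float *b, size_t n)
{
    /* Small arrays skip the vector setup entirely. */
    if (n < SMALL_ARRAY_LIMIT) {
        return dot_fallback(a, b, n);
    }
#if defined(__AVX__)
    return dot_avx_unaligned(a, b, n);
#else
    return dot_fallback(a, b, n);
#endif
}
```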

I think a version for static-size arrays can easily be done for Phobos; processors can unroll such code. A dot product optimized for complex numbers can be done too.
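
To illustrate both ideas, here is a small C sketch (not Phobos code). The macro DEFINE_DOT_FIXED and the names dot3, dot4 and cdot are hypothetical. The fixed-size variants have a trip count known at compile time, which leaves the loop open to full unrolling. The complex version assumes the unconjugated convention, sum of a[i]*b[i]; a conjugated dot product would flip the sign of one imaginary part.

```c
#include <complex.h>
#include <stddef.h>

/* Generates a dot product whose length N is a compile-time constant. */
#define DEFINE_DOT_FIXED(NAME, N)                                  \
    static inline float NAME(const float a[N], const float b[N])  \
    {                                                              \
        float sum = 0.0f;                                          \
        for (size_t i = 0; i < (N); ++i) {                         \
            sum += a[i] * b[i];                                    \
        }                                                          \
        return sum;                                                \
    }

DEFINE_DOT_FIXED(dot3, 3)
DEFINE_DOT_FIXED(dot4, 4)

/* Complex dot product with separate real and imaginary accumulators.
   Assumption: no conjugation is applied to either operand. */
float complex cdot(const float complex *a, const float complex *b, size_t n)
{
    float re = 0.0f;
    float im = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float ar = crealf(a[i]), ai = cimagf(a[i]);
        float br = crealf(b[i]), bi = cimagf(b[i]);
        re += ar * br - ai * bi;
        im += ar * bi + ai * br;
    }
    return re + im * I;
}
```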


Best regards

Ilya
