On 25 August 2013 01:01, Ilya Yaroshenko <ilyayaroshe...@gmail.com> wrote:

> On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
>
>> movups is not good. It'll be a lot faster (and portable) if you use
>> movaps.
>>
>> Process looks something like:
>>   * do the first few from a[0] until a's alignment interval as scalar
>>   * load the left of b's aligned pair
>>   * loop for each aligned vector in a
>>     - load a[n..n+4] aligned
>>     - load the right of b's pair
>>     - combine left~right and shift left to match elements against a
>>     - left = right
>>   * perform stragglers as scalar
>>
>> Your benchmark is probably misleading too, because I suspect you are
>> passing directly alloc-ed arrays into the function (which are 16 byte
>> aligned).
>> movups will be significantly slower if the pointers supplied are not 16
>> byte aligned.
>> Also, results vary significantly between chip manufacturers and revisions.
>>
>
>
> I have tried to write a fast implementation with aligned loads:
> 1. I have no idea how to shift (rotate) a 32-byte AVX vector without the XOP
> instruction set (XOP is available only on AMD).
> 2. I have tried to use one vmovaps and [one vmovups]/[two vinsertf128]
> with 16-byte aligned arrays (previously iterating with a). It works slower
> than two vmovups (because of loop tricks). Now I have 300 lines of slow
> dotProduct code =)
>

This of course depends largely on your processor too. What
processor/revision?
There is a massive difference between vendors and revisions.


> 4. The condition for small arrays works well.
>

Did you try putting this path selection logic in an outer function that the
compiler can inline?
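
A minimal sketch of what such an outer function could look like, written in
C for illustration. The names dot_small, dot_large and dot_product and the
SMALL_LEN threshold are made up for this example, and dot_large is only a
stand-in for a real SIMD kernel:

```c
#include <stddef.h>

/* Hypothetical cut-off; the right value has to be measured per CPU. */
enum { SMALL_LEN = 16 };

/* Plain scalar loop: cheapest choice when the array is short. */
static float dot_small(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Stand-in for the vectorised kernel: four independent accumulators. */
static float dot_large(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)          /* stragglers */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

/* Small outer function: only the length test lives here, so the compiler
   can inline it at the call site and drop the branch when n is known. */
static inline float dot_product(const float *a, const float *b, size_t n)
{
    return n < SMALL_LEN ? dot_small(a, b, n) : dot_large(a, b, n);
}
```

Keeping the wrapper this small is the point: the heavy kernels stay out of
line, and only the cheap length check is duplicated at each call site.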

Where's your new code with the movaps solution?
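
For reference, the aligned-load process quoted above might be sketched
roughly as follows, in C with SSE intrinsics. This is an illustration under
stated assumptions, not a tuned implementation: it uses 16-byte SSE vectors
rather than AVX, assumes an x86 target with SSE and pointers that are at
least float-aligned, and dot_aligned and combine are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_shuffle_ps, ... */

/* Rebuild the four b elements that pair with an aligned a vector from two
   neighbouring aligned b vectors. k is b's offset in floats (1..3). */
static __m128 combine(__m128 l, __m128 r, unsigned k)
{
    __m128 t = _mm_shuffle_ps(l, r, _MM_SHUFFLE(0, 0, 3, 3)); /* l3 l3 r0 r0 */
    switch (k) {
    case 1:  return _mm_shuffle_ps(l, t, _MM_SHUFFLE(2, 0, 2, 1)); /* l1 l2 l3 r0 */
    case 2:  return _mm_shuffle_ps(l, r, _MM_SHUFFLE(1, 0, 3, 2)); /* l2 l3 r0 r1 */
    default: return _mm_shuffle_ps(t, r, _MM_SHUFFLE(2, 1, 2, 0)); /* l3 r0 r1 r2 */
    }
}

static float dot_aligned(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    /* Scalar prologue until a + i sits on a 16-byte boundary. */
    for (; i < n && ((uintptr_t)(a + i) & 15) != 0; ++i)
        sum += a[i] * b[i];

    /* Offset of b + i, in floats, from the aligned address at or below it. */
    unsigned k = (unsigned)(((uintptr_t)(b + i) & 15) / sizeof(float));
    __m128 acc = _mm_setzero_ps();

    if (k == 0) {
        /* Both pointers aligned: plain aligned loads on each side. */
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                             _mm_load_ps(b + i)));
    } else if (i + 8 <= n + k) {
        /* Left half of b's aligned pair, copied element by element so that
           nothing in front of b[0] is read. */
        float tmp[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (unsigned j = k; j < 4; ++j)
            tmp[j] = b[i + j - k];
        __m128 left = _mm_loadu_ps(tmp);

        /* Main loop: runs only while the right-hand load stays inside b. */
        for (; i + 8 <= n + k; i += 4) {
            __m128 right = _mm_load_ps(b + i + 4 - k);
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                             combine(left, right, k)));
            left = right;
        }
    }

    /* Horizontal sum of the accumulator, then scalar stragglers. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum += (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The switch inside combine exists because shuffle selectors must be
compile-time constants; a real kernel would more likely hoist the choice out
of the loop and keep one loop per offset.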
