On 25 August 2013 01:01, Ilya Yaroshenko <ilyayaroshe...@gmail.com> wrote:
> On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
>> movups is not good. It'll be a lot faster (and portable) if you use
>> movaps. The process looks something like:
>> * do the first few from a until a's alignment interval as scalar
>> * load the left of b's aligned pair
>> * loop for each aligned vector in a
>> - load a[n..n+4] aligned
>> - load the right of b's pair
>> - combine left~right and shift left to match elements against a
>> - left = right
>> * perform stragglers as scalar
>> Your benchmark is probably misleading too, because I suspect you are
>> passing directly alloc-ed arrays into the function (which are 16 byte
>> aligned). movups will be significantly slower if the pointers supplied
>> are not 16 byte aligned.
>> Also, results vary significantly between chip manufacturers and revisions.
> I have tried to write a fast implementation with aligned loads:
> 1. I have no idea how to shift (rotate) a 32-byte AVX vector without the
> XOP instruction set (XOP is available only on AMD).
> 2. I have tried to use one vmovaps and [one vmovups]/[two vinsertf128]
> with 16-byte aligned arrays (previously iterating over a). It works slower
> than two vmovups (because of the loop tricks). Now I have 300 lines of slow
> dotProduct code =)
This of course depends largely on your processor too.
There is a massive difference between vendors and revisions.
> 4. The condition for small arrays works well.
Did you try putting this path selection logic in an outer function that the
compiler can inline?
Where's your new code with the movaps solution?