On 25 August 2013 01:01, Ilya Yaroshenko <ilyayaroshe...@gmail.com> wrote:
> On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
>> movups is not good. It'll be a lot faster (and portable) if you use
>> movaps. The process looks something like:
>> * do the first few from a until a's alignment interval as scalar
>> * load the left of b's aligned pair
>> * loop for each aligned vector in a
>> - load a[n..n+4] aligned
>> - load the right of b's pair
>> - combine left~right and shift left to match elements against a
>> - left = right
>> * perform stragglers as scalar
>> Your benchmark is probably misleading too, because I suspect you are
>> passing directly alloc-ed arrays into the function (which are 16 byte
>> aligned). movups will be significantly slower if the pointers supplied
>> are not 16 byte aligned.
>> Also, results vary significantly between chip manufacturers and revisions.
> I have tried to write a fast implementation with aligned loads:
> 1. I have no idea how to shift (rotate) a 32-byte AVX vector without the
> XOP instruction set (XOP is available only on AMD).
> 2. I have tried to use one vmovaps and [one vmovups]/[two vinsertf128]
> with 16-byte aligned arrays (previously iterating over a). It works slower
> than two vmovups (because of the loop tricks). Now I have 300 lines of slow
> dotProduct code =)
This of course depends largely on your processor too.
There is a massive difference between vendors and revisions.
> 4. The condition for small arrays works well.
Did you try putting this path selection logic in an outer function that the
compiler can inline?
Where's your new code with the movaps solution?