On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
movups is not good. It'll be a lot faster (and portable) if you
use movaps.
Process looks something like:
* do the first few from a[0] until a's alignment interval as scalar
* load the left of b's aligned pair
* loop for each aligned vector in a
  - load a[n..n+4] aligned
  - load the right of b's pair
  - combine left~right and shift left to match elements against a
  - left = right
* perform stragglers as scalar
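
A rough sketch of the steps above in C with SSE2 intrinsics (x86 only), not a tuned implementation. It assumes the operation is an element-wise a[i] += b[i] on float arrays, which may not be what the original function does. The names add_arrays and combine are made up for illustration. _mm_load_ps and _mm_store_ps are the aligned forms (movaps). The sketch also takes a few extra scalar steps so that no aligned load of b reaches outside the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics, x86 only */

/* Build b[i..i+3] from two aligned neighbours.
   k = offset of b+i inside its 16-byte block, in floats (1..3). */
static __m128 combine(__m128 left, __m128 right, size_t k)
{
    __m128i l = _mm_castps_si128(left), r = _mm_castps_si128(right);
    __m128i v;
    switch (k) { /* byte shift counts must be compile-time constants */
    case 1:  v = _mm_or_si128(_mm_srli_si128(l, 4),  _mm_slli_si128(r, 12)); break;
    case 2:  v = _mm_or_si128(_mm_srli_si128(l, 8),  _mm_slli_si128(r, 8));  break;
    default: v = _mm_or_si128(_mm_srli_si128(l, 12), _mm_slli_si128(r, 4));  break;
    }
    return _mm_castsi128_ps(v);
}

/* Hypothetical example operation: a[i] += b[i] for i in [0, n). */
void add_arrays(float *a, const float *b, size_t n)
{
    size_t i = 0;
    assert(((uintptr_t)a & 3) == 0 && ((uintptr_t)b & 3) == 0);

    /* Scalar head: run until a+i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) & 15) != 0) {
        a[i] += b[i];
        ++i;
    }

    /* Where does b+i sit inside its own 16-byte block? */
    size_t k = ((uintptr_t)(b + i) & 15) / sizeof(float);

    if (k == 0) {
        /* Both pointers aligned: plain aligned loads on each side. */
        for (; i + 4 <= n; i += 4)
            _mm_store_ps(a + i, _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
    } else {
        /* Four more scalar steps if the left block would start before b[0].
           This keeps a+i aligned and leaves k unchanged. */
        if (i < k)
            for (size_t s = 0; s < 4 && i < n; ++s, ++i)
                a[i] += b[i];

        /* Run only while the right-hand block b[i-k+4 .. i-k+7] is in bounds. */
        if (i + 8 - k <= n) {
            __m128 left = _mm_load_ps(b + i - k);
            for (; i + 8 - k <= n; i += 4) {
                __m128 right = _mm_load_ps(b + i - k + 4);
                __m128 bv = combine(left, right, k);
                _mm_store_ps(a + i, _mm_add_ps(_mm_load_ps(a + i), bv));
                left = right; /* reuse: each block of b is loaded once */
            }
        }
    }

    /* Scalar stragglers. */
    for (; i < n; ++i)
        a[i] += b[i];
}
```

The switch inside the loop is only there to keep the sketch short. A real version would probably pick one specialised loop per value of k up front. With SSSE3 available, palignr (_mm_alignr_epi8) could replace the two shifts and the OR.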
Your benchmark is probably misleading too, because I suspect
you are passing directly alloc-ed arrays into the function
(which are 16-byte aligned).
movups will be significantly slower if the pointers supplied
are not 16-byte aligned.
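
One way to check this in a benchmark is to test the pointers explicitly and also time the function on deliberately offset ones. A minimal sketch in C; is_aligned16 is a made-up helper name, not something from the original code:

```c
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Passing buf + 1 instead of buf for a float array (a 4-byte offset) should be enough to exercise the unaligned case.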
Also, results vary significantly between chip manufacturers and
revisions.
I'll try =). Thank you very much!