On Sunday, 18 August 2013 at 05:07:12 UTC, Manu wrote:
On 18 August 2013 14:39, Ilya Yaroshenko <[email protected]> wrote:
On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
It doesn't look like you account for alignment.
This is basically non-portable (I doubt unaligned loads in this context are faster than performing scalar operations), and possibly inefficient on x86 too.
dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and it is up to 21 times faster than the trivial scalar version.
Why are unaligned loads non-portable and inefficient?
x86 is the only arch that can perform an unaligned load. And even on x86 (many implementations) it's not very efficient.
:(
Making it account for potentially random alignment will be awkward, but it might be possible to do efficiently.
Did you mean to use unaligned loads, or to prepare the data for aligned loads at the beginning of the function?
I mean to only use aligned loads, in whatever way that happens to work out.
The hard case is when the two arrays have different start offsets.
Otherwise you need to wrap your code in a version(x86) block.
Thanks!
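For reference, here is a rough sketch of the "only use aligned loads" idea discussed above. It is written in portable C rather than D, and it is not the actual dotProduct code. The names `dot_aligned_only`, `dot_scalar`, `is_aligned` and the 16-byte `VEC_BYTES` width are assumptions made up for the illustration. The inner 4-wide loop stands in for an aligned vector load (on x86 something like `_mm_load_ps`). When the two arrays have different start offsets, this sketch simply falls back to the scalar loop instead of solving the hard case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16u /* assumed vector width: 4 floats */

/* Returns nonzero when p sits on a VEC_BYTES boundary. */
static int is_aligned(const void *p) {
    return ((uintptr_t)p % VEC_BYTES) == 0;
}

/* Plain scalar dot product: used for the head, the tail and the fallback. */
static float dot_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical helper: dot product that touches the main loop's data
   only at VEC_BYTES-aligned addresses. */
float dot_aligned_only(const float *a, const float *b, size_t n) {
    /* Hard case: different start offsets modulo VEC_BYTES. No single peel
       count aligns both arrays, so this sketch gives up and goes scalar. */
    if ((uintptr_t)a % VEC_BYTES != (uintptr_t)b % VEC_BYTES)
        return dot_scalar(a, b, n);

    /* Peel scalar elements until a (and therefore b) is aligned. */
    size_t head = 0;
    while (head < n && !is_aligned(a + head))
        ++head;
    float sum = dot_scalar(a, b, head);

    /* Main loop: 4 floats per step, always starting on an aligned address.
       A real version would use aligned vector loads here. */
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = head;
    for (; i + 4 <= n; i += 4) {
        assert(is_aligned(a + i) && is_aligned(b + i));
        for (int k = 0; k < 4; ++k)
            acc[k] += a[i + k] * b[i + k];
    }
    sum += acc[0] + acc[1] + acc[2] + acc[3];

    /* Scalar tail for the last n % 4 leftovers. */
    sum += dot_scalar(a + i, b + i, n - i);
    return sum;
}
```

Calling `dot_aligned_only(x + 1, y + 1, n)` on two buffers with the same alignment takes the peeled path; `dot_aligned_only(x + 1, y + 2, n)` hits the mismatched-offset fallback. Note that the partial sums are added in a different order than in the scalar loop, so results can differ in the last bits for general floating-point data.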