On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
It doesn't look like you account for alignment.
This is basically not-portable (I doubt unaligned loads in this
case are faster than performing scalar operations), and possibly
inefficient on x86.
dotProduct uses unaligned loads (__builtin_ia32_loadups256,
__builtin_ia32_loadupd256) and it is up to 21 times faster than
the trivial scalar version.
Why are unaligned loads non-portable and inefficient?
To make it account for potentially random alignment will be
awkward, but it might be possible to do efficiently.
Did you mean using unaligned loads, or preparing the data for
aligned loads at the beginning of the function?
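
To make the second option concrete, here is a rough sketch in plain C of what "preparing for aligned loads at the beginning of the function" could look like. It is not the actual dotProduct code: dot_peeled, VEC_BYTES and VEC_FLOATS are hypothetical names, and the inner block is ordinary scalar code standing in for where a 256-bit load and multiply would go. It also shows the awkward part: peeling can only align one of the two arrays, so the other would still need an unaligned load unless both happen to share the same offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants: one 256-bit vector holds 32 bytes, i.e. 8 floats. */
enum { VEC_BYTES = 32, VEC_FLOATS = VEC_BYTES / sizeof(float) };

/* Hypothetical helper, not the dotProduct discussed in this thread. */
float dot_peeled(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    /* Prologue: scalar steps until a + i sits on a 32-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) % VEC_BYTES) != 0) {
        sum += a[i] * b[i];
        ++i;
    }

    /* Main loop: a + i is 32-byte aligned here, so a vectorised version
       could use an aligned load for a. Nothing is known about b + i,
       which may still require an unaligned load. */
    for (; i + VEC_FLOATS <= n; i += VEC_FLOATS) {
        float acc = 0.0f;
        for (size_t k = 0; k < VEC_FLOATS; ++k)
            acc += a[i + k] * b[i + k];
        sum += acc;
    }

    /* Epilogue: fewer than 8 elements remain. */
    for (; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

Whether the prologue pays for itself presumably depends on the array length and on how costly unaligned loads are on the target, so it would need measuring against the current version.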