Note that the BLAS dot product probably uses all sorts of tricks to squeeze 
the last cycle of SIMD performance out of the CPU.  For example, here is the 
OpenBLAS ddot kernel for Sandy Bridge, which is hand-coded in inline assembly:

https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/ddot_microk_sandy-2.c

Getting the last 30% or so of this performance can be extremely tricky.
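
To give a rough idea of one such trick (this is only a sketch, not the OpenBLAS kernel), here is a plain C dot product that keeps four independent accumulators so that the additions do not all wait on a single running sum. The name ddot_unrolled is hypothetical and the unroll factor of 4 is an arbitrary choice for illustration:

```c
#include <stddef.h>

/* Hypothetical helper, not part of OpenBLAS: dot product with four
   independent accumulators to break the dependency chain on the sum. */
double ddot_unrolled(size_t n, const double *x, const double *y)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        s0 += x[i] * y[i];

    /* Combine the partial sums at the end. */
    return (s0 + s1) + (s2 + s3);
}
```

Note that this sums the terms in a different order than a naive loop, so the result can differ in the last few bits. That reordering is also why a compiler will generally not make this transformation on its own unless you allow it to reassociate floating-point arithmetic.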
