Note that the BLAS dot product probably uses all sorts of tricks to squeeze the last cycle of SIMD performance out of the CPU. For example, here is the OpenBLAS ddot kernel for Sandy Bridge, which is hand-coded in assembly:
https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/ddot_microk_sandy-2.c
Getting the last 30% or so of this performance can be extremely tricky.
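
To give a feel for one such trick, here is a minimal sketch in plain C of splitting the sum across several independent accumulators. This is not the OpenBLAS code, and the names `dot_naive` and `dot_unrolled4` are made up for illustration. The idea is that a single accumulator makes every addition wait on the previous one, whereas independent partial sums can overlap in the pipeline and map more naturally onto SIMD lanes. Under strict floating-point semantics a compiler is generally not allowed to reorder the additions itself, so this usually has to be written out by hand (or enabled with a relaxed-math flag).

```c
#include <stddef.h>

/* Straightforward loop: one accumulator, so each add depends on the last. */
double dot_naive(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Illustrative variant: four independent partial sums, combined at the end.
   The unroll factor of 4 is an arbitrary choice for this sketch. */
double dot_unrolled4(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }

    double s = (s0 + s1) + (s2 + s3);

    /* Remainder loop: handle the last n % 4 elements. */
    for (; i < n; i++)
        s += x[i] * y[i];

    return s;
}
```

Because the order of the additions changes, the two functions can differ in the last few bits for general inputs. A tuned kernel goes well beyond this (wider vectors, alignment handling, more unrolling), which is roughly where that last 30% comes from.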
