> Also, all uses of SSE2 _mm_loadu_si128() intrinsics were upgraded to
> SSE3 _mm_lddqu_si128().
> The Intel Intrinsics Guide notes that it may perform better when the
> data crosses a cache line boundary.

It turns out that _mm_lddqu_si128() is much slower than _mm_loadu_si128(). It would have been nice if the Intel Intrinsics Guide had mentioned that.

I have marked the v4 patch as Not Applicable and changed the v3 patch back to New.
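
For anyone who wants to compare the two loads on their own hardware, here is a minimal sketch (not code from the patch). Both intrinsics read 16 bytes from an unaligned address and return the same value, so swapping one for the other only affects speed. The function names xor_blocks_loadu() and xor_blocks_lddqu() are made up for illustration, and the target("sse3") attribute assumes GCC or Clang on x86-64. Timing is left to whatever harness you already use.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 */
#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: XOR together nblocks 16-byte blocks starting at p,
 * using the SSE2 unaligned load. Returns the low 32 bits of the result. */
static uint32_t xor_blocks_loadu(const uint8_t *p, size_t nblocks)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < nblocks; i++) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + 16 * i));
        acc = _mm_xor_si128(acc, v);
    }
    return (uint32_t)_mm_cvtsi128_si32(acc);
}

/* Same loop, but with the SSE3 unaligned load. The attribute lets GCC/Clang
 * compile this one function for SSE3 without passing -msse3 globally. */
__attribute__((target("sse3")))
static uint32_t xor_blocks_lddqu(const uint8_t *p, size_t nblocks)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < nblocks; i++) {
        __m128i v = _mm_lddqu_si128((const __m128i *)(p + 16 * i));
        acc = _mm_xor_si128(acc, v);
    }
    return (uint32_t)_mm_cvtsi128_si32(acc);
}
```

Call both functions over the same buffer (ideally one that is deliberately misaligned, e.g. buf + 1), check that the results match, and time each loop separately. The results will depend on the CPU, so measure on the machines you care about.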

