> Also, all uses of SSE2 _mm_loadu_si128() intrinsics were upgraded to
> SSE3 _mm_lddqu_si128().
> The Intel Intrinsics Guide notes that it may perform better when the
> data crosses a cache line boundary.

It turns out that _mm_lddqu_si128() is much slower than _mm_loadu_si128().
It would have been nice if the Intel Intrinsics Guide had mentioned that.
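
For anyone who wants to check this on their own hardware, here is a minimal
sketch of the two loads side by side. It is not code from the patch: the
helper names (load16_sse2, load16_sse3, loads_agree) are made up for
illustration, and it assumes an x86-64 build with GCC or Clang, where the
target("sse3") attribute avoids needing -msse3 for the whole file. It only
shows that the two intrinsics return the same 16 bytes from an unaligned
address; which one is faster has to be measured on the target CPU.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 */
#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 */

/* Hypothetical helper: unaligned 16-byte load using the SSE2 intrinsic. */
static inline __m128i load16_sse2(const uint8_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}

/* Hypothetical helper: the same load using the SSE3 intrinsic. */
__attribute__((target("sse3")))
static inline __m128i load16_sse3(const uint8_t *p)
{
    return _mm_lddqu_si128((const __m128i *)p);
}

/* Returns 1 if both intrinsics loaded identical bytes from p, else 0. */
__attribute__((target("sse3")))
static int loads_agree(const uint8_t *p)
{
    __m128i a = load16_sse2(p);
    __m128i b = load16_sse3(p);
    /* cmpeq sets each equal byte to 0xFF; movemask collects the 16 top bits. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

Since the results are identical, the two helpers can be swapped inside a
timing loop over the real data without touching anything else.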

Marked v4 patch as Not Applicable, and changed v3 patch back to New.
