On 2012-05-09, at 12:57 PM, Søren Sandmann wrote:

> Matt Turner <[email protected]> writes:
> 
>> I started porting my src_8888_0565 MMX function to SSE2, and in the
>> process started thinking about using SSE3+. The useful instructions
>> added post SSE2 that I see are
>>      SSE3:   lddqu - for unaligned loads across cache lines
> 
> I don't really understand that instruction. Isn't it identical to
> movdqu?  Or is the idea that lddqu is faster than movdqu for cache line
> splits, but slower for plain old, non-cache split unaligned loads?

"The instructions movdqu, movups, movupd and lddqu are all able to read 
unaligned vectors. lddqu is faster than the alternatives on P4E and PM 
processors, but requires the SSE3 instruction set. The unaligned read 
instructions are relatively slow on older processors, but faster on Nehalem, 
Sandy Bridge and on future AMD and Intel processors."

From http://www.agner.org/optimize/optimizing_assembly.pdf
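
For illustration, here is a minimal sketch of how the two loads could be selected at compile time with the standard intrinsics. The helper name load16_unaligned is hypothetical and is not taken from pixman; the sketch assumes the compiler defines __SSE3__ / __SSE2__ when those instruction sets are enabled (as GCC and Clang do with -msse3 / -msse2), and it falls back to memcpy elsewhere. Both intrinsics return the same 16 bytes, so any difference is purely one of speed on a given processor.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE3__)
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 */
#elif defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128 */
#endif

/* Hypothetical helper: copy 16 bytes from a possibly unaligned
 * address into out[], using the widest unaligned load available. */
static void load16_unaligned(const void *src, uint8_t out[16])
{
#if defined(__SSE3__)
    /* lddqu: same result as movdqu, may be faster on P4E/PM */
    __m128i v = _mm_lddqu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)out, v);
#elif defined(__SSE2__)
    /* movdqu: plain SSE2 unaligned load */
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)out, v);
#else
    /* Assumption: no SSE available, use a portable byte copy */
    memcpy(out, src, 16);
#endif
}
```

Since the result is identical in every branch, a caller does not need to know which instruction was picked; only benchmarking on the target CPU would show whether lddqu is worth the SSE3 requirement.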

-Jeff
_______________________________________________
Pixman mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/pixman