On Fri, Jun 1, 2012 at 2:26 PM, Ronald S. Bultje <[email protected]> wrote:
> Hi,
>
> On Tue, May 1, 2012 at 1:49 PM, Justin Ruggles <[email protected]> wrote:
>> +.loop:
>> +    mulps     m0, m4, [srcq+2*lenq         ]
>> +    mulps     m1, m4, [srcq+2*lenq+1*mmsize]
>> +    mulps     m2, m4, [srcq+2*lenq+2*mmsize]
>> +    mulps     m3, m4, [srcq+2*lenq+3*mmsize]
>> +    cvtps2dq  m0, m0
>> +    cvtps2dq  m1, m1
>> +    cvtps2dq  m2, m2
>> +    cvtps2dq  m3, m3
>
> Is this (load+mul in the same instruction) actually faster than load x4,
> followed by mul x4? The load latency may make this slower, even though
> it's fewer instructions.

Fewer instructions can't be worse.  At worst, each load+mul is broken up
into separate load and multiply uops internally, and it's still less
instruction decode bandwidth.
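
For readers who want to experiment with the multiply-and-convert step of
the quoted loop outside of assembly, here is a minimal sketch in C with
SSE2 intrinsics. The function name scale_to_s32 and its signature are
assumptions made for illustration and are not part of the patch. With
intrinsics the choice discussed above is left to the compiler: it may
fold the aligned load into the multiply or emit a separate load first.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_cvtps_epi32 maps to cvtps2dq */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch: multiply each float by
 * `scale` and convert to int32 using the current rounding mode, as
 * cvtps2dq does. src and dst must be 16-byte aligned and len must be a
 * multiple of 4. */
static void scale_to_s32(int32_t *dst, const float *src, size_t len,
                         float scale)
{
    const __m128 vscale = _mm_set1_ps(scale);  /* plays the role of m4 */

    assert(len % 4 == 0);
    for (size_t i = 0; i < len; i += 4) {
        /* The compiler is free to emit either a mulps with a memory
         * operand or a separate movaps followed by mulps here. */
        __m128 v = _mm_mul_ps(vscale, _mm_load_ps(src + i));
        _mm_store_si128((__m128i *)(dst + i), _mm_cvtps_epi32(v));
    }
}
```

Inspecting the compiler's assembly output for this function is one way to
see which of the two forms was chosen for a given target.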

Jason
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
