On Fri, Jun 1, 2012 at 2:26 PM, Ronald S. Bultje <[email protected]> wrote:
> Hi,
>
> On Tue, May 1, 2012 at 1:49 PM, Justin Ruggles <[email protected]>
> wrote:
>> +.loop:
>> +    mulps      m0, m4, [srcq+2*lenq         ]
>> +    mulps      m1, m4, [srcq+2*lenq+1*mmsize]
>> +    mulps      m2, m4, [srcq+2*lenq+2*mmsize]
>> +    mulps      m3, m4, [srcq+2*lenq+3*mmsize]
>> +    cvtps2dq   m0, m0
>> +    cvtps2dq   m1, m1
>> +    cvtps2dq   m2, m2
>> +    cvtps2dq   m3, m3
>
> Is this (load+mul in same instruction) actually faster than load x4,
> followed by mul x4? The load latency may make this slower, even though
> it's less instructions.

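For concreteness, here is a rough sketch of the two forms being compared,
shown for the first register only. Variant B is my guess at what "load x4,
followed by mul x4" would look like; it is not taken from the patch, and
the use of movaps assumes the same alignment the memory-operand mulps
already requires without AVX.

```nasm
; Variant A (as in the patch): the load is folded into the multiply
    mulps      m0, m4, [srcq+2*lenq]   ; m0 = m4 * src[...]
    cvtps2dq   m0, m0                  ; float -> int32

; Variant B (assumed alternative): explicit load, then multiply
    movaps     m0, [srcq+2*lenq]       ; m0 = src[...]
    mulps      m0, m4                  ; m0 = m0 * m4
    cvtps2dq   m0, m0                  ; float -> int32
```

Both compute the same result; the difference is one extra instruction
per register for the decoder in variant B.
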
Fewer instructions can't be worse. At worst, they're broken up into uops
internally -- it's still less instruction decode bandwidth.

Jason
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
