On Fri, 18 Jan 2013, Vitor Sessak wrote:
> On Wed, Jan 16, 2013 at 1:58 AM, Ronald S. Bultje <rsbul...@gmail.com> wrote:
>
>> +INIT_XMM sse
>> +cglobal vorbis_inverse_coupling, 3, 3, 6, mag, ang, block_size
>> +    movsxdifnidn block_sizeq, block_sized
>> +    mova      m5, [pdw_80000000]
>> +    lea       magq, [magq+block_sizeq*4]
>> +    lea       angq, [angq+block_sizeq*4]
>> +    neg       block_sizeq
>> +.loop:
>> +    mova      m0, [magq+block_sizeq*4]
>> +    mova      m1, [angq+block_sizeq*4]
>> +    xorps     m2, m2
>> +    xorps     m3, m3
>> +    cmpleps   m2, m0     ; m <= 0.0
>> +    cmpleps   m3, m1     ; a <= 0.0
>> +    andps     m2, m5     ; keep only the sign bit
>
> Am I missing something or we can just do:
>
> andps m2, m0, m5
>
> Instead of the xorps + cmpleps + andps?

.loop:
    mova      m0, [magq+block_sizeq*4]
    mova      m1, [angq+block_sizeq*4]
    xorps     m4, m4
    andps     m2, m5, m0     ; sign(m)
    cmpnleps  m4, m1         ; sign(a)
    xorps     m1, m2
    andps     m3, m4, m1
    andnps    m4, m1
    addps     m3, m0         ; m = m + ((a < 0) & (a ^ sign(m)))
    subps     m0, m4         ; a = m - ((a > 0) & (a ^ sign(m)))
    mova      [magq+block_sizeq*4], m3
    mova      [angq+block_sizeq*4], m0
    add       block_sizeq, 4
    jl .loop

(Any change to the comments is intentional; the previous comment was wrong.)

Unrelated to the above change, I measure your unmodified yasm version as 20% slower than the inline asm version on sandybridge. No idea why, and I've verified that the same instructions are in the inner loop.

--Loren Merritt
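
For anyone who wants to sanity-check the sign-bit trick in the reply's loop without assembling it, here is a minimal scalar sketch in C. It is not code from the patch: `couple_ref`, `couple_bits`, `coupling_matches`, `f2u` and `u2f` are hypothetical names made up for illustration. `couple_ref` is written from the inverse coupling rules in the Vorbis I specification, and `couple_bits` mirrors one float lane of the loop under the assumption that `m5` holds 0x80000000 in every lane. NaN inputs are ignored. A magnitude of exactly +0.0 is left out of the equivalence claim, since a sign-bit test and the reference's `> 0` comparison classify that value differently.

```c
#include <stdint.h>
#include <string.h>

/* Move bits between float and uint32_t; memcpy avoids aliasing problems. */
static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Branchy reference, following the Vorbis I inverse coupling rules. */
static void couple_ref(float *m, float *a)
{
    float M = *m, A = *a;
    if (M > 0.0f) {
        if (A > 0.0f) { *a = M - A; }
        else          { *a = M; *m = M + A; }
    } else {
        if (A > 0.0f) { *a = M + A; }
        else          { *a = M; *m = M - A; }
    }
}

/* One lane of the branchless loop (illustrative; NaN is not handled). */
static void couple_bits(float *m, float *a)
{
    float M = *m, A = *a;
    uint32_t sign_m = f2u(M) & 0x80000000u;           /* andps  m2, m5, m0 */
    /* cmpnleps m4, m1 with m4 = 0 gives !(0 <= a), i.e. a < 0 for non-NaN a */
    uint32_t neg_a  = (A < 0.0f) ? 0xFFFFFFFFu : 0u;
    uint32_t t      = f2u(A) ^ sign_m;                /* xorps  m1, m2 */
    *m = M + u2f(t &  neg_a);                         /* andps  + addps */
    *a = M - u2f(t & ~neg_a);                         /* andnps + subps */
}

/* Returns 1 when both versions give bit-identical mag and ang. */
static int coupling_matches(float m, float a)
{
    float rm = m, ra = a, bm = m, ba = a;
    couple_ref(&rm, &ra);
    couple_bits(&bm, &ba);
    return f2u(rm) == f2u(bm) && f2u(ra) == f2u(ba);
}
```

Calling `coupling_matches` over the four sign combinations of a nonzero magnitude and angle is enough to exercise every branch of the reference.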