On Fri, 18 Jan 2013, Vitor Sessak wrote:
> On Wed, Jan 16, 2013 at 1:58 AM, Ronald S. Bultje <rsbul...@gmail.com> wrote:
>
>> +INIT_XMM sse
>> +cglobal vorbis_inverse_coupling, 3, 3, 6, mag, ang, block_size
>> +    movsxdifnidn    block_sizeq, block_sized
>> +    mova                     m5, [pdw_80000000]
>> +    lea                    magq, [magq+block_sizeq*4]
>> +    lea                    angq, [angq+block_sizeq*4]
>> +    neg             block_sizeq
>> +.loop:
>> +    mova                     m0, [magq+block_sizeq*4]
>> +    mova                     m1, [angq+block_sizeq*4]
>> +    xorps                    m2, m2
>> +    xorps                    m3, m3
>> +    cmpleps                  m2, m0     ; m <= 0.0
>> +    cmpleps                  m3, m1     ; a <= 0.0
>> +    andps                    m2, m5     ; keep only the sign bit
>
> Am I missing something, or can we just do:
>
> andps m2, m0, m5
>
> Instead of the xorps + cmpleps + andps?

.loop:
    mova     m0, [magq+block_sizeq*4]
    mova     m1, [angq+block_sizeq*4]
    xorps    m4, m4
    andps    m2, m5, m0 ; sign(m)
    cmpnleps m4, m1     ; sign(a)
    xorps    m1, m2
    andps    m3, m4, m1
    andnps   m4, m1
    addps    m3, m0     ; m = m + ((a < 0) & (a ^ sign(m)))
    subps    m0, m4     ; a = m - ((a > 0) & (a ^ sign(m)))
    mova   [magq+block_sizeq*4], m3
    mova   [angq+block_sizeq*4], m0
    add    block_sizeq, 4
    jl .loop

(Any change to the comments is intentional; the previous comment was
wrong.)
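
For anyone who wants to check the mask logic without assembling anything,
here is a rough scalar model of one lane of the loop above, in C. It is only
a sketch: couple_ref() is what I assume the scalar semantics to be (the usual
Vorbis inverse-coupling step), and the names f2u, u2f, couple_ref and
couple_lane are made up for illustration. Inputs of exactly 0.0 and NaN are
not examined here.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back (hypothetical helpers). */
static uint32_t f2u(float f)    { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    u2f(uint32_t u) { float f;    memcpy(&f, &u, sizeof f); return f; }

/* Assumed scalar reference: the usual Vorbis inverse-coupling step. */
static void couple_ref(float *mag, float *ang)
{
    float m = *mag, a = *ang;
    if (m > 0.0f) {
        if (a > 0.0f) { *ang = m - a; }
        else          { *ang = m; *mag = m + a; }
    } else {
        if (a > 0.0f) { *ang = m + a; }
        else          { *ang = m; *mag = m - a; }
    }
}

/* One lane of the SIMD loop, with each instruction modelled on integer masks. */
static void couple_lane(float *mag, float *ang)
{
    float m = *mag, a = *ang;
    uint32_t sign_m = f2u(m) & 0x80000000u;             /* andps    m2, m5, m0 */
    uint32_t a_neg  = !(0.0f <= a) ? 0xFFFFFFFFu : 0u;  /* cmpnleps m4, m1     */
    uint32_t ax     = f2u(a) ^ sign_m;                  /* xorps    m1, m2     */
    float    add    = u2f(a_neg & ax);                  /* andps    m3, m4, m1 */
    float    sub    = u2f(~a_neg & ax);                 /* andnps   m4, m1     */
    *mag = m + add;                                     /* addps    m3, m0     */
    *ang = m - sub;                                     /* subps    m0, m4     */
}
```

Calling both functions on the four sign combinations of nonzero (m, a) should
give matching results. Note that the quoted cmpleps with a zeroed destination
computes 0 <= m, so its masked sign bit has the opposite polarity to the plain
andps; that is why the second mask and the two stores are swapped as well.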


Unrelated to the above change, I measure your unmodified yasm version as
20% slower than the inline asm version on Sandy Bridge. I don't know why;
I've verified that the same instructions are in the inner loop.

--Loren Merritt
