2012/12/22 Justin Ruggles <[email protected]>:
> Also, I only see 3 gp registers being used, not 4. Looks like r1 is unused?

Yes, I forgot to remove it; it was initially used as a loop counter.

>> +    add        zq, 4
>> +.loop:
>> +    movu       m0, [r2q]
>> +    movu       m1, [zq ]
[...]
>> +    mova  [r3q +  0], m0
>> +    mova  [r3q + 16], m2
>
> How can z be unaligned but r3 is aligned?

Because I add 4 to z, so it is no longer 16-byte aligned, while r3 is
left untouched. I do it that way because I'm used to my CPU having
weird issues with more complex addressing: adding 4 to z once and then
using [z] is on average better than using [z+4]. The same goes for
using a loop counter.
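
To illustrate the alignment point, here is a rough sketch in C rather
than in the patch's assembly. The names is_aligned16 and z_buf are
hypothetical and only for illustration, and the sketch assumes that z
starts out 16-byte aligned:

```c
#include <stdalign.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of 16 bytes, which is what an
 * aligned (mova-style) SSE access requires; 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical stand-in for the z buffer, 16-byte aligned at its base. */
static alignas(16) float z_buf[8];

/* Returns the address obtained by advancing p by the given number of
 * bytes, mirroring "add zq, 4" from the quoted code. */
static const void *advance_bytes(const void *p, int bytes)
{
    return (const char *)p + bytes;
}
```

With these helpers, is_aligned16(z_buf) holds, while
is_aligned16(advance_bytes(z_buf, 4)) does not, which is why the loads
from z have to be movu even though the stores to r3 can stay mova.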

However, I'll postpone that patch until I can get it to run as fast as
the IEEE754 version... Yes, I'm not kidding: even after further
unrolling the SSE2 version and fixing that IEEE754 function, the latter
is faster on Win64/Arrandale.

-- 
Christophe
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
