2012/12/22 Justin Ruggles <[email protected]>:
> Also, I only see 3 gp registers being used, not 4. Looks like r1 is unused?
Yes, I forgot to remove it; it was initially used as a loop counter.

>> +    add        zq, 4
>> +.loop:
>> +    movu       m0, [r2q]
>> +    movu       m1, [zq ]
[...]
>> +    mova       [r3q +  0], m0
>> +    mova       [r3q + 16], m2
>
> How can z be unaligned but r3 is aligned?

Because I add 4 to z, and because I'm used to my CPU having weird issues with more complex addressing. So adding 4 to z and then using z is, on average, better than using z+4. The same goes for using a loop counter.

However, I'll postpone that patch until I can get it to run as fast as the IEEE754 version... Yes, I'm not kidding: even after further unrolling of the SSE2 version and fixing that IEEE754 function, the latter is faster on Win64/Arrandale.

-- 
Christophe
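To illustrate the alignment point, here is a minimal C sketch, not the patch itself. It assumes z and the r3 destination are both 16-byte aligned on entry (the thread implies this but does not state it), and the buffer names and sizes are made up for the example. Bumping the pointer by 4 bytes, as "add zq, 4" does, leaves it off a 16-byte boundary, so loads through it need movu, while r3 + 0 and r3 + 16 stay aligned and can use mova.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary, which aligned SSE moves require. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Alignment of z after the "add zq, 4" step.
 * Assumption: z is 16-byte aligned on entry (hypothetical buffer). */
static int z_plus_4_is_aligned(void)
{
    static alignas(16) float z[8];
    return is_aligned16((const unsigned char *)z + 4);
}

/* Alignment of the store addresses [r3 + 0] and [r3 + 16].
 * Assumption: the destination is 16-byte aligned (hypothetical buffer). */
static int r3_stores_are_aligned(void)
{
    static alignas(16) float dst[8];
    const unsigned char *r3 = (const unsigned char *)dst;
    return is_aligned16(r3 + 0) && is_aligned16(r3 + 16);
}

/* Usage: z + 4 needs unaligned loads, the r3 stores do not. */
static void self_check(void)
{
    assert(!z_plus_4_is_aligned());
    assert(r3_stores_are_aligned());
}
```

The same reasoning applies to any byte offset that is not a multiple of 16.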
