On Fri, Feb 15, 2013 at 6:37 AM, "René J.V. Bertin" <[email protected]> wrote:
> On my 2.7GHz dual-core i7 MBP, I get about 10000Hz for the SSE version, and 
> roughly half that for the generic, scalar function, using gcc-4.2 as well as 
> using MSVC 2010 Express running under WinXP in VirtualBox. The factor 2 speed 
> gain for SSE code also applies on 2 AMD machines (mid-end laptop and C62 
> netbook).
>
> Then I installed a new mingw32 cross-compiler based on gcc 4.7 and for the 
> heck of it compiled my benchmark with it ... and found the same factor 2 ... but 
> in favour of the scalar code, on my i7. It's more like a factor 2.5, 
> actually. Same thing after installing the native OS X gcc 4.7 version.
>
> The question: is gcc-4.7 clever enough to do a better optimisation of the 2nd 
> benchmark loop than the 1st loop, or does it really generate so much better 
> assembly from the scalar function? NB, -fno-inline-functions has no effect 
> here.


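gcc 4.7 is clever enough to generate SSE code by itself (auto-vectorization),
so your "scalar" function may well end up compiled to SSE instructions. Maybe
that's what you're experiencing. Compiler flags matter too: in gcc 4.7 the
vectorizer is enabled at -O3 or with -ftree-vectorize.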
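
To make that concrete, here is a minimal sketch of the kind of loop gcc can
vectorize on its own. The function name scale_add and the y[i] += a * x[i]
kernel are hypothetical stand-ins, not your benchmark code:

```c
#include <stddef.h>

/* Hypothetical stand-in for a scalar benchmark kernel: y[i] += a * x[i].
   The restrict qualifiers promise gcc that x and y do not overlap,
   which removes one obstacle to auto-vectorization. */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Compiling this with "gcc -std=c99 -O3 -S" and again with -O2 should let you
compare the two listings: packed instructions such as mulps/addps indicate a
vectorized loop, while mulss/addss indicate scalar code. On 32-bit targets
you may need to enable SSE explicitly (e.g. -msse2) before gcc will use it.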

Have you inspected the generated assembly code? gcc -S should show you
exactly how the two loops differ, and I find it a very informative
exercise whenever something goes hinky performance-wise. That's
especially true here, since you've used inline assembler for gcc, which
tends to inhibit many of its other optimizations. Why don't you try
gcc's vector primitives instead?
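
One reading of "vector primitives" is gcc's vector extensions (the SSE
intrinsics from <xmmintrin.h> would be another). A minimal sketch using the
extensions, assuming a hypothetical y[i] += a * x[i] kernel rather than your
actual function:

```c
#include <stddef.h>
#include <string.h>

/* gcc vector extension: a 16-byte vector holding four floats. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical kernel: y[i] += a * x[i], four elements per iteration.
   memcpy is used for loads and stores so the arrays need not be
   16-byte aligned; gcc normally turns these into plain vector moves. */
void scale_add_v4(float *y, const float *x, float a, size_t n)
{
    const v4sf va = { a, a, a, a };
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        v4sf vx, vy;
        memcpy(&vx, x + i, sizeof vx);
        memcpy(&vy, y + i, sizeof vy);
        vy += va * vx;              /* element-wise multiply and add */
        memcpy(y + i, &vy, sizeof vy);
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

Unlike inline assembler, this stays visible to the optimizer, so gcc can
still schedule, unroll and inline around it, and the same source builds for
non-SSE targets.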
_______________________________________________
Libav-user mailing list
[email protected]
http://ffmpeg.org/mailman/listinfo/libav-user
