Basically I'd like to have my cake and eat it too.

With g++-4.2-20060805/cygwin on a K8 box, on a code path with lots of single-precision float ops but no transcendentals or library calls:

  -mfpmath=sse,387: 5.2 Mray/s
  -mfpmath=sse:     6 Mray/s

That 15% performance difference is no surprise when you see things like this in the -mfpmath=sse,387 output, where values bounce between SSE and x87 registers through the stack:

  4037c8: flds   0x4(%esp)
  4037cc: mulss  %xmm5,%xmm2
  4037d0: fsubrp %st,%st(1)
  4037d2: movss  %xmm1,0x4(%esp)
  4037d8: addss  0x278(%esp,%ecx,4),%xmm0
  4037e1: flds   0x4(%esp)
  4037e5: fsubrp %st,%st(1)
  4037e7: addss  %xmm2,%xmm0
  4037eb: movss  %xmm0,0x4(%esp)
  4037f1: flds   0x4(%esp)
  4037f5: fdivrp %st,%st(1)
  4037f7: fcomi  %st(1),%st
  4037f9: fldz
  4037fb: setae  %dl
  4037fe: fcomip %st(1),%st
  403800: seta   %al
  403803: or     %al,%dl
  403805: je     4036ca
Therefore -mfpmath=sse is the way to go, and it is in fact on par with or better than what I get out of icc 9.1 for the same code.

Where it gets ugly is when, for example, you throw some cosf() into the same compilation unit: with -mfpmath=sse you pay for some really, really slow library function calls (at least on cygwin). Wishful thinking got me trying -march=k8 -mfpmath=sse -mfancy-math-387, to no avail :(

Is there a way to enable such exotic codegen (SSE for the arithmetic, inline x87 instructions for the transcendentals) in 32-bit environments?