fancy x87 ops, SSE and -mfpmath=sse,387 performance

tbp Sat, 05 Aug 2006 23:53:01 -0700

Basically I'd like to have my cake and eat it too.

With g++-4.2-20060805/cygwin on a K8 box, on a code path with
lots of single-precision float ops but no transcendentals or library calls:
-mfpmath=sse,387: 5.2 Mray/s
-mfpmath=sse: 6 Mray/s
That 15% performance difference is no surprise when you see values
bouncing between SSE and x87 registers through the stack, like
 4037c8:       flds   0x4(%esp)
 4037cc:       mulss  %xmm5,%xmm2
 4037d0:       fsubrp %st,%st(1)
 4037d2:       movss  %xmm1,0x4(%esp)
 4037d8:       addss  0x278(%esp,%ecx,4),%xmm0
 4037e1:       flds   0x4(%esp)
 4037e5:       fsubrp %st,%st(1)
 4037e7:       addss  %xmm2,%xmm0
 4037eb:       movss  %xmm0,0x4(%esp)
 4037f1:       flds   0x4(%esp)
 4037f5:       fdivrp %st,%st(1)
 4037f7:       fcomi  %st(1),%st
 4037f9:       fldz
 4037fb:       setae  %dl
 4037fe:       fcomip %st(1),%st
 403800:       seta   %al
 403803:       or     %al,%dl
 403805:       je     4036ca


Therefore -mfpmath=sse is the way to go, and it is in fact on par with or
better than what I get out of icc 9.1 for the same code.
Where it gets ugly is when you throw, for example, some cosf() into
the same compilation unit: with -mfpmath=sse you pay for some really,
really slow library function calls (at least on cygwin).
Wishful thinking got me trying -march=k8 -mfpmath=sse
-mfancy-math-387, to no avail :(
Is there a way to get inline x87 code for just those transcendentals,
while keeping everything else on SSE, in 32-bit environments?
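
For reference, one manual workaround for the cosf() case is to issue the x87
instruction yourself through GCC extended inline asm, while the rest of the
unit is still built with -mfpmath=sse. The sketch below is only that: the
helper name x87_cosf is made up for illustration, it is not a GCC builtin or
flag, and it assumes a GCC-compatible compiler on x86. Note that fcos only
accepts arguments with magnitude below 2^63, and every call still moves the
value between SSE and x87 through memory, so whether it beats the library
call has to be measured.

```cpp
#include <cmath>

// Hypothetical helper (not part of GCC or of the original code):
// compute cos(x) with the x87 fcos instruction.
static inline float x87_cosf(float x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    float r;
    // "=t" asks for the result in st(0), the top of the x87 stack;
    // "0" puts the input in that same register before fcos runs.
    __asm__("fcos" : "=t"(r) : "0"(x));
    return r;
#else
    // Fallback for other targets so the sketch stays portable.
    return std::cos(x);
#endif
}

// Example use on a path that otherwise stays on SSE scalar math.
static inline float scaled_cos(float a, float b)
{
    return a * x87_cosf(b);
}
```

Building with something like g++ -O2 -march=k8 -mfpmath=sse and looking at
the disassembly should show whether the multiply stays on SSE and only the
cosine goes through the x87 stack.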
