[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-06 Thread ubizjak at gmail dot com


--- Comment #21 from ubizjak at gmail dot com  2007-04-06 07:37 ---
Strange things happen.

I have fully removed the gcc build directory and bootstrapped gcc from scratch. To
my surprise, the difference with and without -msse is now gone and the
optimized dumps are now the same. For reference, the compiler idents itself as gcc
version 4.3.0 20070406 (experimental).

Regarding this bug - SSE performance vs. x87 performance is clearly target
processor dependent. There is nothing gcc can do, and even without memory access,
SSE is slower than x87 on some targets (ref: Comment #5).

Let's close this bug as WONTFIX, as there is nothing to fix in gcc.


-- 

ubizjak at gmail dot com changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||WONTFIX


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread bonzini at gnu dot org


--- Comment #13 from bonzini at gnu dot org  2007-04-05 11:01 ---
So this is an unstable sorting.  Adding dnovillo.


-- 

bonzini at gnu dot org changed:

   What|Removed |Added

 CC||dnovillo at gcc dot gnu dot
   ||org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread ubizjak at gmail dot com


--- Comment #12 from ubizjak at gmail dot com  2007-04-05 11:00 ---
(In reply to comment #11)

> with -msse compile flag. Note different variable suffixes that create
> different sort order. This is (IMO) due to fact that -msse enables lots of
> additional __builtin functions (these can be seen in 001.tu dump).

I forgot to add that -ffast-math simply enables more builtins, and again a
different sort order is introduced.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread ubizjak at gmail dot com


--- Comment #11 from ubizjak at gmail dot com  2007-04-05 10:58 ---
(In reply to comment #10)
> I would look at the lreg output, which contains the results of regclass.

No, the difference is due to the ssa pass, which generates:

  # v1z_10 = PHI <v1z_13(2), v1z_32(3)>
  # v1y_9 = PHI <v1y_12(2), v1y_31(3)>
  # v1x_8 = PHI <v1x_11(2), v1x_30(3)>
  # i_7 = PHI <i_17(2), i_36(3)>
  # v3z_6 = PHI <v3z_18(D)(2), v3z_29(3)>
  # v3y_5 = PHI <v3y_19(D)(2), v3y_26(3)>
  # v3x_4 = PHI <v3x_20(D)(2), v3x_23(3)>
  # v2z_3 = PHI <v2z_16(2), v2z_35(3)>
  # v2y_2 = PHI <v2y_15(2), v2y_34(3)>
  # v2x_1 = PHI <v2x_14(2), v2x_33(3)>

without -msse and

  # v3z_10 = PHI <v3z_18(D)(2), v3z_29(3)>
  # v3y_9 = PHI <v3y_19(D)(2), v3y_26(3)>
  # v3x_8 = PHI <v3x_20(D)(2), v3x_23(3)>
  # v2z_7 = PHI <v2z_16(2), v2z_35(3)>
  # v2y_6 = PHI <v2y_15(2), v2y_34(3)>
  # v2x_5 = PHI <v2x_14(2), v2x_33(3)>
  # v1z_4 = PHI <v1z_13(2), v1z_32(3)>
  # v1y_3 = PHI <v1y_12(2), v1y_31(3)>
  # v1x_2 = PHI <v1x_11(2), v1x_30(3)>
  # i_1 = PHI <i_17(2), i_36(3)>

with the -msse compile flag. Note the different variable suffixes that create a
different sort order. This is (IMO) due to the fact that -msse enables lots of
additional __builtin functions (these can be seen in the 001.tu dump). Since we
don't have an x87 scheduler, the results become quite unpredictable and depend on
the -msseX settings. It just _happens_ that the second form better suits the
stack nature of x87.

So, why does the SSA pass have to interfere with the computation dataflow? This
interference makes things worse and effectively takes away the user's control
over the flow of data.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread dnovillo at gcc dot gnu dot org


--- Comment #14 from dnovillo at gcc dot gnu dot org  2007-04-05 12:49 ---
(In reply to comment #11)

> So, why does SSA pass have to interfere with computation dataflow? This
> interference makes things worse and effectively takes away user's control
> on the flow of data.

Huh?  How is it relevant whether PHIs are in different order?  Conceptually,
the ordering of PHI nodes in a basic block is completely irrelevant.  Some pass
is getting confused when it shouldn't.  Transformations should not depend on
how PHI nodes are emitted in a block as all PHI nodes are always evaluated in
parallel.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread bonzini at gnu dot org


--- Comment #15 from bonzini at gnu dot org  2007-04-05 13:03 ---
Transformations do not, but out-of-SSA could.  Is there a way to ensure
ordering of PHI functions unlike what Uros's dumps suggest?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread dnovillo at redhat dot com


--- Comment #16 from dnovillo at redhat dot com  2007-04-05 13:15 ---
Subject: Re: Floating point computation far slower for -mfpmath=sse

bonzini at gnu dot org wrote on 04/05/07 08:03:

> Is there a way to ensure ordering of PHI functions unlike what Uros's
> dumps suggest?

No.

I also don't see how PHI ordering would affect out-of-ssa.  It just
emits copies.  If the ordering of those copies is affecting things like
register pressure, then RA should be looked at.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread amacleod at redhat dot com


--- Comment #17 from amacleod at redhat dot com  2007-04-05 14:23 ---
Is the output from .optimized different?  (once the SSA version numbers have
been stripped).   Those PHIs should be irrelevant; the question is whether the
different versioning has any effect.

The only way I can think that out-of-ssa could produce different results is if
it had to choose between two same-cost coalesces, and the versioning resulted
in them being in different places in the coalesce list.  Check the .optimized
output and if the code is equivalent, the problem is after that stage.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread ubizjak at gmail dot com


--- Comment #18 from ubizjak at gmail dot com  2007-04-05 16:39 ---
(In reply to comment #17)
> Is the output from .optimized different?  (once the ssa versions numbers
> have been stripped).   Those PHIs should be irrelevant, the question is
> whether the different versioning has any effect.
> 
> The only way I can think that out-of-ssa could produce different results is
> if it had to choose between two same-cost coalesces, and the versioning
> resulted in them being in different places in the coalesce list.  Check the
> .optimized output and if the code is equivalent, the problem is after that
> stage.

They are _not_ equivalent. We have:

--cut here--
<bb 2>:
  __builtin_puts (&"Start?"[0]);
  v2x = 0.0;
  v2y = 1.0e+0;
  v2z = 0.0;
  i = 0;
  v1x = 1.0e+0;
  v1y = 0.0;
  v1z = 0.0;

<L0>:;
  v3x = v1y * v2z - v1z * v2y;
  v3y = v1z * v2x - v1x * v2z;
  v3z = v1x * v2y - v1y * v2x;
  i = i + 1;
  v1z = v2z;
  v1y = v2y;
  v1x = v2x;
  v2z = v3z;
  v2y = v3y;
  v2x = v3x;
  if (i != 1) goto <L0>; else goto <L2>;

<L2>:;
  __builtin_puts (&"Stop!"[0]);
  printf (&"Result = %f, %f, %f\n"[0], (double) v3x, (double) v3y, (double) v3z);
  return 0;

=VS=

--cut here--
<bb 2>:
  __builtin_puts (&"Start?"[0]);
  i = 0;
  v1x = 1.0e+0;
  v1y = 0.0;
  v1z = 0.0;
  v2x.43 = 0.0;
  v2y.44 = 1.0e+0;
  v2z.45 = 0.0;

<L0>:;
  v3x = v1y * v2z.45 - v1z * v2y.44;
  v3y = v1z * v2x.43 - v1x * v2z.45;
  v3z = v1x * v2y.44 - v1y * v2x.43;
  i = i + 1;
  v2z = v3z;
  v2y = v3y;
  v2x = v3x;
  v1z = v2z.45;
  v1y = v2y.44;
  v1x = v2x.43;
  if (i != 1) goto <L8>; else goto <L2>;

<L8>:;
  v2x.43 = v2x;
  v2y.44 = v2y;
  v2z.45 = v2z;
  goto <bb 3> (<L0>);

<L2>:;
  __builtin_puts (&"Stop!"[0]);
  printf (&"Result = %f, %f, %f\n"[0], (double) v3x, (double) v3y, (double) v3z);
  return 0;
--cut here--


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread amacleod at redhat dot com


--- Comment #19 from amacleod at redhat dot com  2007-04-05 17:24 ---
What are you using for a compiler? I'm using a mainline from mid-March, and with
it, my .optimized files diff exactly the same, and I get the aforementioned
time differences in the executables.
(sse.c and sse-bad.c are the same, just different names to get different output
files)

2007-03-13/gcc> diff sse.c sse-bad.c

2007-03-13/gcc> ./xgcc -B./ sse.c -fdump-tree-optimized -O3 -march=pentium4 -o
sse

2007-03-13/gcc> ./xgcc -B./ sse-bad.c -fdump-tree-optimized -O3 -march=pentium4
-mfpmath=sse -o sse-bad

2007-03-13/gcc> ls -l sse*optimized

-rw-rw-r--  1 amacleod amacleod 864 Apr  5 12:16 sse-bad.c.116t.optimized
-rw-rw-r--  1 amacleod amacleod 864 Apr  5 12:16 sse.c.116t.optimized

2007-03-13/gcc> diff sse.c.116t.optimized sse-bad.c.116t.optimized

2007-03-13/gcc> time ./sse

Start?
Stop!
Result = 0.00, 0.00, 1.00

real    0m0.630s
user    0m0.572s
sys     0m0.000s

2007-03-13/gcc> time ./sse-bad

Start?
Stop!
Result = 0.00, 0.00, 1.00

real    0m0.883s
user    0m0.780s
sys     0m0.000s


Is this just with earlier compilers? What version are you using?  It at least
seems to indicate that the problem isn't before out-of-ssa, since the time issue
is still there with identical outputs from .optimized.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-05 Thread ubizjak at gmail dot com


--- Comment #20 from ubizjak at gmail dot com  2007-04-05 19:39 ---
(In reply to comment #19)
> what are you using for a compiler? I'm using a mainline from mid-March, and
> with it, my .optimized files diff exactly the same, and I get the
> aforementioned time differences in the executables.

gcc version 4.3.0 20070404 (experimental) on i686-pc-linux-gnu

This is because -march=pentium4 enables all sse builtins for both cases.

> (sse.c and sse-bad.c are same, just different names to get different output
> files)
> 
> 2007-03-13/gcc> diff sse.c sse-bad.c
> 
> 2007-03-13/gcc> ./xgcc -B./ sse.c -fdump-tree-optimized -O3 -march=pentium4
> -o sse
> 
> 2007-03-13/gcc> ./xgcc -B./ sse-bad.c -fdump-tree-optimized -O3
> -march=pentium4 -mfpmath=sse -o sse-bad

This is a known effect of SFmode SSE being slower than SFmode x87. But again, you
have enabled sse(2) builtins due to -march=pentium4.

Please try to compile using only -O2 and -O2 -msse. x87 math will be used
in both cases, but .optimized will show the difference. You can also try to
compile with and without -ffast-math.

IMO it is not acceptable for tree dumps to depend on target compile flags in any
way...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-03 Thread bonzini at gnu dot org


--- Comment #8 from bonzini at gnu dot org  2007-04-03 12:43 ---
what's the generated code for -ffast-math? in principle i don't see a reason
why it should make any difference...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-03 Thread ubizjak at gmail dot com


--- Comment #9 from ubizjak at gmail dot com  2007-04-03 13:32 ---
(In reply to comment #8)
> what's the generated code for -ffast-math? in principle i don't see a reason
> why it should make any difference...

Trying to answer your question, I have played a bit with compile flags and
things are getting really strange:

[EMAIL PROTECTED] test]$ gcc -O2 -mfpmath=387 pr19780.c 
[EMAIL PROTECTED] test]$ time ./a.out
Start?
Stop!
Result = 0.00, 0.00, 1.00

real    0m1.211s
user    0m1.212s
sys     0m0.004s
[EMAIL PROTECTED] test]$ gcc -O2 -mfpmath=387 -msse pr19780.c 
[EMAIL PROTECTED] test]$ time ./a.out
Start?
Stop!
Result = 0.00, 0.00, 1.00

real    0m0.555s
user    0m0.552s
sys     0m0.004s

Note that -msse should have no effect on the calculations. The difference between
the asm dumps is:

--- pr19780.s   2007-04-03 14:28:14.0 +0200
+++ pr19780.s_  2007-04-03 14:28:01.0 +0200
@@ -17,69 +17,61 @@
pushl   %ebp
movl    %esp, %ebp
pushl   %ecx
-   subl    $84, %esp
+   subl    $100, %esp
movl    $.LC0, (%esp)
call    puts
xorl    %eax, %eax
-   fldz
fld1
fsts    -16(%ebp)
+   fldz
+   fsts    -12(%ebp)
+   fld %st(0)
fld %st(1)
-   fld %st(2)
-   fld %st(3)
jmp .L2
.p2align 4,,7
 .L7:
-   fstp    %st(5)
-   fstp    %st(0)
-   fxch    %st(1)
-   fxch    %st(2)
-   fxch    %st(3)
-   fxch    %st(4)
fxch    %st(3)
+   fxch    %st(2)
 .L2:
-   fld %st(1)
+   fld %st(2)
addl    $1, %eax
-   fmul    %st(3), %st
+   fmul    %st(1), %st
cmpl    $1, %eax
-   fstps   -12(%ebp)
+   flds    -12(%ebp)
+   fmul    %st(5), %st
+   fsubrp  %st, %st(1)
+   flds    -12(%ebp)
+   fmul    %st(3), %st
flds    -16(%ebp)
-   fmul    %st(1), %st
-   fsubrs  -12(%ebp)
-   fstps   -12(%ebp)
-   fmul    %st(4), %st
-   fld %st(3)
fmul    %st(3), %st
fsubrp  %st, %st(1)
flds    -16(%ebp)
-   fmulp   %st, %st(4)
-   fxch    %st(1)
+   fmul    %st(6), %st
+   fxch    %st(5)
fmul    %st(4), %st
-   fsubrp  %st, %st(3)
-   flds    -16(%ebp)
-   fld %st(3)
+   fsubrp  %st, %st(5)
fxch    %st(2)
-   fsts    -16(%ebp)
-   flds    -12(%ebp)
+   fstps   -12(%ebp)
+   fxch    %st(2)
+   fstps   -16(%ebp)
jne .L7
-   fstp    %st(0)
-   fstp    %st(5)
-   fstp    %st(0)
-   fstp    %st(0)
-   fstp    %st(0)
+   fstp    %st(3)
+   fxch    %st(1)
movl    $.LC3, (%esp)
fstps   -40(%ebp)
+   fxch    %st(1)
fstps   -56(%ebp)
+   fstps   -72(%ebp)
call    puts
flds    -40(%ebp)
fstpl   20(%esp)
flds    -56(%ebp)
fstpl   12(%esp)
-   flds    -12(%ebp)
+   flds    -72(%ebp)
fstpl   4(%esp)
movl    $.LC4, (%esp)
call    printf
-   addl    $84, %esp
+   addl    $100, %esp
xorl    %eax, %eax
popl    %ecx
popl    %ebp

where (+++) is with -msse.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2007-04-03 Thread bonzini at gnu dot org


--- Comment #10 from bonzini at gnu dot org  2007-04-03 13:36 ---
I would look at the lreg output, which contains the results of regclass.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2006-10-25 Thread uros at kss-loka dot si


--- Comment #6 from uros at kss-loka dot si  2006-10-25 12:04 ---
(In reply to comment #5)
> With more registers (x86_64) the stack moves are gone, but: (!)
> 
> (testing done on AMD Athlon fam 15 model 35 stepping 2)

On Xeon 3.6, SSE is now faster:

gcc -O2 -march=pentium4 -mfpmath=387 pr19780.c 
time ./a.out
Start?
Stop!
Result = 0.00, 0.00, 1.00

real    0m0.805s
user    0m0.804s
sys     0m0.000s

gcc -O2 -march=pentium4 -mfpmath=sse pr19780.c 
time ./a.out
Start?
Stop!
Result = 0.00, 0.00, 1.00

real    0m0.707s
user    0m0.704s
sys     0m0.004s

vendor_id   : GenuineIntel
cpu family  : 15
model   : 4
model name  : Intel(R) Xeon(TM) CPU 3.60GHz
stepping: 10
cpu MHz : 3600.970
cache size  : 2048 KB

The question is now, why is Athlon so slow with SFmode SSE?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2006-10-25 Thread uros at kss-loka dot si


--- Comment #7 from uros at kss-loka dot si  2006-10-25 12:18 ---
(In reply to comment #6)

> On Xeon 3.6, SSE is now faster:

... but for -ffast-math:

SSE: user    0m0.756s
x87: user    0m0.612s

Yes, x87 is faster for -ffast-math by some 20%.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2006-10-24 Thread rguenth at gcc dot gnu dot org


--- Comment #5 from rguenth at gcc dot gnu dot org  2006-10-24 13:28 ---
With more registers (x86_64) the stack moves are gone, but: (!)

[EMAIL PROTECTED]:/abuild/rguenther/trunk-g/gcc ./xgcc -B. -O2 -o t t.c
-mfpmath=387
[EMAIL PROTECTED]:/abuild/rguenther/trunk-g/gcc /usr/bin/time ./t
Start?
Stop!
Result = 0.00, 0.00, 1.00
5.31user 0.00system 0:05.32elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
[EMAIL PROTECTED]:/abuild/rguenther/trunk-g/gcc ./xgcc -B. -O2 -o t t.c
[EMAIL PROTECTED]:/abuild/rguenther/trunk-g/gcc /usr/bin/time ./t
Start?
Stop!
Result = 0.00, 0.00, 1.00
9.96user 0.05system 0:10.06elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps

that is almost twice as fast with 387 math as with SSE math on x86_64!

The inner loop is

.L7:
        movaps  %xmm3, %xmm6
        movaps  %xmm1, %xmm5
        movaps  %xmm0, %xmm4
.L2:
        movaps  %xmm2, %xmm3
        mulss   %xmm6, %xmm2
        movaps  %xmm7, %xmm0
        addl    $1, %eax
        mulss   %xmm4, %xmm3
        movaps  %xmm7, %xmm1
        mulss   %xmm5, %xmm0
        cmpl    $10, %eax
        mulss   %xmm6, %xmm1
        movaps  %xmm4, %xmm7
        subss   %xmm0, %xmm3
        movaps  %xmm8, %xmm0
        mulss   %xmm4, %xmm0
        subss   %xmm0, %xmm1
        movaps  %xmm8, %xmm0
        movaps  %xmm6, %xmm8
        mulss   %xmm5, %xmm0
        subss   %xmm2, %xmm0
        movaps  %xmm5, %xmm2
        jne     .L7

vs.

.L7:
        fxch    %st(3)
        fxch    %st(2)
.L2:
        fld     %st(2)
        addl    $1, %eax
        cmpl    $10, %eax
        fmul    %st(1), %st
        flds    76(%rsp)
        fmul    %st(5), %st
        fsubrp  %st, %st(1)
        flds    76(%rsp)
        fmul    %st(3), %st
        flds    72(%rsp)
        fmul    %st(3), %st
        fsubrp  %st, %st(1)
        flds    72(%rsp)
        fmul    %st(6), %st
        fxch    %st(5)
        fmul    %st(4), %st
        fsubrp  %st, %st(5)
        fxch    %st(2)
        fstps   76(%rsp)
        fxch    %st(2)
        fstps   72(%rsp)
        jne     .L7

(testing done on AMD Athlon fam 15 model 35 stepping 2)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2006-08-11 Thread bonzini at gnu dot org


--- Comment #4 from bonzini at gnu dot org  2006-08-11 10:22 ---
Except that PPC uses 12 registers: f0 f6 f7 f8 f9 f10 f11 f12 f13 f29 f30 f31. 
Not that we can blame GCC for using 12, but it is not a fair comparison. :-)

In fact, 8 registers are enough, but it is quite tricky to obtain them.
The problem is that v3[xyz] is live across multiple BBs, making the task of
the register allocator quite a bit harder.  Even if we change v3[xyz] in the
printf to v2[xyz], cfg-cleanup (between vrp1 and dce2) replaces it and, in doing
so, extends the lifetime of v3[xyz].

(Since it's all about having short lifetimes, CCing [EMAIL PROTECTED])

BTW, here is the optimal code (if it works...):

ENTER basic block: v1[xyz], v2[xyz] are live (6 registers)

  v3x = v1y * v2z - v1z * v2y;

v3x is now live, and it takes 2 registers to compute this statement.  Here we
hit a maximum of 8 live registers.  After the statement 7 registers are live.

  v3y = v1z * v2x - v1x * v2z;

v1z dies here, so we need only one additional register for this statement.  We
also hit a maximum of 8 live registers.  At the end of the statement, 7
registers are also live (7 - 1 v1z that dies + 1 for v3y)

  v3z = v1x * v2y - v1y * v2x;

Likewise, v1x and v1y die, so we need 7 registers and, at the end of the
statement, 6 registers are also live.

Optimal code would be like this (%xmm0..2 = v1[xyz], %xmm3..5 = v2[xyz])

v3x = v1y * v2z - v1z * v2y
  movss %xmm1, %xmm6
  mulss %xmm5, %xmm6 ;; v1y * v2z in %xmm6
  movss %xmm2, %xmm7
  mulss %xmm4, %xmm7 ;; v1z * v2y in %xmm7
  subss %xmm7, %xmm6 ;; v3x in %xmm6

v3y = v1z * v2x - v1x * v2z
  mulss %xmm3, %xmm2 ;; v1z dies, v1z * v2x in %xmm2
  movss %xmm0, %xmm7
  mulss %xmm5, %xmm7 ;; v1x * v2z in %xmm7
  subss %xmm7, %xmm2 ;; v3y in %xmm2

v3z = v1x * v2y - v1y * v2x
  mulss %xmm4, %xmm0 ;; v1x dies, v1x * v2y in %xmm0
  mulss %xmm3, %xmm1 ;; v1y dies, v1y * v2x in %xmm1
  subss %xmm1, %xmm0 ;; v3z in %xmm0

Note now how we should reorder the final moves to obtain optimal code!

  movss %xmm0, %xmm7 ;; save v3z... alternatively, do it before the subss

  movss %xmm3, %xmm0 ;; v1x = v2x
  movss %xmm6, %xmm3 ;; v2x = v3x (in %xmm6)
  movss %xmm4, %xmm1 ;; v1y = v2y
  movss %xmm2, %xmm4 ;; v2y = v3y (in %xmm2)
  movss %xmm5, %xmm2 ;; v1z = v2z
  movss %xmm7, %xmm5 ;; v2z = v3z (saved in %xmm7)

(Note that doing the reordering manually does not help...) :-(  Out of
curiosity, can somebody check out yara-branch to see how it fares?


---

By comparison, the x87 is relatively easier, because there are never more than
8 registers and fxch makes it much easier to write the compensation code:

v3x = v1y * v2z - v1z * v2y
;; v1x v1y v1z v2x v2y v2z
   fld %st(1)   ;; v1y v1x v1y v1z v2x v2y v2z
   fmul %st(6), %st(0)  ;; v1y*v2z v1x v1y v1z v2x v2y v2z
   fld %st(3)   ;; v1z v1y*v2z v1x v1y v1z v2x v2y v2z
   fmul %st(6), %st(0)  ;; v1z*v2y v1y*v2z v1x v1y v1z v2x v2y v2z
   fsubp %st(0), %st(1) ;; v3x v1x v1y v1z v2x v2y v2z

v3y = v1z * v2x - v1x * v2z
   fld %st(4)   ;; v2x v3x v1x v1y v1z v2x v2y v2z
   fmulp %st(0), %st(4) ;; v3x v1x v1y v1z*v2x v2x v2y v2z
   fld %st(1)   ;; v1x v3x v1x v1y v1z*v2x v2x v2y v2z
   fmul %st(7), %st(0)  ;; v1x*v2z v3x v1x v1y v1z*v2x v2x v2y v2z
   fsubp %st(0), %st(4) ;; v3x v1x v1y v3y v2x v2y v2z

v3z = v1x * v2y - v1y * v2x
   fld %st(5)   ;; v2y v3x v1x v1y v3y v2x v2y v2z
   fmulp %st(0), %st(2) ;; v3x v1x*v2y v1y v3y v2x v2y v2z
   fld %st(4)   ;; v2x v3x v1x*v2y v1y v3y v2x v2y v2z
   fmul %st(3), %st(0)  ;; v1y*v2x v3x v1x*v2y v1y v3y v2x v2y v2z
   fsubp %st(0), %st(2) ;; v3x v3z v1y v3y v2x v2y v2z
   fstp %st(2)  ;; v3z v3x v3y v2x v2y v2z

   fxch %st(5)  ;; v2z v3x v3y v2x v2y v3z
   fxch %st(2)  ;; v3y v3x v2z v2x v2y v3z
   fxch %st(4)  ;; v2y v3x v2z v2x v3y v3z
   fxch %st(1)  ;; v3x v2y v2z v2x v3y v3z
   fxch %st(0)  ;; v2x v2y v2z v3x v3y v3z

(well, the fxch should be scheduled, but still it is possible to do it without
spilling).

Paolo


-- 

bonzini at gnu dot org changed:

   What|Removed |Added

 CC||amacleod at redhat dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780



[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2005-09-28 Thread pinskia at gcc dot gnu dot org

--- Additional Comments From pinskia at gcc dot gnu dot org  2005-09-29 04:06 ---
Oh, and this looks very related to the two-operand instruction issue.
PPC gives optimal code:
L2:
fmul f0,f6,f9
fmul f13,f7,f10
fmul f12,f8,f11
fmsub f29,f8,f10,f0
fmsub f30,f6,f11,f13
fmsub f31,f7,f9,f12
fmr f6,f10
fmr f7,f11
fmr f8,f9
fmr f10,f31
fmr f11,f29
fmr f9,f30
bdnz L2


-- 
   What|Removed |Added

   Severity|normal  |enhancement


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780


[Bug rtl-optimization/19780] Floating point computation far slower for -mfpmath=sse

2005-09-28 Thread pinskia at gcc dot gnu dot org

--- Additional Comments From pinskia at gcc dot gnu dot org  2005-09-29 04:05 ---
Confirmed.  This is weird, and it is an RA issue.  I don't understand why the
RA is spilling to the stack, as there are enough SSE registers to hold the 6
values.

-- 
   What|Removed |Added

 Status|UNCONFIRMED |NEW
  Component|target  |rtl-optimization
 Ever Confirmed||1
 GCC target triplet|i686-pc-linux-gnu   |i686-*-*
   Keywords||ra
   Last reconfirmed|-00-00 00:00:00 |2005-09-29 04:05:34
   date||


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780