[Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code

2008-12-07 Thread rguenth at gcc dot gnu dot org


--- Comment #41 from rguenth at gcc dot gnu dot org  2008-12-07 13:00 ---
There's not much to be done for aliasing - everything points to global memory
and thus aliases.  There may be some opportunities for offset-based
disambiguations via pointers, but I didn't investigate in detail.  Whoever
wants someone to work on specific details needs to provide way shorter
testcases ;)
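
As a rough illustration of the aliasing point above (not taken from the
testcase; the function names are made up for this sketch): when two pointers
of the same type may refer to any global memory, a store through one forces a
reload through the other.  Two accesses off the same base pointer at different
constant offsets cannot overlap, which is the kind of fact an offset-based
disambiguation could exploit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: p and q have the same type and may point anywhere,
   so the compiler has to assume the store through q can change *p. */
long may_alias(long *p, long *q)
{
    assert(p != NULL && q != NULL);
    long a = *p;
    *q = a + 1;          /* q may be equal to p */
    return a + *p;       /* *p has to be loaded again */
}

/* Hypothetical example: base[0] and base[1] share a base pointer and differ
   by a constant offset, so the store cannot touch base[0]. */
long offset_disjoint(long *base)
{
    assert(base != NULL);
    long a = base[0];
    base[1] = a + 1;     /* never overlaps base[0] */
    return a + base[0];  /* the second load is provably redundant */
}
```

Whether a given GCC version actually removes the second load in
offset_disjoint is not something this sketch claims; it only shows the
distinction being discussed.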


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928




2008-12-06 Thread lucier at math dot purdue dot edu


--- Comment #39 from lucier at math dot purdue dot edu  2008-12-06 16:37 ---
I may have narrowed down the problem a bit.

With this compiler (revision 118491):

pythagoras-277% /tmp/lucier/install/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release
--prefix=/tmp/lucier/install --enable-languages=c
Thread model: posix
gcc version 4.3.0 20061105 (experimental)

one gets (on a faster machine than previous reports)

(time (direct-fft-recursive-4 a table))
133 ms real time
140 ms cpu time (140 user, 0 system)
no collections
64 bytes allocated
no minor faults
no major faults

With this compiler (revision 118474):

pythagoras-24% /tmp/lucier/install/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release
--prefix=/tmp/lucier/install --enable-languages=c
Thread model: posix
gcc version 4.3.0 20061104 (experimental)

one gets

(time (direct-fft-recursive-4 a table))
116 ms real time
108 ms cpu time (108 user, 0 system)
no collections
64 bytes allocated
no minor faults
no major faults

and you see the typical problem with assembly code from direct.i with the later
compiler.

Paolo may have been right about fwprop; this patch was installed that day:

Author: bonzini
Date: Sat Nov  4 08:36:45 2006
New Revision: 118475

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=118475
Log:
2006-11-03  Paolo Bonzini  [EMAIL PROTECTED]
Steven Bosscher  [EMAIL PROTECTED]

* fwprop.c: New file.
* Makefile.in: Add fwprop.o.
* tree-pass.h (pass_rtl_fwprop, pass_rtl_fwprop_with_addr): New.
* passes.c (init_optimization_passes): Schedule forward propagation.
* rtlanal.c (loc_mentioned_in_p): Support NULL value of the second
parameter.
* timevar.def (TV_FWPROP): New.
* common.opt (-fforward-propagate): New.
* opts.c (decode_options): Enable forward propagation at -O2.
* gcse.c (one_cprop_pass): Do not run local cprop unless touching
jumps.
* cse.c (fold_rtx_subreg, fold_rtx_mem, fold_rtx_mem_1, find_best_addr,
canon_for_address, table_size): Remove.
(new_basic_block, insert, remove_from_table): Remove references to
table_size.
(fold_rtx): Process SUBREGs and MEMs with equiv_constant, make
simplification loop more straightforward by not calling fold_rtx
recursively.
(equiv_constant): Move here a small part of fold_rtx_subreg,
do not call fold_rtx.  Call avoid_constant_pool_reference
to process MEMs.
* recog.c (canonicalize_change_group): New.
* recog.h (canonicalize_change_group): New.

* doc/invoke.texi (Optimization Options): Document fwprop.
* doc/passes.texi (RTL passes): Document fwprop.


Added:
trunk/gcc/fwprop.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/Makefile.in
trunk/gcc/common.opt
trunk/gcc/cse.c
trunk/gcc/doc/invoke.texi
trunk/gcc/doc/passes.texi
trunk/gcc/gcse.c
trunk/gcc/opts.c
trunk/gcc/passes.c
trunk/gcc/recog.c
trunk/gcc/recog.h
trunk/gcc/rtlanal.c
trunk/gcc/timevar.def
trunk/gcc/tree-pass.h



2008-12-06 Thread bonzini at gnu dot org


--- Comment #40 from bonzini at gnu dot org  2008-12-07 02:55 ---
IIUC this is a typical case in which CSE was fixing something that earlier
passes messed up.  Unfortunately fwprop does (better) what CSE was meant to do,
but does not do what I assumed was already done before CSE.

If the problem is aliasing/FRE, then I think Richi is the one who could fix it
for good in the tree passes.  If there is more to it, however, I can take a
look at why fwprop is generating the ugly code.



2008-09-04 Thread lucier at math dot purdue dot edu


--- Comment #36 from lucier at math dot purdue dot edu  2008-09-04 20:39 ---
I don't really understand the status of this bug.

Before 4.3.0, it was P1, and Mark said he'd like to see us start to explain
these kinds of dramatic performance changes.

There was quite a bit of detective work that ended with the finding that, for
some reason, gcc-4.3 transforms only _some_ instructions (line 708+ in
_.085t.fre dump).

Richard opined that it was an alias partitioning problem, but Uros noted that
for the original code, as opposed to the reduced testcase, expanding some
parameter to its maximum still doesn't fix the problem.

So (a) we don't know what the current code is doing wrong, and (b) we don't
know why 4.2 got it right.

So I don't think Mark got what he wanted, and now it's P2, and each release the
target release for fixing it gets pushed back.

I've been testing mainline on this bug sporadically, especially when an entry
in gcc-patches mentions some words that also appear on this PR, to see if it's
fixed.  I'm a bit concerned that the target of 4.3.* is becoming increasingly
out of reach, as changes committed to that branch seem to be more and more
conservative because it's a release branch.

I don't think the code for this bug is terribly atypical for machine-generated
code; it would be nice to be able to remove this performance regression. 
Unfortunately, I'm in no position to do so.



2008-09-04 Thread rguenth at gcc dot gnu dot org


--- Comment #37 from rguenth at gcc dot gnu dot org  2008-09-04 20:43 ---
We have to admit that this bug is unlikely to get fixed in the 4.3 series.
It still lacks proper analysis, as unfortunately that done on the shorter
testcase was not valid.  Analysis takes time, and honestly at this point I'd
rather spend time fixing wrong-code or ice-on-valid bugs.



2008-09-04 Thread lucier at math dot purdue dot edu


--- Comment #38 from lucier at math dot purdue dot edu  2008-09-04 20:49 ---
OK, but I was moved to write because Jakub's latest 4.4 status report requests:

"Please concentrate now on fixing bugs, especially the performance regressions."

and this is a definite 4.3/4.4 performance regression from 4.2.  (How many of
the P1 PRs are performance regressions?)



2008-08-27 Thread jsm28 at gcc dot gnu dot org


--- Comment #35 from jsm28 at gcc dot gnu dot org  2008-08-27 22:02 ---
4.3.2 is released, changing milestones to 4.3.3.


-- 

jsm28 at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.3.2                       |4.3.3



2008-07-09 Thread lucier at math dot purdue dot edu


--- Comment #34 from lucier at math dot purdue dot edu  2008-07-09 16:05 ---
Problem still exists with

euler-18% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release
--with-gmp=/pkgs/gmp-4.2.2/ --with-mpfr=/pkgs/gmp-4.2.2/
--prefix=/pkgs/gcc-mainline --enable-languages=c
--enable-gather-detailed-mem-stats
Thread model: posix
gcc version 4.4.0 20080708 (experimental) [trunk revision 137644] (GCC) 

Just checking whether recent changes happened to fix it.



2008-05-30 Thread lucier at math dot purdue dot edu


--- Comment #32 from lucier at math dot purdue dot edu  2008-05-30 16:01 ---
I've decided to test the current ira branch with this problem.  I used the
build instructions in comment 24.

With -fno-ira I get the same results as with 4.3.0 (no surprise there).

With -fira I get the time

(time (direct-fft-recursive-4 a table))
422 ms real time
421 ms cpu time (421 user, 0 system)
no collections
64 bytes allocated
no minor faults
no major faults

which is an improvement, and the code at the beginning of the loop is

.L7262:
        movq    %rdx, %rcx
        addq    (%rsi), %rcx
        leaq    4(%rdx), %r15
        movq    %rcx, (%rbx)
        addq    $4, %rcx
        movq    %rcx, (%rbp)
        movq    (%rbx), %rcx
        addq    (%rsi), %rcx
        movq    %rcx, (%rdi)
        addq    $4, %rcx
        movq    %rcx, (%r8)
        movq    (%rdi), %rcx
        addq    (%rsi), %rcx
        leaq    4(%rcx), %r10
        movq    %rcx, (%r9)
        movq    %r10, (%r13)
        movq    (%rax), %rcx
        addq    $7, %rcx
        movsd   (%rcx,%r10,2), %xmm4
        movq    (%r9), %r10
        leaq    (%rcx,%rdx,2), %r11
        addq    $8, %rdx
        movsd   (%r11), %xmm11
        movsd   (%rcx,%r10,2), %xmm5
        movq    (%r8), %r10
        movsd   (%rcx,%r10,2), %xmm6
        movq    (%rdi), %r10
        movsd   (%rcx,%r10,2), %xmm7
        movq    (%rbp), %r10
        movsd   (%rcx,%r10,2), %xmm8
        movq    (%rbx), %r10
        movapd  %xmm8, %xmm14
        movsd   (%rcx,%r10,2), %xmm9
        leaq    (%r15,%r15), %r10
        movsd   (%rcx,%r10), %xmm10
        movq    (%r12), %rcx
        movapd  %xmm9, %xmm15
        movsd   15(%rcx), %xmm1
        movsd   7(%rcx), %xmm2
        movapd  %xmm1, %xmm13
        movsd   31(%rcx), %xmm3
        movapd  %xmm2, %xmm12

which is also an improvement, but it still is nowhere near the result for
4.2.2.

So, whatever is causing this problem, it appears the new register allocator
isn't going to fix it.

The code generated by today's mainline (136210) isn't better than 4.3.0; the
time is

(time (direct-fft-recursive-4 a table))
469 ms real time
469 ms cpu time (469 user, 0 system)
no collections
64 bytes allocated
no minor faults
no major faults

and the code is essentially the same as for 4.3.0.
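
For a sense of scale, the two timings in this comment can be compared with a
small helper.  This is only a sketch: percent_faster is a hypothetical name,
and the inputs are the real-time figures quoted above (469 ms for mainline,
422 ms with -fira).

```c
#include <assert.h>

/* Hypothetical helper: by what percentage is candidate_ms shorter than
   baseline_ms?  A negative result means the candidate is slower. */
double percent_faster(double baseline_ms, double candidate_ms)
{
    assert(baseline_ms > 0.0);
    return 100.0 * (baseline_ms - candidate_ms) / baseline_ms;
}
```

Calling percent_faster(469.0, 422.0) gives roughly 10, i.e. the -fira build is
about 10% faster than mainline on this run.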



2008-03-14 Thread rguenth at gcc dot gnu dot org


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to fail|                            |4.3.0
   Target Milestone|4.3.0                       |4.3.1

