Re: Reduce complete unrolling/peeling limits

On Tue, 4 Dec 2012, Jan Hubicka wrote:
>>> here is the updated patch.  It should make the bounds safe enough not
>>> to affect codegen of complete unrolling.  There is IMO no way to cut
>>> the walk over the loop body w/o affecting codegen in the
>>> unroll-for-size mode.  The condition for unrolling to happen is
>>> unrolled_size * 2 / 3 < original_size.  The patch makes the function
>>> walking the body stop once the minimal number of duplicated insns
>>> exceeds the limit (PARAM_MAX_COMPLETELY_PEELED_INSNS).  The formula
>>> above allows unlimited duplication when the loop body is large enough.
>>> This is more a bug than a feature, so I think it is safe to alter it.
>>>
>>> Bootstrapped/regtested x86_64-linux, OK?
>>> Honza
>>>
>>> 	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
>>> 	parameter.
>>> 	(try_unroll_loop_completely): Update.
>> The patch hasn't been installed, has it?  The test still takes 20s to
>> compile at -O3 on a fast x86-64 box, so you can imagine what this
>> yields on slower machines (and that's before the x4 because of the
>> various dg-torture options).
> Yes, I need approval for this one.
> http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01798.html

Ok.

Thanks,
Richard.
Re: Reduce complete unrolling/peeling limits

>> here is the updated patch.  It should make the bounds safe enough not
>> to affect codegen of complete unrolling.  There is IMO no way to cut
>> the walk over the loop body w/o affecting codegen in the
>> unroll-for-size mode.  The condition for unrolling to happen is
>> unrolled_size * 2 / 3 < original_size.  The patch makes the function
>> walking the body stop once the minimal number of duplicated insns
>> exceeds the limit (PARAM_MAX_COMPLETELY_PEELED_INSNS).  The formula
>> above allows unlimited duplication when the loop body is large enough.
>> This is more a bug than a feature, so I think it is safe to alter it.
>>
>> Bootstrapped/regtested x86_64-linux, OK?
>> Honza
>>
>> 	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
>> 	parameter.
>> 	(try_unroll_loop_completely): Update.
> The patch hasn't been installed, has it?  The test still takes 20s to
> compile at -O3 on a fast x86-64 box, so you can imagine what this yields
> on slower machines (and that's before the x4 because of the various
> dg-torture options).

Yes, I need approval for this one.
http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01798.html

Honza
Re: Reduce complete unrolling/peeling limits

> here is the updated patch.  It should make the bounds safe enough not
> to affect codegen of complete unrolling.  There is IMO no way to cut
> the walk over the loop body w/o affecting codegen in the
> unroll-for-size mode.  The condition for unrolling to happen is
> unrolled_size * 2 / 3 < original_size.  The patch makes the function
> walking the body stop once the minimal number of duplicated insns
> exceeds the limit (PARAM_MAX_COMPLETELY_PEELED_INSNS).  The formula
> above allows unlimited duplication when the loop body is large enough.
> This is more a bug than a feature, so I think it is safe to alter it.
>
> Bootstrapped/regtested x86_64-linux, OK?
> Honza
>
> 	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
> 	parameter.
> 	(try_unroll_loop_completely): Update.

The patch hasn't been installed, has it?  The test still takes 20s to
compile at -O3 on a fast x86-64 box, so you can imagine what this yields
on slower machines (and that's before the x4 because of the various
dg-torture options).

-- 
Eric Botcazou
Re: Reduce complete unrolling/peeling limits

> ... I believe I posted a patch?

Yes: http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01799.html

I have found another fallout: I have some avatars of the polyhedron tests
where the REAL(8) have been replaced with REAL(10).  Some of them are now

Should I open a new PR for that?

Cheers,
Dominique
Re: Reduce complete unrolling/peeling limits

My mailer has eaten a line in my previous mail.  One should read:

I have found another fallout: I have some avatars of the polyhedron tests
where the REAL(8) have been replaced with REAL(10).  Some of them are now
~50% slower with the new value of max-completely-peeled-insns.

Should I open a new PR for that?

Sorry for the noise.

Dominique
Re: Reduce complete unrolling/peeling limits

On Sun, 18 Nov 2012, Jan Hubicka wrote:
>>>> this patch reduces max-peeled-insns and max-completely-peeled-insns
>>>> from 400 to 100.  The reason why I am doing this is that I want to
>>>> reduce the code bloat caused by my cunroll work that enabled a lot
>>>> more unrolling than previously, causing considerable code size
>>>> regression at -O3.
>>> Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now
>>> again takes a while to compile, so times out on slow machines:
>> I did not :(.  I am currently on a trip, but will take a look on
>> Tuesday.  If it seems to disturb testing, please just revert the patch
>> for the time being.
> OK, here are multiple issues:
> 1) recursive inlining makes a huge loop nest (of 18 loops)
> 2) SCEV is very slow answering simple_iv tests in this case because it
>    walks the nest
> 3) the unroller computes the loop body size even when it is clear the
>    body is much larger than the limit (the outer loop has 78000
>    instructions)
> I will prepare patches to fix those issues.

The recent (well, a week ago) params.def change also regressed
gfortran.dg/reassoc_4.f almost everywhere; see PR55452.  I guess a fix is
fairly trivial, I just don't know to what.

brgds, H-P
Re: Reduce complete unrolling/peeling limits

> On Sun, 18 Nov 2012, Jan Hubicka wrote:
>>>>> this patch reduces max-peeled-insns and max-completely-peeled-insns
>>>>> from 400 to 100.  The reason why I am doing this is that I want to
>>>>> reduce the code bloat caused by my cunroll work that enabled a lot
>>>>> more unrolling than previously, causing considerable code size
>>>>> regression at -O3.
>>>> Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It
>>>> now again takes a while to compile, so times out on slow machines:
>>> I did not :(.  I am currently on a trip, but will take a look on
>>> Tuesday.  If it seems to disturb testing, please just revert the patch
>>> for the time being.
>> OK, here are multiple issues:
>> 1) recursive inlining makes a huge loop nest (of 18 loops)
>> 2) SCEV is very slow answering simple_iv tests in this case because it
>>    walks the nest
>> 3) the unroller computes the loop body size even when it is clear the
>>    body is much larger than the limit (the outer loop has 78000
>>    instructions)
>> I will prepare patches to fix those issues.
> The recent (well, a week ago) params.def change also regressed
> gfortran.dg/reassoc_4.f almost everywhere; see PR55452.  I guess a fix
> is fairly trivial, I just don't know to what.

Yes, we simply should add explicit unrolling limits there; I believe I
posted a patch?  I am currently on my way, I will look up the message
and/or post it.

Honza

> brgds, H-P
Re: Reduce complete unrolling/peeling limits

Hi Jan,

> this is the patch I will try to test once I have a chance :)  It simply
> prevents the unroller from analyzing loops when they are already too
> large.
> ...

This patch breaks bootstrap with

/opt/gcc/p_build/./prev-gcc/g++ -B/opt/gcc/p_build/./prev-gcc/ -B/opt/gcc/gcc4.8p-193652p3/x86_64-apple-darwin10.8.0/bin/ -nostdinc++ -B/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/src/.libs -B/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/libsupc++/.libs -I/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/include/x86_64-apple-darwin10.8.0 -I/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/include -I/opt/gcc/p_work/libstdc++-v3/libsupc++ -L/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/src/.libs -L/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/libsupc++/.libs -c -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -DHAVE_CONFIG_H -I. -I. -I../../p_work/gcc -I../../p_work/gcc/. -I../../p_work/gcc/../include -I./../intl -I../../p_work/gcc/../libcpp/include -I/opt/mp/include -I../../p_work/gcc/../libdecnumber -I../../p_work/gcc/../libdecnumber/dpd -I../libdecnumber -I../../p_work/gcc/../libbacktrace -DCLOOG_INT_GMP -I/opt/mp/include ../../p_work/gcc/tree-ssa-loop-ivopts.c -o tree-ssa-loop-ivopts.o

../../p_work/gcc/tree-ssa-loop-ivcanon.c: In function 'bool canonicalize_loop_induction_variables(loop*, bool, unroll_level, bool)':
../../p_work/gcc/tree-ssa-loop-ivcanon.c:690:62: error: 'n_unroll' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       (!n_unroll_found || (unsigned HOST_WIDE_INT) maxiter < n_unroll))
                                                              ^
../../p_work/gcc/tree-ssa-loop-ivcanon.c:656:26: note: 'n_unroll' was declared here
   unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
                          ^
cc1plus: all warnings being treated as errors
...

I have completed bootstrap with the following change:

--- ../_clean/gcc/tree-ssa-loop-ivcanon.c	2012-11-18 11:27:28.0 +0100
+++ gcc/tree-ssa-loop-ivcanon.c	2012-11-20 16:27:07.0 +0100
@@ -641,9 +641,10 @@ try_unroll_loop_completely (struct loop
 			    enum unroll_level ul, HOST_WIDE_INT maxiter)
 {
-  unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
+  unsigned HOST_WIDE_INT ninsns, max_unroll, unr_insns;
   gimple cond;
   struct loop_size size;
+  unsigned HOST_WIDE_INT n_unroll = 0;
   bool n_unroll_found = false;
   edge edge_to_cancel = NULL;
   int num = loop->num;

After that the compilation of gcc.c-torture/compile/pr43186.c is back to a
fraction of a second, but I see the following regressions:

FAIL: gcc.dg/graphite/interchange-8.c scan-tree-dump-times graphite "will be interchanged" 2
FAIL: gcc.dg/graphite/pr42530.c (internal compiler error)
FAIL: gcc.dg/graphite/pr42530.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Unrolled loop 1 completely .duplicated 2 times.."
FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Last iteration exit edge was proved true."
FAIL: gcc.dg/tree-ssa/cunroll-3.c scan-tree-dump cunrolli "Unrolled loop 1 completely .duplicated 1 times.."
FAIL: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce2 "c.array"
FAIL: gcc.dg/tree-ssa/loop-37.c scan-tree-dump-not optimized "my_array"
FAIL: gcc.dg/tree-ssa/pr21829.c scan-tree-dump-not optimized "if \\("
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test

for both -m32 and -m64, and

+FAIL: gcc.dg/tree-ssa/loadpre6.c scan-tree-dump-times pre "Insertions: 2" 1

with -m32.

Dominique
Re: Reduce complete unrolling/peeling limits

> FAIL: gcc.dg/graphite/interchange-8.c scan-tree-dump-times graphite "will be interchanged" 2
> FAIL: gcc.dg/graphite/pr42530.c (internal compiler error)
> FAIL: gcc.dg/graphite/pr42530.c (test for excess errors)
> FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Unrolled loop 1 completely .duplicated 2 times.."
> FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Last iteration exit edge was proved true."
> FAIL: gcc.dg/tree-ssa/cunroll-3.c scan-tree-dump cunrolli "Unrolled loop 1 completely .duplicated 1 times.."
> FAIL: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce2 "c.array"
> FAIL: gcc.dg/tree-ssa/loop-37.c scan-tree-dump-not optimized "my_array"
> FAIL: gcc.dg/tree-ssa/pr21829.c scan-tree-dump-not optimized "if \\("
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test

Yep, the problem here is the *2/3 heuristic in the estimated unrolled body
size.  I am back to internet access, so I will look into it today or
tomorrow.

Honza

> for both -m32 and -m64, and
>
> +FAIL: gcc.dg/tree-ssa/loadpre6.c scan-tree-dump-times pre "Insertions: 2" 1
>
> with -m32.
>
> Dominique
Re: Reduce complete unrolling/peeling limits

Hi,
here is the updated patch.  It should make the bounds safe enough not to
affect codegen of complete unrolling.  There is IMO no way to cut the walk
over the loop body w/o affecting codegen in the unroll-for-size mode.  The
condition for unrolling to happen is

  unrolled_size * 2 / 3 < original_size

The patch makes the function walking the body stop once the minimal number
of duplicated insns exceeds the limit (PARAM_MAX_COMPLETELY_PEELED_INSNS).
The formula above allows unlimited duplication when the loop body is large
enough.  This is more a bug than a feature, so I think it is safe to alter
it.

Bootstrapped/regtested x86_64-linux, OK?

Honza

	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
	parameter.
	(try_unroll_loop_completely): Update.

Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c	(revision 193694)
+++ tree-ssa-loop-ivcanon.c	(working copy)
@@ -1,5 +1,5 @@
-/* Induction variable canonicalization.
-   Copyright (C) 2004, 2005, 2007, 2008, 2010
+/* Induction variable canonicalization and loop peeling.
+   Copyright (C) 2004, 2005, 2007, 2008, 2010, 2012
    Free Software Foundation, Inc.
 
 This file is part of GCC.
@@ -207,10 +210,12 @@ constant_after_peeling (tree op, gimple
    iteration of the loop.
    EDGE_TO_CANCEL (if non-NULL) is a non-exit edge eliminated in the last
    iteration of loop.
-   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.  */
+   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.
+   Stop estimating after UPPER_BOUND is met.  Return true in this case.  */
 
-static void
-tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size)
+static bool
+tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size,
+			 int upper_bound)
 {
   basic_block *body = get_loop_body (loop);
   gimple_stmt_iterator gsi;
@@ -316,6 +321,12 @@ tree_estimate_loop_size (struct loop *lo
 	      if (likely_eliminated || likely_eliminated_last)
 		size->last_iteration_eliminated_by_peeling += num;
 	    }
+	  if ((size->overall * 3 / 2 - size->eliminated_by_peeling
+	       - size->last_iteration_eliminated_by_peeling) > upper_bound)
+	    {
+	      free (body);
+	      return true;
+	    }
 	}
     }
   while (path.length ())
@@ -357,6 +368,7 @@ tree_estimate_loop_size (struct loop *lo
 	   size->last_iteration_eliminated_by_peeling);
 
   free (body);
+  return false;
 }
 
 /* Estimate number of insns of completely unrolled loop.
@@ -699,12 +711,22 @@ try_unroll_loop_completely (struct loop
   sbitmap wont_exit;
   edge e;
   unsigned i;
-  vec<edge> to_remove = vNULL;
+  bool large;
+  vec<edge> to_remove = vNULL;
 
   if (ul == UL_SINGLE_ITER)
     return false;
 
-  tree_estimate_loop_size (loop, exit, edge_to_cancel, &size);
+  large = tree_estimate_loop_size
+	    (loop, exit, edge_to_cancel, &size,
+	     PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS));
   ninsns = size.overall;
+  if (large)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Not unrolling loop %d: it is too large.\n",
+		 loop->num);
+      return false;
+    }
 
   unr_insns = estimated_unrolled_size (&size, n_unroll);
   if (dump_file && (dump_flags & TDF_DETAILS))
Re: Reduce complete unrolling/peeling limits

>> Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now
>> again takes a while to compile, so times out on slow machines:
>> ...
> On a 2.5GHz Core2Duo, compiling the test with revision 192891
> (2012-10-28) takes a small fraction of a second, while with revision
> 193270 (2012-11-06) it takes ~25s.  However this patch makes
> gfortran.dg/reassoc_4.f fail:
>
> FAIL: gfortran.dg/reassoc_4.f  -O  scan-tree-dump-times reassoc1 [0-9] * 22
>
> After it, 22 should be replaced with 16 (the threshold
> max-completely-peeled-insns<=138 gives 16, >=139 gives 22).

I would propose the following patch instead.  The patch anyway changes the
limits on some targets, so let's change them on all.

Honza

Index: reassoc_4.f
===================================================================
--- reassoc_4.f	(revision 193698)
+++ reassoc_4.f	(working copy)
@@ -1,7 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-O3 -ffast-math -fdump-tree-reassoc1" }
-! { dg-additional-options "--param max-completely-peel-times=16" { target spu-*-* } }
-! { dg-additional-options "--param max-completely-peeled-insns=400" { target s390*-*-* } }
+! { dg-options "-O3 -ffast-math -fdump-tree-reassoc1 --param max-completely-peel-times=16 --param max-completely-peeled-insns=400" }
 subroutine anisonl(w,vo,anisox,s,ii1,jj1,weight)
 integer ii1,jj1,i1,iii1,j1,jjj1,k1,l1,m1,n1
 real*8 w(3,3),vo(3,3),anisox(3,3,3,3),s(60,60),weight

> Dominique
Re: Reduce complete unrolling/peeling limits

Hi,
this is the patch I will try to test once I have a chance :)  It simply
prevents the unroller from analyzing loops when they are already too
large.

	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
	parameter.
	(try_unroll_loop_completely): Update.

Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c	(revision 193598)
+++ tree-ssa-loop-ivcanon.c	(working copy)
@@ -1,5 +1,5 @@
-/* Induction variable canonicalization.
-   Copyright (C) 2004, 2005, 2007, 2008, 2010
+/* Induction variable canonicalization and loop peeling.
+   Copyright (C) 2004, 2005, 2007, 2008, 2010, 2012
    Free Software Foundation, Inc.
 
 This file is part of GCC.
@@ -29,9 +29,12 @@ along with GCC; see the file COPYING3.
    variables.  In that case the created optimization possibilities are
    likely to pay up.
 
-   Additionally in case we detect that it is beneficial to unroll the
-   loop completely, we do it right here to expose the optimization
-   possibilities to the following passes.  */
+   We also perform
+     - complete unrolling (or peeling) when the loop is rolling few enough
+       times
+     - simple peeling (i.e. copying few initial iterations prior the loop)
+       when the number of iterations estimate is known (typically by the
+       profile info).  */
 
 #include "config.h"
 #include "system.h"
@@ -207,10 +210,12 @@ constant_after_peeling (tree op, gimple
    iteration of the loop.
    EDGE_TO_CANCEL (if non-NULL) is a non-exit edge eliminated in the last
    iteration of loop.
-   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.  */
+   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.
+   Stop estimating after UPPER_BOUND is met.  Return true in this case.  */
 
-static void
-tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size)
+static bool
+tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size,
+			 int upper_bound)
 {
   basic_block *body = get_loop_body (loop);
   gimple_stmt_iterator gsi;
@@ -316,6 +321,12 @@ tree_estimate_loop_size (struct loop *lo
 	      if (likely_eliminated || likely_eliminated_last)
 		size->last_iteration_eliminated_by_peeling += num;
 	    }
+	  if ((size->overall - size->eliminated_by_peeling
+	       - size->last_iteration_eliminated_by_peeling) > upper_bound)
+	    {
+	      free (body);
+	      return true;
+	    }
 	}
     }
   while (path.length ())
@@ -357,6 +368,7 @@ tree_estimate_loop_size (struct loop *lo
 	   size->last_iteration_eliminated_by_peeling);
 
   free (body);
+  return false;
 }
 
 /* Estimate number of insns of completely unrolled loop.
@@ -699,12 +711,23 @@ try_unroll_loop_completely (struct loop
   sbitmap wont_exit;
   edge e;
   unsigned i;
+  bool large;
   vec<edge> to_remove = vec<edge>();
 
   if (ul == UL_SINGLE_ITER)
     return false;
 
-  tree_estimate_loop_size (loop, exit, edge_to_cancel, &size);
+  large = tree_estimate_loop_size
+	    (loop, exit, edge_to_cancel, &size,
+	     ul == UL_NO_GROWTH ? 0
+	     : PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS) * 2);
   ninsns = size.overall;
+  if (large)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Not unrolling loop %d: it is too large.\n",
+		 loop->num);
+      return false;
+    }
 
   unr_insns = estimated_unrolled_size (&size, n_unroll);
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -865,6 +888,133 @@ try_unroll_loop_completely (struct loop
   return true;
 }
 
+/* Return number of instructions after peeling.  */
+static unsigned HOST_WIDE_INT
+estimated_peeled_sequence_size (struct loop_size *size,
+				unsigned HOST_WIDE_INT npeel)
+{
+  return MAX (npeel * (HOST_WIDE_INT) (size->overall
+				       - size->eliminated_by_peeling), 1);
+}
+
+/* If the loop is expected to iterate N times and is
+   small enough, duplicate the loop body N+1 times before
+   the loop itself.  This way the hot path will never
+   enter the loop.
+   Parameters are the same as for try_unroll_loops_completely.  */
+
+static bool
+try_peel_loop (struct loop *loop,
+	       edge exit, tree niter,
+	       HOST_WIDE_INT maxiter)
+{
+  int npeel;
+  struct loop_size size;
+  int peeled_size;
+  sbitmap wont_exit;
+  unsigned i;
+  vec<edge> to_remove = vec<edge>();
+  edge e;
+
+  /* If the iteration bound is known and large, then we can safely eliminate
+     the check in peeled copies.  */
+  if (TREE_CODE (niter) != INTEGER_CST)
+    exit = NULL;
+
+  if (!flag_peel_loops || PARAM_VALUE (PARAM_MAX_PEEL_TIMES) <= 0)
+    return false;
+
+  /* Peel only
Re: Reduce complete unrolling/peeling limits

> this patch reduces max-peeled-insns and max-completely-peeled-insns from
> 400 to 100.  The reason why I am doing this is that I want to reduce the
> code bloat caused by my cunroll work that enabled a lot more unrolling
> than previously, causing considerable code size regression at -O3.

Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now
again takes a while to compile, so times out on slow machines:

FAIL: gcc.c-torture/compile/pr43186.c  -O3 -fomit-frame-pointer  (test for excess errors)
FAIL: gcc.c-torture/compile/pr43186.c  -O3 -fomit-frame-pointer -funroll-loops  (test for excess errors)
FAIL: gcc.c-torture/compile/pr43186.c  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (test for excess errors)
FAIL: gcc.c-torture/compile/pr43186.c  -O3 -g  (test for excess errors)
WARNING: program timed out.
WARNING: program timed out.
WARNING: program timed out.
WARNING: program timed out.

-- 
Eric Botcazou
Re: Reduce complete unrolling/peeling limits

> Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now
> again takes a while to compile, so times out on slow machines:
> ...

On a 2.5GHz Core2Duo, compiling the test with revision 192891 (2012-10-28)
takes a small fraction of a second, while with revision 193270 (2012-11-06)
it takes ~25s.  However this patch makes gfortran.dg/reassoc_4.f fail:

FAIL: gfortran.dg/reassoc_4.f  -O  scan-tree-dump-times reassoc1 [0-9] * 22

After it, 22 should be replaced with 16 (the threshold
max-completely-peeled-insns<=138 gives 16, >=139 gives 22).

Dominique
Re: Reduce complete unrolling/peeling limits

>> this patch reduces max-peeled-insns and max-completely-peeled-insns
>> from 400 to 100.  The reason why I am doing this is that I want to
>> reduce the code bloat caused by my cunroll work that enabled a lot more
>> unrolling than previously, causing considerable code size regression
>> at -O3.
> Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now
> again takes a while to compile, so times out on slow machines:

I did not :(.  I am currently on a trip, but will take a look on Tuesday.
If it seems to disturb testing, please just revert the patch for the time
being.

Honza
Re: Reduce complete unrolling/peeling limits

>>> this patch reduces max-peeled-insns and max-completely-peeled-insns
>>> from 400 to 100.  The reason why I am doing this is that I want to
>>> reduce the code bloat caused by my cunroll work that enabled a lot
>>> more unrolling than previously, causing considerable code size
>>> regression at -O3.
>> Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now
>> again takes a while to compile, so times out on slow machines:
> I did not :(.  I am currently on a trip, but will take a look on
> Tuesday.  If it seems to disturb testing, please just revert the patch
> for the time being.

OK, here are multiple issues:
1) recursive inlining makes a huge loop nest (of 18 loops)
2) SCEV is very slow answering simple_iv tests in this case because it
   walks the nest
3) the unroller computes the loop body size even when it is clear the body
   is much larger than the limit (the outer loop has 78000 instructions)

I will prepare patches to fix those issues.

Honza
Re: Reduce complete unrolling/peeling limits

> OK, here are multiple issues:
> 1) recursive inlining makes a huge loop nest (of 18 loops)
> 2) SCEV is very slow answering simple_iv tests in this case because it
>    walks the nest
> 3) the unroller computes the loop body size even when it is clear the
>    body is much larger than the limit (the outer loop has 78000
>    instructions)
> I will prepare patches to fix those issues.

Thanks for the analysis (and don't worry, I won't revert anything :-).

-- 
Eric Botcazou
Re: Reduce complete unrolling/peeling limits

On Thu, Nov 15, 2012 at 12:34:07AM +0100, Jan Hubicka wrote:
> 	* params.def (max-peeled-insns, max-completely-peeled-insns): Reduce
> 	to 100.

Ok, thanks.

> --- params.def	(revision 193505)
> +++ params.def	(working copy)
> @@ -290,7 +290,7 @@ DEFPARAM(PARAM_MAX_UNROLL_TIMES,
>  DEFPARAM(PARAM_MAX_PEELED_INSNS,
> 	 "max-peeled-insns",
> 	 "The maximum number of insns of a peeled loop",
> -	 400, 0, 0)
> +	 100, 0, 0)
> 
>  /* The maximum number of peelings of a single loop.  */
>  DEFPARAM(PARAM_MAX_PEEL_TIMES,
> 	 "max-peel-times",
> @@ -305,7 +305,7 @@ DEFPARAM(PARAM_MAX_PEEL_BRANCHES,
>  DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
> 	 "max-completely-peeled-insns",
> 	 "The maximum number of insns of a completely peeled loop",
> -	 400, 0, 0)
> +	 100, 0, 0)
> 
>  /* The maximum number of peelings of a single loop that is peeled
>     completely.  */
>  DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
> 	 "max-completely-peel-times",

	Jakub
Reduce complete unrolling/peeling limits

Hi,
this patch reduces max-peeled-insns and max-completely-peeled-insns from
400 to 100.  The reason why I am doing this is that I want to reduce the
code bloat caused by my cunroll work that enabled a lot more unrolling
than previously, causing considerable code size regression at -O3.  I do
not think those params were ever seriously tuned, or re-tuned after the
introduction of tree-ssa peeling.

I bootstrapped/regtested x86_64 with a few values (4000, 200, 100, 50) on
spec2000, spec2k6, C++ benchmarks and polyhedron.  I also did partial
tests on ia-64 (that is broken quite a lot now, but I wanted to have some
sanity check that these values are not too x86 specific).  With 4000 (and
also bumped-up max-peel-times/max-completely-peel-times) there are
improvements on

  ammp   1360 -> 1460
  equake 1800 -> 1840
  applu  1450 -> 1500

but I guess those need to be handled by a better heuristic.  Otherwise
there are no performance regressions with going 400 -> 100.  With 50 there
are tiny performance drops on swim and applu.  I plan to follow by testing
the max-peel-times parameters and then doing inliner tests.

Bootstrapped/regtested x86_64-linux, OK?

Honza

	* params.def (max-peeled-insns, max-completely-peeled-insns): Reduce
	to 100.

Index: params.def
===================================================================
--- params.def	(revision 193505)
+++ params.def	(working copy)
@@ -290,7 +290,7 @@ DEFPARAM(PARAM_MAX_UNROLL_TIMES,
 DEFPARAM(PARAM_MAX_PEELED_INSNS,
	 "max-peeled-insns",
	 "The maximum number of insns of a peeled loop",
-	 400, 0, 0)
+	 100, 0, 0)
 
 /* The maximum number of peelings of a single loop.  */
 DEFPARAM(PARAM_MAX_PEEL_TIMES,
	 "max-peel-times",
@@ -305,7 +305,7 @@ DEFPARAM(PARAM_MAX_PEEL_BRANCHES,
 DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
	 "max-completely-peeled-insns",
	 "The maximum number of insns of a completely peeled loop",
-	 400, 0, 0)
+	 100, 0, 0)
 
 /* The maximum number of peelings of a single loop that is peeled
    completely.  */
 DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
	 "max-completely-peel-times",
Hi, this patch reduces max-peeled-insns and max-completely-peeled-insns from 400 to 100. The reason why I am doing this is that I want to reduce code bloat caused by my cunroll work that enabled a lot more unrolling then previously causing considerable code size regression at -O3. I do not think those params was ever serviously tunned, or re-tunned after introduction of tree-ssa peeling. I bootstrapped/regtested x86_64 with few values - 4000, 200, 100, 50 on spec2000,spec2k6,C++ benchmarks and polyhedron. I also did partial tests on ia-64 (that is broken quite a lot now, but I wanted to have some sanity check that these values are not too x86 specific). With 4000 (and also bumped up max-peel-times/max-completely-peel-times) there are improvements on ammp 1360-1460 equake 1800-1840 applu 1450-1500 but i guess those needs to be handled by better heuristic. Otherwise there are no perfromance regression with going 400-100. With 50 there are tiny performance drops on swim and applu. I plan to follow by testing the max-peel times parameters and then doing inliner tests. Bootstrapped/regtested x86_64-linux, OK? Honza * params.def (max-peeled-insns, max-completely-peeled-insns): Reduce to 100. Index: params.def === --- params.def (revision 193505) +++ params.def (working copy) @@ -290,7 +290,7 @@ DEFPARAM(PARAM_MAX_UNROLL_TIMES, DEFPARAM(PARAM_MAX_PEELED_INSNS, max-peeled-insns, The maximum number of insns of a peeled loop, - 400, 0, 0) + 100, 0, 0) /* The maximum number of peelings of a single loop. */ DEFPARAM(PARAM_MAX_PEEL_TIMES, max-peel-times, @@ -305,7 +305,7 @@ DEFPARAM(PARAM_MAX_PEEL_BRANCHES, DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS, max-completely-peeled-insns, The maximum number of insns of a completely peeled loop, - 400, 0, 0) + 100, 0, 0) /* The maximum number of peelings of a single loop that is peeled completely. */ DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES, max-completely-peel-times,