Re: Reduce complette unrolling peeling limits

2012-12-06 Thread Richard Biener
On Tue, 4 Dec 2012, Jan Hubicka wrote:

   here is updated patch.  It should get the bounds safe enough to not have
   effect on codegen of complete unrolling.
   
   There is IMO no way to cut the walk of loop body w/o affecting codegen in
   unrolling for size mode.  The condition on unrolling to happen is
   
    unrolled_size * 2 / 3 < original_size
   
   The patch makes the function that walks the body stop once the minimal
   number of duplicated insns exceeds the limit
   (PARAM_MAX_COMPLETELY_PEELED_INSNS).  The formula above allows unlimited
   duplication when the loop body is large enough.  This is more a bug than a
   feature, so I think it is safe to alter it.
   
   Bootstrapped/regtested x86_64-linux, OK?
   
   Honza
   
 * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
 parameter.
 (try_unroll_loop_completely) Update.
  
  The patch hasn't been installed, has it?  The test still takes 20s to
  compile at -O3 on a fast x86-64 box, so you can imagine what this yields
  on slower machines (and that's before the x4 because of the various
  dg-torture options).
 
 Yes, I need approval for this one.
 http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01798.html

Ok.

Thanks,
Richard.


Re: Reduce complette unrolling peeling limits

2012-12-04 Thread Jan Hubicka
  here is updated patch.  It should get the bounds safe enough to not have
  effect on codegen of complete unrolling.
  
  There is IMO no way to cut the walk of loop body w/o affecting codegen in
  unrolling for size mode.  The condition on unrolling to happen is
  
   unrolled_size * 2 / 3 < original_size
  
  The patch makes the function that walks the body stop once the minimal
  number of duplicated insns exceeds the limit
  (PARAM_MAX_COMPLETELY_PEELED_INSNS).  The formula above allows unlimited
  duplication when the loop body is large enough.  This is more a bug than a
  feature, so I think it is safe to alter it.
  
  Bootstrapped/regtested x86_64-linux, OK?
  
  Honza
  
  * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
  parameter.
  (try_unroll_loop_completely) Update.
 
 The patch hasn't been installed, has it?  The test still takes 20s to compile 
 at -O3 on a fast x86-64 box, so you can imagine what this yields on slower 
 machines (and that's before the x4 because of the various dg-torture options).

Yes, I need approval for this one.
http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01798.html

Honza


Re: Reduce complette unrolling peeling limits

2012-12-03 Thread Eric Botcazou
 here is updated patch.  It should get the bounds safe enough to not have
 effect on codegen of complete unrolling.
 
 There is IMO no way to cut the walk of loop body w/o affecting codegen in
 unrolling for size mode.  The condition on unrolling to happen is
 
  unrolled_size * 2 / 3 < original_size
 
 The patch makes the function that walks the body stop once the minimal
 number of duplicated insns exceeds the limit
 (PARAM_MAX_COMPLETELY_PEELED_INSNS).  The formula above allows unlimited
 duplication when the loop body is large enough.  This is more a bug than a
 feature, so I think it is safe to alter it.
 
 Bootstrapped/regtested x86_64-linux, OK?
 
 Honza
 
   * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
   parameter.
   (try_unroll_loop_completely) Update.

The patch hasn't been installed, has it?  The test still takes 20s to compile 
at -O3 on a fast x86-64 box, so you can imagine what this yields on slower 
machines (and that's before the x4 because of the various dg-torture options).

-- 
Eric Botcazou


Re: Reduce complette unrolling peeling limits

2012-11-25 Thread Dominique Dhumieres
 ... I believe I posted a patch?

Yes: http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01799.html

I have found another fall out: I have some avatars of the polyhedron tests
where the REAL(8) have been replaced with REAL(10). Some of them are now
Should I open a new PR for that?

Cheers,

Dominique


Re: Reduce complette unrolling peeling limits

2012-11-25 Thread Dominique Dhumieres
My mailer has eaten a line in my previous mail. One should read:

I have found another fall out: I have some avatars of the polyhedron tests
where the REAL(8) have been replaced with REAL(10). Some of them are now ~50%
slower with the new value of max-completely-peeled-insns.
Should I open a new PR for that?

Sorry for the noise.

Dominique


Re: Reduce complette unrolling peeling limits

2012-11-23 Thread Hans-Peter Nilsson
On Sun, 18 Nov 2012, Jan Hubicka wrote:
this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
to 100.  The reason why I am doing this is that I want to reduce code bloat
caused by my cunroll work that enabled a lot more unrolling than previously
causing considerable code size regression at -O3.
  
   Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again
   takes a while to compile, so times out on slow machines:
 
  I did not :(.  I am currently on a trip, but will take a look on tuesday.
  If it seems to disturb testing, please just revert the patch for time being.

 OK, here are multiple issues.
 1) recursive inlining makes huge loop nest (of 18 loops)
 2) SCEV is very slow on answering simple_iv tests in this case because it
    walks the nest
 3) unroller is computing loop body size even when it is clear the body is
    much larger than the limit (the outer loop has 78000 instructions)

 I will prepare patches to fix those issues.

The recent (well, a week ago) params.def change also regressed
gfortran.dg/reassoc_4.f almost everywhere; see PR55452.

I guess a fix is fairly trivial, I just don't know to what.

brgds, H-P


Re: Reduce complette unrolling peeling limits

2012-11-23 Thread Jan Hubicka
 On Sun, 18 Nov 2012, Jan Hubicka wrote:
 this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
 to 100.  The reason why I am doing this is that I want to reduce code bloat
 caused by my cunroll work that enabled a lot more unrolling than previously
 causing considerable code size regression at -O3.
   
Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again
takes a while to compile, so times out on slow machines:
  
   I did not :(.  I am currently on a trip, but will take a look on tuesday.
   If it seems to disturb testing, please just revert the patch for time being.
 
  OK, here are multiple issues.
  1) recursive inlining makes huge loop nest (of 18 loops)
  2) SCEV is very slow on answering simple_iv tests in this case because it
     walks the nest
  3) unroller is computing loop body size even when it is clear the body is
     much larger than the limit (the outer loop has 78000 instructions)
 
  I will prepare patches to fix those issues.
 
 The recent (well, a week ago) params.def change also regressed
 gfortran.dg/reassoc_4.f almost everywhere; see PR55452.
 
 I guess a fix is fairly trivial, I just don't know to what.

Yes, we simply should add explicit unrolling limits there; I believe I posted a
patch?  I am currently traveling; I will look up the message and/or post it.

Honza
 
 brgds, H-P


Re: Reduce complette unrolling peeling limits

2012-11-21 Thread Dominique Dhumieres
Hi Jan,

 this is the patch I will try to test once I have a chance :)
 It simply prevents unroller from analyzing loops when they are already too
 large.
 ...

This patch breaks bootstrap with

...
/opt/gcc/p_build/./prev-gcc/g++ -B/opt/gcc/p_build/./prev-gcc/ 
-B/opt/gcc/gcc4.8p-193652p3/x86_64-apple-darwin10.8.0/bin/ -nostdinc++ 
-B/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/src/.libs 
-B/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/libsupc++/.libs 
-I/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/include/x86_64-apple-darwin10.8.0
 -I/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/include 
-I/opt/gcc/p_work/libstdc++-v3/libsupc++ 
-L/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/src/.libs 
-L/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/libsupc++/.libs 
-c   -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC   -fno-exceptions -fno-rtti 
-fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings 
-Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long 
-Wno-variadic-macros -Wno-overlength-strings -Werror   -DHAVE_CONFIG_H -I. -I. 
-I../../p_work/gcc -I../../p_work/gcc/. -I../../p_work/gcc/../include 
-I./../intl -I../../p_work/gcc/../libcpp/include -I/opt/mp/include  
-I../../p_work/gcc/../libdecnumber -I../../p_work/gcc/../libdecnumber/dpd 
-I../libdecnumber -I../../p_work/gcc/../libbacktrace -DCLOOG_INT_GMP  
-I/opt/mp/include  ../../p_work/gcc/tree-ssa-loop-ivopts.c -o 
tree-ssa-loop-ivopts.o
../../p_work/gcc/tree-ssa-loop-ivcanon.c: In function 'bool canonicalize_loop_induction_variables(loop*, bool, unroll_level, bool)':
../../p_work/gcc/tree-ssa-loop-ivcanon.c:690:62: error: 'n_unroll' may be used uninitialized in this function [-Werror=maybe-uninitialized]
(!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
  ^
../../p_work/gcc/tree-ssa-loop-ivcanon.c:656:26: note: 'n_unroll' was declared here
   unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
  ^
cc1plus: all warnings being treated as errors
...

I have completed bootstrap with the following change

--- ../_clean/gcc/tree-ssa-loop-ivcanon.c   2012-11-18 11:27:28.0 +0100
+++ gcc/tree-ssa-loop-ivcanon.c 2012-11-20 16:27:07.0 +0100
@@ -641,9 +641,10 @@ try_unroll_loop_completely (struct loop 
enum unroll_level ul,
HOST_WIDE_INT maxiter)
 {
-  unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
+  unsigned HOST_WIDE_INT ninsns, max_unroll, unr_insns;
   gimple cond;
   struct loop_size size;
+  unsigned HOST_WIDE_INT n_unroll = 0;
   bool n_unroll_found = false;
   edge edge_to_cancel = NULL;
   int num = loop->num;
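
For illustration, a minimal sketch of the pattern behind the warning and the fix above. The helper `lookup_iteration_count` and the simplified types are hypothetical and are not GCC's code; only the flag-plus-value shape of the condition is taken from the error message.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the iteration-count lookup: reports whether a
   count is known and, only in that case, stores it in *n.  */
static bool lookup_iteration_count(int known_count, unsigned long *n)
{
    if (known_count < 0)
        return false;              /* *n is deliberately left untouched */
    *n = (unsigned long)known_count;
    return true;
}

/* n_unroll is only meaningful when n_unroll_found is true, but a compiler
   cannot always prove that the two stay in sync.  The explicit initializer
   silences -Wmaybe-uninitialized without changing behavior, because the
   value is never read while the flag is false.  */
static unsigned long pick_unroll_count(int known_count, unsigned long maxiter)
{
    unsigned long n_unroll = 0;    /* the fix: explicit initializer */
    bool n_unroll_found = lookup_iteration_count(known_count, &n_unroll);

    if (!n_unroll_found || maxiter < n_unroll)
        n_unroll = maxiter;        /* fall back to the iteration bound */
    return n_unroll;
}
```

Calling `pick_unroll_count(-1, 8)` exercises the path where the count is unknown, and `pick_unroll_count(4, 8)` the path where it is known and below the bound.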

After that the compilation of gcc.c-torture/compile/pr43186.c is back to
a fraction of a second, but I see the following regressions:

FAIL: gcc.dg/graphite/interchange-8.c scan-tree-dump-times graphite will be interchanged 2
FAIL: gcc.dg/graphite/pr42530.c (internal compiler error)
FAIL: gcc.dg/graphite/pr42530.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli Unrolled loop 1 completely .duplicated 2 times..
FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli Last iteration exit edge was proved true.
FAIL: gcc.dg/tree-ssa/cunroll-3.c scan-tree-dump cunrolli Unrolled loop 1 completely .duplicated 1 times..
FAIL: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce2 c.array
FAIL: gcc.dg/tree-ssa/loop-37.c scan-tree-dump-not optimized my_array
FAIL: gcc.dg/tree-ssa/pr21829.c scan-tree-dump-not optimized if \\(
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test

for both -m32 and -m64 +

FAIL: gcc.dg/tree-ssa/loadpre6.c scan-tree-dump-times pre Insertions: 2 1

with -m32.

Dominique



Re: Reduce complette unrolling peeling limits

2012-11-21 Thread Jan Hubicka
 FAIL: gcc.dg/graphite/interchange-8.c scan-tree-dump-times graphite will be interchanged 2
 FAIL: gcc.dg/graphite/pr42530.c (internal compiler error)
 FAIL: gcc.dg/graphite/pr42530.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli Unrolled loop 1 completely .duplicated 2 times..
 FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli Last iteration exit edge was proved true.
 FAIL: gcc.dg/tree-ssa/cunroll-3.c scan-tree-dump cunrolli Unrolled loop 1 completely .duplicated 1 times..
 FAIL: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce2 c.array
 FAIL: gcc.dg/tree-ssa/loop-37.c scan-tree-dump-not optimized my_array
 FAIL: gcc.dg/tree-ssa/pr21829.c scan-tree-dump-not optimized if \\(
 FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
 FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
 FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
 FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test
 FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
 FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
 FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
 FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test

Yep, the problem here is the *2/3 heuristic in the unroller's body size
estimate.  I am back to internet access, so I will look into it today or
tomorrow.

Honza
 
 for both -m32 and -m64 +
 
 FAIL: gcc.dg/tree-ssa/loadpre6.c scan-tree-dump-times pre Insertions: 2 1
 
 with -m32.
 
 Dominique


Re: Reduce complette unrolling peeling limits

2012-11-21 Thread Jan Hubicka
Hi,
here is updated patch.  It should get the bounds safe enough to not have effect
on codegen of complete unrolling.

There is IMO no way to cut the walk of loop body w/o affecting codegen in
unrolling for size mode.  The condition on unrolling to happen is

 unrolled_size * 2 / 3 < original_size

The patch makes the function that walks the body stop once the minimal
number of duplicated insns exceeds the limit
(PARAM_MAX_COMPLETELY_PEELED_INSNS).  The formula above allows unlimited
duplication when the loop body is large enough.  This is more a bug than a
feature, so I think it is safe to alter it.
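
As a rough illustration of the point about unbounded duplication, here is a simplified model of the size-mode test. It assumes the condition is that the estimate, scaled by 2/3, stays below the original body size. The struct and function names are hypothetical; the real estimate also accounts for the last iteration, which this sketch leaves out.

```c
/* Simplified, hypothetical model of the size estimate discussed above.  */
struct loop_size_model {
    long overall;                 /* insns in one copy of the loop body */
    long eliminated_by_peeling;   /* insns expected to fold away per copy */
};

/* Estimated size after complete unrolling: n_unroll copies, each shrunk by
   what peeling eliminates, then scaled by the optimistic 2/3 factor.  */
static long estimated_unrolled_size_model(const struct loop_size_model *s,
                                          long n_unroll)
{
    long size = n_unroll * (s->overall - s->eliminated_by_peeling);
    size = size * 2 / 3;
    return size > 0 ? size : 1;
}

/* Size-mode test: unroll only if the estimate stays below the original.
   Note that no absolute instruction limit appears anywhere in this test.  */
static int unrolls_for_size(const struct loop_size_model *s, long n_unroll)
{
    return estimated_unrolled_size_model(s, n_unroll) < s->overall;
}
```

Under this model a 78000-insn body that rolls once gives an estimate of 52000, which is below 78000, so the test passes however large the body is. A 30-insn body unrolled 4 times with nothing eliminated gives 80 and is rejected.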

Bootstrapped/regtested x86_64-linux, OK?

Honza

* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
parameter.
(try_unroll_loop_completely) Update.


Index: tree-ssa-loop-ivcanon.c
===
--- tree-ssa-loop-ivcanon.c (revision 193694)
+++ tree-ssa-loop-ivcanon.c (working copy)
@@ -1,5 +1,5 @@
-/* Induction variable canonicalization.
-   Copyright (C) 2004, 2005, 2007, 2008, 2010
+/* Induction variable canonicalization and loop peeling.
+   Copyright (C) 2004, 2005, 2007, 2008, 2010, 2012
Free Software Foundation, Inc.
 
 This file is part of GCC.
@@ -207,10 +210,12 @@ constant_after_peeling (tree op, gimple
iteration of the loop.
EDGE_TO_CANCEL (if non-NULL) is an non-exit edge eliminated in the last iteration
of loop.
-   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.  */
+   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT. 
+   Stop estimating after UPPER_BOUND is met. Return true in this case */
 
-static void
-tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size)
+static bool
+tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size,
+int upper_bound)
 {
   basic_block *body = get_loop_body (loop);
   gimple_stmt_iterator gsi;
@@ -316,6 +321,12 @@ tree_estimate_loop_size (struct loop *lo
  if (likely_eliminated || likely_eliminated_last)
size->last_iteration_eliminated_by_peeling += num;
}
+ if ((size->overall * 3 / 2 - size->eliminated_by_peeling
+ - size->last_iteration_eliminated_by_peeling) > upper_bound)
+   {
+  free (body);
+ return true;
+   }
}
 }
   while (path.length ())
@@ -357,6 +368,7 @@ tree_estimate_loop_size (struct loop *lo
 size->last_iteration_eliminated_by_peeling);
 
   free (body);
+  return false;
 }
 
 /* Estimate number of insns of completely unrolled loop.
@@ -699,12 +711,22 @@ try_unroll_loop_completely (struct loop
   sbitmap wont_exit;
   edge e;
   unsigned i;
-  vec<edge> to_remove = vNULL;
+  bool large;
+  vec<edge> to_remove = vNULL;
   if (ul == UL_SINGLE_ITER)
return false;
 
-  tree_estimate_loop_size (loop, exit, edge_to_cancel, &size);
+  large = tree_estimate_loop_size
+(loop, exit, edge_to_cancel, &size,
+ PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS));
   ninsns = size.overall;
+  if (large)
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, "Not unrolling loop %d: it is too large.\n",
+loop->num);
+ return false;
+   }
 
   unr_insns = estimated_unrolled_size (&size, n_unroll);
   if (dump_file && (dump_flags & TDF_DETAILS))
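
The early exit this patch adds can be sketched in isolation as follows. This is a minimal model, not the GCC implementation: `struct stmt_cost`, `struct size_estimate`, and `estimate_size_bounded` are hypothetical names, the walk is over a flat array rather than basic blocks, and the last-iteration term of the real check is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-statement cost record.  */
struct stmt_cost {
    int size;                     /* estimated insns for this statement */
    bool eliminated_by_peeling;   /* expected to fold away after unrolling */
};

struct size_estimate {
    int overall;
    int eliminated_by_peeling;
};

/* Accumulate sizes and stop as soon as the running estimate exceeds
   upper_bound.  Returns true when the walk was cut short ("too large"),
   following the bool return the patch gives tree_estimate_loop_size.
   *visited reports how many statements were actually examined.  */
static bool estimate_size_bounded(const struct stmt_cost *stmts, size_t n,
                                  int upper_bound, struct size_estimate *out,
                                  size_t *visited)
{
    out->overall = 0;
    out->eliminated_by_peeling = 0;
    for (size_t i = 0; i < n; i++) {
        out->overall += stmts[i].size;
        if (stmts[i].eliminated_by_peeling)
            out->eliminated_by_peeling += stmts[i].size;
        /* Same shape as the check in the patch: overall scaled by 3/2,
           minus what peeling is expected to eliminate.  */
        if (out->overall * 3 / 2 - out->eliminated_by_peeling > upper_bound) {
            *visited = i + 1;
            return true;
        }
    }
    *visited = n;
    return false;
}
```

With 1000 statements of size 10 and a bound of 100, the walk stops after a handful of statements instead of visiting all of them, which is the compile-time saving the thread is after for the 78000-insn outer loop.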


Re: Reduce complette unrolling peeling limits

2012-11-21 Thread Jan Hubicka
  Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again
  takes a while to compile, so times out on slow machines:
  ...
 
 On a 2.5Ghz Core2Duo, compiling the test with revision 192891 (2012-10-28)
 takes a small fraction of a second, while with revision 193270 (2012-11-06)
 it takes ~25s.
 
 However, this patch makes gfortran.dg/reassoc_4.f fail:
 
 FAIL: gfortran.dg/reassoc_4.f  -O   scan-tree-dump-times reassoc1 [0-9] * 22
 
 After it, 22 should be replaced with 16 (threshold
 max-completely-peeled-insns=138 gives 16, =139 gives 22).

I would propose the following patch instead.  The patch changes the limits on
some targets anyway, so let's change them on all.

Honza

Index: reassoc_4.f
===
--- reassoc_4.f (revision 193698)
+++ reassoc_4.f (working copy)
@@ -1,7 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-O3 -ffast-math -fdump-tree-reassoc1" }
-! { dg-additional-options "--param max-completely-peel-times=16" { target spu-*-* } }
-! { dg-additional-options "--param max-completely-peeled-insns=400" { target s390*-*-* } }
+! { dg-options "-O3 -ffast-math -fdump-tree-reassoc1 --param max-completely-peel-times=16 --param max-completely-peeled-insns=400" }
   subroutine anisonl(w,vo,anisox,s,ii1,jj1,weight)
   integer ii1,jj1,i1,iii1,j1,jjj1,k1,l1,m1,n1
   real*8 w(3,3),vo(3,3),anisox(3,3,3,3),s(60,60),weight

 
 Dominique


Re: Reduce complette unrolling peeling limits

2012-11-19 Thread Jan Hubicka
Hi,
this is the patch I will try to test once I have a chance :)
It simply prevents unroller from analyzing loops when they are already too large.

* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
parameter.
(try_unroll_loop_completely) Update.
Index: tree-ssa-loop-ivcanon.c
===
--- tree-ssa-loop-ivcanon.c (revision 193598)
+++ tree-ssa-loop-ivcanon.c (working copy)
@@ -1,5 +1,5 @@
-/* Induction variable canonicalization.
-   Copyright (C) 2004, 2005, 2007, 2008, 2010
+/* Induction variable canonicalization and loop peeling.
+   Copyright (C) 2004, 2005, 2007, 2008, 2010, 2012
Free Software Foundation, Inc.
 
 This file is part of GCC.
@@ -29,9 +29,12 @@ along with GCC; see the file COPYING3.
variables.  In that case the created optimization possibilities are likely
to pay up.
 
-   Additionally in case we detect that it is beneficial to unroll the
-   loop completely, we do it right here to expose the optimization
-   possibilities to the following passes.  */
+   We also perform
+ - complette unrolling (or peeling) when the loops is rolling few enough
+   times
+ - simple peeling (i.e. copying few initial iterations prior the loop)
+   when number of iteration estimate is known (typically by the profile
+   info).  */
 
 #include "config.h"
 #include "system.h"
@@ -207,10 +210,12 @@ constant_after_peeling (tree op, gimple
iteration of the loop.
EDGE_TO_CANCEL (if non-NULL) is an non-exit edge eliminated in the last iteration
of loop.
-   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.  */
+   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT. 
+   Stop estimating after UPPER_BOUND is met. Return true in this case */
 
-static void
-tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size)
+static bool
+tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size,
+int upper_bound)
 {
   basic_block *body = get_loop_body (loop);
   gimple_stmt_iterator gsi;
@@ -316,6 +321,12 @@ tree_estimate_loop_size (struct loop *lo
  if (likely_eliminated || likely_eliminated_last)
size->last_iteration_eliminated_by_peeling += num;
}
+ if ((size->overall - size->eliminated_by_peeling
+ - size->last_iteration_eliminated_by_peeling) > upper_bound)
+   {
+  free (body);
+ return true;
+   }
}
 }
   while (path.length ())
@@ -357,6 +368,7 @@ tree_estimate_loop_size (struct loop *lo
 size->last_iteration_eliminated_by_peeling);
 
   free (body);
+  return false;
 }
 
 /* Estimate number of insns of completely unrolled loop.
@@ -699,12 +711,23 @@ try_unroll_loop_completely (struct loop
   sbitmap wont_exit;
   edge e;
   unsigned i;
+  bool large;
   vec<edge> to_remove = vec<edge>();
   if (ul == UL_SINGLE_ITER)
return false;
 
-  tree_estimate_loop_size (loop, exit, edge_to_cancel, &size);
+  large = tree_estimate_loop_size
+(loop, exit, edge_to_cancel, &size,
+ ul == UL_NO_GROWTH ? 0
+ : PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS) * 2);
   ninsns = size.overall;
+  if (large)
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, "Not unrolling loop %d: it is too large.\n",
+loop->num);
+ return false;
+   }
 
   unr_insns = estimated_unrolled_size (&size, n_unroll);
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -865,6 +888,133 @@ try_unroll_loop_completely (struct loop
   return true;
 }
 
+/* Return number of instructions after peeling.  */
+static unsigned HOST_WIDE_INT
+estimated_peeled_sequence_size (struct loop_size *size,
+   unsigned HOST_WIDE_INT npeel)
+{
+  return MAX (npeel * (HOST_WIDE_INT) (size->overall
+  - size->eliminated_by_peeling), 1);
+}
+
+/* If the loop is expected to iterate N times and is
+   small enough, duplicate the loop body N+1 times before
+   the loop itself.  This way the hot path will never
+   enter the loop.  
+   Parameters are the same as for try_unroll_loops_completely */
+
+static bool
+try_peel_loop (struct loop *loop,
+  edge exit, tree niter,
+  HOST_WIDE_INT maxiter)
+{
+  int npeel;
+  struct loop_size size;
+  int peeled_size;
+  sbitmap wont_exit;
+  unsigned i;
+  vec<edge> to_remove = vec<edge>();
+  edge e;
+
+  /* If the iteration bound is known and large, then we can safely eliminate
+ the check in peeled copies.  */
+  if (TREE_CODE (niter) != INTEGER_CST)
+exit = NULL;
+
+  if (!flag_peel_loops || PARAM_VALUE (PARAM_MAX_PEEL_TIMES) <= 0)
+return false;
+
+  /* Peel only 

Re: Reduce complette unrolling peeling limits

2012-11-18 Thread Eric Botcazou
 this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
 to 100.  The reason why I am doing this is that I want to reduce code bloat
 caused by my cunroll work that enabled a lot more unrolling than previously
 causing considerable code size regression at -O3.

Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again 
takes a while to compile, so times out on slow machines:

FAIL: gcc.c-torture/compile/pr43186.c  -O3 -fomit-frame-pointer  (test for excess errors)
FAIL: gcc.c-torture/compile/pr43186.c  -O3 -fomit-frame-pointer -funroll-loops  (test for excess errors)
FAIL: gcc.c-torture/compile/pr43186.c  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (test for excess errors)
FAIL: gcc.c-torture/compile/pr43186.c  -O3 -g  (test for excess errors)
WARNING: program timed out.
WARNING: program timed out.
WARNING: program timed out.
WARNING: program timed out.

-- 
Eric Botcazou


Re: Reduce complette unrolling peeling limits

2012-11-18 Thread Dominique Dhumieres
 Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again
 takes a while to compile, so times out on slow machines:
 ...

On a 2.5Ghz Core2Duo, compiling the test with revision 192891 (2012-10-28)
takes a small fraction of a second, while with revision 193270 (2012-11-06)
it takes ~25s.

However, this patch makes gfortran.dg/reassoc_4.f fail:

FAIL: gfortran.dg/reassoc_4.f  -O   scan-tree-dump-times reassoc1 [0-9] * 22

After it, 22 should be replaced with 16 (threshold
max-completely-peeled-insns=138 gives 16, =139 gives 22).

Dominique


Re: Reduce complette unrolling peeling limits

2012-11-18 Thread Jan Hubicka
  this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
  to 100.  The reason why I am doing this is that I want to reduce code bloat
  caused by my cunroll work that enabled a lot more unrolling than previously
  causing considerable code size regression at -O3.
 
 Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again 
 takes a while to compile, so times out on slow machines:

I did not :(.  I am currently on a trip, but will take a look on tuesday.
If it seems to disturb testing, please just revert the patch for time being.

Honza


Re: Reduce complette unrolling peeling limits

2012-11-18 Thread Jan Hubicka
   this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
   to 100.  The reason why I am doing this is that I want to reduce code bloat
   caused by my cunroll work that enabled a lot more unrolling than previously
   causing considerable code size regression at -O3.
  
  Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again
  takes a while to compile, so times out on slow machines:
 
 I did not :(.  I am currently on a trip, but will take a look on tuesday.
 If it seems to disturb testing, please just revert the patch for time being.

OK, here are multiple issues.
1) recursive inlining makes huge loop nest (of 18 loops)
2) SCEV is very slow on answering simple_iv tests in this case because it walks
   the nest
3) unroller is computing loop body size even when it is clear the body is much
   larger than the limit (the outer loop has 78000 instructions)

I will prepare patches to fix those issues. 

Honza
 
 Honza


Re: Reduce complette unrolling peeling limits

2012-11-18 Thread Eric Botcazou
 OK, here are multiple issues.
 1) recursive inlining makes huge loop nest (of 18 loops)
 2) SCEV is very slow on answering simple_iv tests in this case because it
 walks the nest
 3) unroller is computing loop body size even when it is clear the body is
 much larger than the limit (the outer loop has 78000 instructions)
 
 I will prepare patches to fix those issues.

Thanks for the analysis (and don't worry, I won't revert anything :-).

-- 
Eric Botcazou


Re: Reduce complette unrolling peeling limits

2012-11-15 Thread Jakub Jelinek
On Thu, Nov 15, 2012 at 12:34:07AM +0100, Jan Hubicka wrote:
   * params.def (max-peeled-insns, max-completely-peeled-insns): Reduce to 100.

Ok, thanks.

 --- params.def(revision 193505)
 +++ params.def(working copy)
 @@ -290,7 +290,7 @@ DEFPARAM(PARAM_MAX_UNROLL_TIMES,
  DEFPARAM(PARAM_MAX_PEELED_INSNS,
   "max-peeled-insns",
   "The maximum number of insns of a peeled loop",
 - 400, 0, 0)
 + 100, 0, 0)
  /* The maximum number of peelings of a single loop.  */
  DEFPARAM(PARAM_MAX_PEEL_TIMES,
   "max-peel-times",
 @@ -305,7 +305,7 @@ DEFPARAM(PARAM_MAX_PEEL_BRANCHES,
  DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
   "max-completely-peeled-insns",
   "The maximum number of insns of a completely peeled loop",
 - 400, 0, 0)
 + 100, 0, 0)
  /* The maximum number of peelings of a single loop that is peeled completely.  */
  DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
   "max-completely-peel-times",

Jakub


Reduce complette unrolling peeling limits

2012-11-14 Thread Jan Hubicka
Hi,
this patch reduces max-peeled-insns and max-completely-peeled-insns from 400 to
100.  The reason why I am doing this is that I want to reduce code bloat caused
by my cunroll work that enabled a lot more unrolling than previously causing
considerable code size regression at -O3.

I do not think those params were ever seriously tuned, or re-tuned after the
introduction of tree-ssa peeling.  I bootstrapped/regtested x86_64 with a few
values - 4000, 200, 100, 50 - on spec2000, spec2k6, C++ benchmarks and polyhedron.

I also did partial tests on ia-64 (that is broken quite a lot now, but I wanted
to have some sanity check that these values are not too x86 specific).

With 4000 (and also bumped up max-peel-times/max-completely-peel-times) there
are improvements on
  ammp 1360 -> 1460
  equake 1800 -> 1840
  applu 1450 -> 1500
but I guess those need to be handled by a better heuristic.
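
To put the three pairs above on a common scale, a small helper for the relative change; it assumes the figures are before/after scores where higher is better, which the message implies by calling them improvements. The function name is hypothetical.

```c
/* Relative change, in percent, from a baseline score to a new score.  */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

Applied to the figures above this gives roughly 7.4% for ammp, 2.2% for equake and 3.4% for applu.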

Otherwise there is no performance regression going 400 -> 100.  With 50
there are tiny performance drops on swim and applu.

I plan to follow by testing the max-peel times parameters and then doing inliner
tests.

Bootstrapped/regtested x86_64-linux, OK?

Honza

* params.def (max-peeled-insns, max-completely-peeled-insns): Reduce to 100.
Index: params.def
===
--- params.def  (revision 193505)
+++ params.def  (working copy)
@@ -290,7 +290,7 @@ DEFPARAM(PARAM_MAX_UNROLL_TIMES,
 DEFPARAM(PARAM_MAX_PEELED_INSNS,
"max-peeled-insns",
"The maximum number of insns of a peeled loop",
-   400, 0, 0)
+   100, 0, 0)
 /* The maximum number of peelings of a single loop.  */
 DEFPARAM(PARAM_MAX_PEEL_TIMES,
"max-peel-times",
@@ -305,7 +305,7 @@ DEFPARAM(PARAM_MAX_PEEL_BRANCHES,
 DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
"max-completely-peeled-insns",
"The maximum number of insns of a completely peeled loop",
-   400, 0, 0)
+   100, 0, 0)
 /* The maximum number of peelings of a single loop that is peeled completely.  */
 DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
"max-completely-peel-times",
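
For readers unfamiliar with this kind of table, the following is a miniature sketch of how a list of name/help/default/min/max entries can be expanded into both an enum and a runtime table that a `--param name=value` option could override. All `DEMO_*` and `demo_*` names are hypothetical, and the sketch uses a single X-macro list rather than a separate .def file, so it is an analogy and not how GCC itself consumes params.def.

```c
#include <string.h>

/* Hypothetical miniature of a params.def-style table: each entry gives an
   enum name, the command-line spelling, help text, default, min and max.  */
#define DEMO_PARAMS(X) \
    X(DEMO_MAX_PEELED_INSNS, "max-peeled-insns", \
      "The maximum number of insns of a peeled loop", 100, 0, 0) \
    X(DEMO_MAX_COMPLETELY_PEELED_INSNS, "max-completely-peeled-insns", \
      "The maximum number of insns of a completely peeled loop", 100, 0, 0)

/* First expansion: the enum of parameter indices.  */
#define AS_ENUM(id, name, help, def, min, max) id,
enum demo_param { DEMO_PARAMS(AS_ENUM) DEMO_PARAM_COUNT };

struct demo_param_info {
    const char *name;
    const char *help;
    int value, min, max;
};

/* Second expansion: the table of names and default values.  */
#define AS_INFO(id, name, help, def, min, max) { name, help, def, min, max },
static struct demo_param_info demo_params[] = { DEMO_PARAMS(AS_INFO) };

/* Override a parameter by its command-line name; returns 0 on success
   and -1 if no parameter has that name.  */
static int demo_set_param(const char *name, int value)
{
    for (int i = 0; i < DEMO_PARAM_COUNT; i++) {
        if (strcmp(demo_params[i].name, name) == 0) {
            demo_params[i].value = value;
            return 0;
        }
    }
    return -1;
}
```

In this model, changing a default from 400 to 100 is a one-line edit to the list, and a test that depends on the old behavior can restore it with the equivalent of `demo_set_param("max-completely-peeled-insns", 400)`, which is what the reassoc_4.f change in this thread does through `--param`.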