[Bug tree-optimization/52272] [4.7 regression] Performance regression of 410.bwaves on x86.

2012-02-20 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52272

--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-02-20 12:21:19 UTC ---
Even though it makes sense (I think), the patch regresses more benchmarks
than it fixes, and it does not fully fix the 410.bwaves regression.  Deferring
to 4.8, as I don't have any better ideas off the top of my head.


[Bug tree-optimization/52272] [4.7 regression] Performance regression of 410.bwaves on x86.

2012-02-16 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52272

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2012-02-16
          Component|regression                  |tree-optimization
         AssignedTo|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |
     Ever Confirmed|0                           |1
   Target Milestone|---                         |4.7.0

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-02-16 10:39:34 UTC ---
I didn't see the regression on a SandyBridge machine with -Ofast -funroll-loops
-fpeel-loops -march=native [-flto].  But I can see the regression on an
AMD Bulldozer machine.

I will have a look.  Note that the patch was to fix a wrong-code issue, so
we may have to live with the performance regression.


[Bug tree-optimization/52272] [4.7 regression] Performance regression of 410.bwaves on x86.

2012-02-16 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52272

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rakdver at gcc dot gnu.org

--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-02-16 12:41:30 UTC ---
Proposed patch:

Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c  (revision 184304)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -2506,6 +2506,15 @@ add_iv_value_candidates (struct ivopts_d
   if (offset
       || base != iv->base)
     add_candidate (data, base, iv->step, false, use);
+
+  /* Fourth, try removing the base-object for pointer IVs.  */
+  if (TREE_CODE (iv->base) == POINTER_PLUS_EXPR)
+    {
+      tree base_object = iv->base_object;
+      STRIP_NOPS (base_object);
+      if (operand_equal_p (TREE_OPERAND (iv->base, 0), base_object, 0))
+        add_candidate (data, TREE_OPERAND (iv->base, 1), iv->step, false, use);
+    }
 }

 /* Adds candidates based on the uses.  */
@@ -4062,7 +4071,13 @@ get_computation_cost_at (struct ivopts_d
       if (use->iv->base_object
           && cand->iv->base_object
           && !operand_equal_p (use->iv->base_object, cand->iv->base_object, 0))
-        return infinite_cost;
+        {
+          if (dump_file && (dump_flags & TDF_DETAILS))
+            fprintf (dump_file, "Not considering candidate %d for use %d "
+                     "because they have different pointer bases%s\n",
+                     cand->id, use->id, address_p ? "(address_p)" : "");
+          return infinite_cost;
+        }
     }

   if (TYPE_PRECISION (utype) < TYPE_PRECISION (ctype))
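
As a rough illustration of the check in the second hunk, here is a toy model in
plain C.  It is not GCC code: the names toy_iv, toy_computation_cost and
TOY_INFINITE_COST are hypothetical, and the base objects are compared as plain
pointers rather than with operand_equal_p on trees.  It only shows that a
candidate with no base object (as the first hunk adds) stays usable for any
pointer use, while a candidate based on a different object is rejected.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for an induction variable: only the base
   object matters for this check.  */
struct toy_iv
{
  const void *base_object;   /* NULL when the IV is not pointer-based.  */
};

#define TOY_INFINITE_COST INT_MAX

/* Return FINITE_COST unless both USE and CAND have a base object and
   the two base objects differ, in which case the pairing is rejected.  */
int
toy_computation_cost (const struct toy_iv *use, const struct toy_iv *cand,
                      int finite_cost)
{
  assert (use != NULL && cand != NULL);

  if (use->base_object != NULL
      && cand->base_object != NULL
      && use->base_object != cand->base_object)
    return TOY_INFINITE_COST;

  return finite_cost;
}
```

With this model, a use based on one array paired with a candidate based on
another array gets TOY_INFINITE_COST, while the same use paired with a
base-less sizetype candidate keeps its finite cost.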


[Bug tree-optimization/52272] [4.7 regression] Performance regression of 410.bwaves on x86.

2012-02-16 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52272

--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-02-16 12:40:06 UTC ---
Before the patch we choose

Improved to:
  cost: 128 (complexity 0)
  cand_cost: 19
  cand_use_cost: 28 (complexity 0)
  candidates: 2, 4, 7
   use:0 --> iv_cand:4, cost=(2,0)
   use:1 --> iv_cand:4, cost=(2,0)
   use:2 --> iv_cand:2, cost=(0,0)
   use:3 --> iv_cand:7, cost=(0,0)
   use:4 --> iv_cand:7, cost=(4,0)
   use:5 --> iv_cand:7, cost=(4,0)
   use:6 --> iv_cand:7, cost=(4,0)
   use:7 --> iv_cand:7, cost=(4,0)
   use:8 --> iv_cand:7, cost=(4,0)
   use:9 --> iv_cand:7, cost=(4,0)

and now we do not consider for example candidate 7 for use 4:

candidate 7
  var_before ivtmp.190
  var_after ivtmp.190
  incremented before exit test
  type character(kind=4)
  base (character(kind=4)) (a_296(D) + (((sizetype) stride.88_9 + (sizetype)
pretmp.141_661) + 1) * 8)
  step 8
  base object (void *) a_296(D)

use 4
  generic
  in statement D.2322_387 = axp_318(D) + D.2321_367;

  at position
  type real(kind=8)[0:D.1963] * restrict
  base axp_318(D) + (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1)
* 8
  step 8
  base object (void *) axp_318(D)
  related candidates

and we really do not want to do that because of the wrong-code issue.
We instead end up with

Improved to:
  cost: 133 (complexity 7)
  cand_cost: 13
  cand_use_cost: 39 (complexity 7)
  candidates: 4, 5
   use:0 --> iv_cand:4, cost=(2,0)
   use:1 --> iv_cand:4, cost=(2,0)
   use:2 --> iv_cand:5, cost=(0,0)
   use:3 --> iv_cand:5, cost=(5,1)
   use:4 --> iv_cand:5, cost=(5,1)
   use:5 --> iv_cand:5, cost=(5,1)
   use:6 --> iv_cand:5, cost=(5,1)
   use:7 --> iv_cand:5, cost=(5,1)
   use:8 --> iv_cand:5, cost=(5,1)
   use:9 --> iv_cand:5, cost=(5,1)

where

candidate 5 (important)
  var_before ivtmp.188 
  var_after ivtmp.188
  incremented before exit test
  type sizetype
  base 0
  step 8

I think what we are missing in order to relate uses 4 to 9, which are all of
the form
 base parameter + (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1)
* 8
is a candidate which has the base object stripped and thus only tracks
 (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
which we at least have as an IV:
ssa name D.2332_451
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
and redundant:
ssa name D.2354_680
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
ssa name D.2343_692
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
ssa name D.2365_752
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
ssa name D.2376_763
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
but no associated candidate(s).  If we add a candidate for it (9) we
end up with

Improved to:
  cost: 131 (complexity 0)
  cand_cost: 15
  cand_use_cost: 35 (complexity 0)
  candidates: 4, 9
   use:0 --> iv_cand:4, cost=(2,0)
   use:1 --> iv_cand:4, cost=(2,0)
   use:2 --> iv_cand:9, cost=(3,0)
   use:3 --> iv_cand:9, cost=(4,0)
   use:4 --> iv_cand:9, cost=(4,0)
   use:5 --> iv_cand:9, cost=(4,0)
   use:6 --> iv_cand:9, cost=(4,0)
   use:7 --> iv_cand:9, cost=(4,0)
   use:8 --> iv_cand:9, cost=(4,0)
   use:9 --> iv_cand:9, cost=(4,0)

but with that change we now unroll the innermost loop twice, so I'm not
sure it will pay off.  Even for the original patch that caused the
regression, the code generation differences are only in scheduling
and register allocation (so -fschedule-insns or -fsched-pressure may
recover it).
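
To illustrate what a candidate with the base object stripped means at the
source level, here is a minimal C sketch.  It is not the bwaves source and not
what ivopts literally emits; the function and parameter names are hypothetical.
The first function advances one pointer per array, so each induction variable
is tied to one base object.  The second keeps a single byte offset of type
size_t, with a step of sizeof (double) (8 on x86), and adds it to each base at
the point of use, which is roughly the shape that candidate 9 models.

```c
#include <assert.h>
#include <stddef.h>

/* One induction variable per array: each pointer is advanced
   separately and belongs to exactly one base object.  */
double
dot_pointer_ivs (const double *a, const double *b, size_t start, size_t n)
{
  assert (a != NULL && b != NULL);

  const double *pa = a + start;
  const double *pb = b + start;
  double sum = 0.0;

  for (size_t i = 0; i < n; i++)
    {
      sum += *pa * *pb;
      pa++;
      pb++;
    }
  return sum;
}

/* One shared induction variable: a byte offset that is not tied to
   any base object and is added to each base when it is used.  */
double
dot_offset_iv (const double *a, const double *b, size_t start, size_t n)
{
  assert (a != NULL && b != NULL);

  size_t off = start * sizeof (double);
  size_t end = off + n * sizeof (double);
  double sum = 0.0;

  for (; off != end; off += sizeof (double))
    {
      const double *pa = (const double *) ((const char *) a + off);
      const double *pb = (const double *) ((const char *) b + off);
      sum += *pa * *pb;
    }
  return sum;
}
```

Both functions compute the same result; the second needs only one variable
updated per iteration regardless of how many arrays share the same offset
expression.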


[Bug tree-optimization/52272] [4.7 regression] Performance regression of 410.bwaves on x86.

2012-02-16 Thread vbyakovl23 at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52272

--- Comment #6 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-02-16 14:42:36 UTC ---
I've checked. The patch fixes the regression. Thanks.