Re: [PATCH] Address lowering [1/3] Main patch

2011-07-20 Thread Richard Guenther
On Tue, 19 Jul 2011, William J. Schmidt wrote:

 I've been distracted by other things, but got back to this today...
 
 On Wed, 2011-07-06 at 16:58 +0200, Richard Guenther wrote:
  Ah, so we still have the ARRAY_REFs here.  Yeah, well - then the
  issue boils down to get_inner_reference canonicalizing the offset
  according to what fold-const.c implements, and we simply emit
  the offset in that form (if you look at the normal_inner_ref: case
  in expand_expr_real*).  Thus with the above form at expansion
  time the matching would be applied there (or during our RTL
  address-generation in fwprop.c ...).
 
 I put together a simple patch to match the specific pattern from the
 original problem report in expand_expr_real_1.  This returns the code
 generation for that specific pattern to what it was a few years ago, but
 has little effect on SPEC cpu2000 results.  A small handful of
 benchmarks are improved, but most results are in the noise range.  So
 solving the original problem alone doesn't account for very much of the
 improvement we're getting with the full address lowering.
 
 Note that this only converted the basic pattern; I did not attempt to do
 the extra zero-offset mem-ref optimization, which I would have had to
 rewrite for an RTL phase.  I don't see much point in pursuing that,
 given these results.
 
  
  Another idea is of course to lower all handled_component_p
  memory accesses - something I did on the old mem-ref branch
  and I believe I also suggested at some point.  Without such lowering
  all the address-generation isn't exposed to the GIMPLE optimizers.
  
  The simplest way of doing the lowering is to do sth like
  
base = get_inner_reference (rhs, ..., bitpos, offset, ...);
base = fold_build2 (POINTER_PLUS_EXPR, ...,
                    base, offset);
base = force_gimple_operand (base, ... is_gimple_mem_ref_addr);
tem = fold_build2 (MEM_REF, TREE_TYPE (rhs),
                   base,
                   build_int_cst (get_alias_ptr_type (rhs),
                                  bitpos));
  
  at some point.  I didn't end up doing that because of the loss of
  alias information.
 
 My next experiment will be to try something simple along these lines.
 I'll look at the old mem-ref branch to see what you were doing there.
 It will be interesting to see whether the gains generally outweigh the
 small loss of TBAA information, as we saw using the affine-combination
 approach.
 
 As an FYI, this is the patch I used for my experiment.  Let me know if
 you see anything horrifically wrong that might have impacted the
 results.  I didn't make any attempt to figure out whether the pattern
 rewrite is appropriate for the target machine, since this was just an
 experiment.

I wonder if the code below triggered at all: since we expand from SSA
we no longer see the larger trees in place, but have to look them up
via SSA defs using get_gimple_for_ssa_name (or the helper
get_def_for_expr).  So I expect the check for a MULT_EXPR offset will
only trigger if the memory reference was still in the form of a
variable ARRAY_REF (then get_inner_reference will produce an offset
of the form i * sizeof (element)).  I think what we also want to
catch here is the case of *(ptr + i), which will not show up as an
ARRAY_REF but as MEM[ptr + 0], where ptr is an SSA name whose
defining statement computes ptr' + i * sizeof (element).
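
As a hypothetical illustration of that *(ptr + i) case (the SSA names
and dump syntax below are approximate, not taken from a real dump):

  int
  get_elem (int *p, long i)
  {
    /* In SSA form this becomes roughly
         _1 = i_2(D) * 4;                 i * sizeof (element)
         p_3 = p_4(D) + _1;               POINTER_PLUS_EXPR
         _5 = MEM[(int *) p_3 + 0B];
       so expand_expr_real_1 only sees a MEM_REF with a zero offset;
       the MULT_EXPR is reachable only through the SSA definition of
       p_3, e.g. via get_def_for_expr / get_gimple_for_ssa_name.  */
    return *(p + i);
  }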

Richard.

 Thanks,
 Bill
 
 Index: gcc/expr.c
 ===
 --- gcc/expr.c(revision 176422)
 +++ gcc/expr.c(working copy)
 @@ -7152,7 +7152,64 @@ expand_constructor (tree exp, rtx target, enum exp
return target;
  }
  
 +/* Look for specific patterns in the inner reference BASE and its
 +   OFFSET expression.  If found, return a single tree of type EXPTYPE
 +   that reassociates the addressability to combine constants.  */
 +static tree
 +reassociate_base_and_offset (tree base, tree offset, tree exptype)
 +{
 +  tree mult_op0, mult_op1, t1, t2;
 +  tree mult_expr, add_expr, mem_ref;
 +  tree offset_type;
 +  HOST_WIDE_INT c1, c2, c3, c;
  
 +  /* Currently we look for the following pattern, where Tj is an
 + arbitrary tree, and Ck is an integer constant:
 +
 + base:MEM_REF (T1, C1)
 + offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
 +
 + This is converted to:
 +
 +  MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
 +   C1 + C2*C3)
 +  */
 +  if (!base
 +  || !offset
 +  || TREE_CODE (base) != MEM_REF
 +  || TREE_CODE (TREE_OPERAND (base, 1)) != INTEGER_CST
 +  || TREE_CODE (offset) != MULT_EXPR)
 +return NULL_TREE;
 +
 +  mult_op0 = TREE_OPERAND (offset, 0);
 +  mult_op1 = TREE_OPERAND (offset, 1);
 +
 +  if (TREE_CODE (mult_op0) != PLUS_EXPR
 +  || TREE_CODE (mult_op1) != INTEGER_CST
 +  || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
 +return NULL_TREE;
 +
 +  t1 = TREE_OPERAND (base, 0);
 +  t2 

Re: [PATCH] Address lowering [1/3] Main patch

2011-07-19 Thread William J. Schmidt
I've been distracted by other things, but got back to this today...

On Wed, 2011-07-06 at 16:58 +0200, Richard Guenther wrote:
 Ah, so we still have the ARRAY_REFs here.  Yeah, well - then the
 issue boils down to get_inner_reference canonicalizing the offset
 according to what fold-const.c implements, and we simply emit
 the offset in that form (if you look at the normal_inner_ref: case
 in expand_expr_real*).  Thus with the above form at expansion
 time the matching would be applied there (or during our RTL
 address-generation in fwprop.c ...).

I put together a simple patch to match the specific pattern from the
original problem report in expand_expr_real_1.  This returns the code
generation for that specific pattern to what it was a few years ago, but
has little effect on SPEC cpu2000 results.  A small handful of
benchmarks are improved, but most results are in the noise range.  So
solving the original problem alone doesn't account for very much of the
improvement we're getting with the full address lowering.

Note that this only converted the basic pattern; I did not attempt to do
the extra zero-offset mem-ref optimization, which I would have had to
rewrite for an RTL phase.  I don't see much point in pursuing that,
given these results.

 
 Another idea is of course to lower all handled_component_p
 memory accesses - something I did on the old mem-ref branch
 and I believe I also suggested at some point.  Without such lowering
 all the address-generation isn't exposed to the GIMPLE optimizers.
 
 The simplest way of doing the lowering is to do sth like
 
   base = get_inner_reference (rhs, ..., bitpos, offset, ...);
   base = fold_build2 (POINTER_PLUS_EXPR, ...,
                       base, offset);
   base = force_gimple_operand (base, ... is_gimple_mem_ref_addr);
   tem = fold_build2 (MEM_REF, TREE_TYPE (rhs),
                      base,
                      build_int_cst (get_alias_ptr_type (rhs),
                                     bitpos));
 
 at some point.  I didn't end up doing that because of the loss of
 alias information.

My next experiment will be to try something simple along these lines.
I'll look at the old mem-ref branch to see what you were doing there.
It will be interesting to see whether the gains generally outweigh the
small loss of TBAA information, as we saw using the affine-combination
approach.
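
To make the alias concern concrete, here is a hypothetical sketch of
what such a get_inner_reference-based lowering does to a component
access (the struct layout, SSA names and dump syntax are made up for
illustration):

  struct inner { int a[8]; };
  struct outer { int pad; struct inner f; };

  int
  load_elem (struct outer *s, long i)
  {
    /* Kept as a component/array reference this is roughly
         _1 = s_2(D)->f.a[i_3(D)];
       Lowered along the lines of the sketch it becomes roughly
         off_4 = i_3(D) * 4;
         addr_5 = s_2(D) + off_4;          POINTER_PLUS_EXPR
         _1 = MEM[(int *) addr_5 + 4B];    constant part from bitpos
       The COMPONENT_REF/ARRAY_REF path is gone, so field-based (TBAA)
       disambiguation is reduced to the MEM_REF's pointer type.  */
    return s->f.a[i];
  }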

As an FYI, this is the patch I used for my experiment.  Let me know if
you see anything horrifically wrong that might have impacted the
results.  I didn't make any attempt to figure out whether the pattern
rewrite is appropriate for the target machine, since this was just an
experiment.

Thanks,
Bill

Index: gcc/expr.c
===
--- gcc/expr.c  (revision 176422)
+++ gcc/expr.c  (working copy)
@@ -7152,7 +7152,64 @@ expand_constructor (tree exp, rtx target, enum exp
   return target;
 }
 
+/* Look for specific patterns in the inner reference BASE and its
+   OFFSET expression.  If found, return a single tree of type EXPTYPE
+   that reassociates the addressability to combine constants.  */
+static tree
+reassociate_base_and_offset (tree base, tree offset, tree exptype)
+{
+  tree mult_op0, mult_op1, t1, t2;
+  tree mult_expr, add_expr, mem_ref;
+  tree offset_type;
+  HOST_WIDE_INT c1, c2, c3, c;
 
+  /* Currently we look for the following pattern, where Tj is an
+ arbitrary tree, and Ck is an integer constant:
+
+ base:MEM_REF (T1, C1)
+ offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
+
+ This is converted to:
+
+  MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
+   C1 + C2*C3)
+  */
+  if (!base
+  || !offset
+  || TREE_CODE (base) != MEM_REF
+  || TREE_CODE (TREE_OPERAND (base, 1)) != INTEGER_CST
+  || TREE_CODE (offset) != MULT_EXPR)
+return NULL_TREE;
+
+  mult_op0 = TREE_OPERAND (offset, 0);
+  mult_op1 = TREE_OPERAND (offset, 1);
+
+  if (TREE_CODE (mult_op0) != PLUS_EXPR
+  || TREE_CODE (mult_op1) != INTEGER_CST
+  || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
+return NULL_TREE;
+
+  t1 = TREE_OPERAND (base, 0);
+  t2 = TREE_OPERAND (mult_op0, 0);
+  c1 = TREE_INT_CST_LOW (TREE_OPERAND (base, 1));
+  c2 = TREE_INT_CST_LOW (TREE_OPERAND (mult_op0, 1));
+  c3 = TREE_INT_CST_LOW (mult_op1);
+  c = c1 + c2 * c3;
+
+  offset_type = TREE_TYPE (TREE_OPERAND (base, 1));
+
+  mult_expr = build2 (MULT_EXPR, sizetype, t2, 
+ build_int_cst (offset_type, c3));
+
+  add_expr = build2 (POINTER_PLUS_EXPR, TREE_TYPE (t1), t1, mult_expr);
+
+  mem_ref = build2 (MEM_REF, exptype, add_expr, 
+   build_int_cst (offset_type, c));
+
+  return mem_ref;
+}
+
+
 /* expand_expr: generate code for computing expression EXP.
An rtx for the computed value is returned.  The value is never null.
In the case of a void EXP, const0_rtx is returned.
@@ -9034,7 +9091,7 @@ expand_expr_real_1 (tree exp, rtx target, enum mac

Re: [PATCH] Address lowering [1/3] Main patch

2011-07-08 Thread William J. Schmidt
On Mon, 2011-07-04 at 17:30 +0200, Michael Matz wrote:
 Hi,
 
 On Mon, 4 Jul 2011, Richard Guenther wrote:
 
  I still do not like the implementation of yet another CSE machinery
  given that we already have two.
 
 From reading it it really seems to be a normal block-local CSE, without 
 anything fancy.  Hence, moving the pass just a little earlier (before 
 pass_vrp/pass_dominator) should already provide for all optimizations.  If 
 not those should be improved.
 
 I see that it is used for also getting rid of the zero-offset statements 
 in case non-zero-offsets follow.  I think that's generally worthwhile so 
 probably should be done in one of the above optimizers.

Just FYI, I've verified that this works as expected; the zero-offset
optimization can be moved into the dom2 pass without too much
difficulty, and the CSE in the dom2 pass is sufficient for what I've
seen in my limited testing.  This reduces the size and complexity of the
patch considerably -- thanks!

If it turns out that we end up going this route, it will mean modifying
at least 11 more scan tests whose expected output out of dom2 changes.

However, I'll be turning my attention now to some of the alternate
implementations that Richard suggested, to see how much of the gains can
be obtained by other means.

Thanks,
Bill




Re: [PATCH] Address lowering [1/3] Main patch

2011-07-06 Thread Richard Guenther
On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
wschm...@linux.vnet.ibm.com wrote:
 (Sorry for the late response; yesterday was a holiday here.)

 On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
 On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
 wschm...@linux.vnet.ibm.com wrote:
  This is the first of three patches related to lowering addressing
  expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
  contains the new pass together with supporting changes in existing
  modules.  The second patch contains an independent change to the RTL
  forward propagator to keep it from undoing an optimization made in the
  first patch.  The third patch contains new test cases and changes to
  existing test cases.
 
  Although I've broken it up into three patches to make the review easier,
  it would be best to commit at least the first and third together to
  avoid regressions.  The second can stand alone.
 
  I've done regression tests on powerpc64 and x86_64, and have asked
  Andreas Krebbel to test against the IBM z (390) platform.  I've done
  performance regression testing on powerpc64.  The only performance
  regression of note is the 2% degradation to 188.ammp due to loss of
  field disambiguation information.  As discussed in another thread,
  fixing this introduces more complexity than it's worth.

 Are there also performance improvements?  What about code size?

 Yes, there are performance improvements.  I've been running cpu2000 on
 32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
 between 1% and 9%, with 187.facerec showing the largest improvements for
 both 32 and 64.

 I don't have formal code size results, but anecdotally from code
 crawling, I have seen code size either neutral or slightly improved.
 The largest code size improvements I've seen were on 32-bit code where
 the commoning allowed removal of a number of sign-extend and zero-extend
 instructions that were otherwise not seen to be redundant.

 I tried to get an understanding of what kind of optimizations this patch
 produces based on the set of testcases you added, but I have a hard
 time here.  Can you outline some please?


 The primary goal is to clean up code such as is shown in the original
 post of PR46556.  In late 2007 there were some changes made to address
 canonicalization that caused the code gen to be suboptimal on powerpc64.
 At that time you and others suggested a pattern recognizer prior to
 expand as probably the best solution, similar to what IVopts is doing.

The PR46556 case looks quite simple.

 By using the same mem_ref generation machinery used by IVopts, together
 with local CSE, the goal was to ensure base registers are properly
 shared so that optimal code is generated, particularly for cases of
 shared addressability to structures and arrays.  I also observed cases
 where it was useful to extend the sharing across the dominator tree.

As you are doing IV selection per individual statement only, using
the affine combination machinery looks quite a big hammer to me.
Especially as it is hard to imagine what the side-effects are, apart
from re-associating dependencies that do not fit the MEM-REF and
making the MEM-REF as complicated as permitted by the target.

What I thought originally when suggesting to do something similar
to IVOPTs was to build a list of candidates and uses and optimize
that set using a cost function similar to how IVOPTs does.

Doing addressing-mode selection locally per statement seems like
more a task for a few pattern matchers, for example in tree-ssa-forwprop.c
(for its last invocation).  One pattern would be that of PR46556,
MEM[(p + ((n + 16)*4))] which we can transform to
TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
a single-use.  The TARGET_MEM_REF generation can easily
re-use the address-description and target-availability checks from
tree-ssa-address.c.  I would be at least interested in whether
handling the pattern from PR46556 alone (or maybe with a few
similar other cases) is responsible for the performance improvements.
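
For reference, a hypothetical testcase in the spirit of PR46556 (the
field layout here is assumed, not copied from the PR) that produces
this address shape:

  struct x { int a[16]; int c[16]; int d[16]; };

  int
  sum3 (struct x *p, long n)
  {
    /* p->c[n] is p + 64 + n*4, which fold-const canonicalizes to the
       MEM[(p + ((n + 16)*4))] form above.  The suggested transform
       would give roughly
         x_1 = p + n*4;
         ... = TARGET_MEM_REF [x_1 + 64];
       letting p->a[n] and p->d[n] reuse the same base x_1 with
       offsets 0 and 128.  */
    return p->a[n] + p->c[n] + p->d[n];
  }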

Ideally we'd of course have a cost driven machinery that considers
(part of) the whole function.

 Secondarily, I noticed that once this was cleaned up we still had the
 suboptimal code generation for the zero-offset mem refs scenario
 outlined in the code commentary.  The extra logic to clean this up helps
 keep register pressure down.  I've observed some spill code reduction
 when this is in place.  Again, using expression availability from
 dominating blocks helps here in some cases as well.

Yeah, the case is quite odd and doesn't really fit existing optimizers
given that the CSE opportunity is hidden within the TARGET_MEM_REF ...

 I still do not like the implementation of yet another CSE machinery
 given that we already have two.  I think most of the need for CSE
 comes from the use of the affine combination framework and
 force_gimple_operand.  In fact I'd be interested to see cases 

Re: [PATCH] Address lowering [1/3] Main patch

2011-07-06 Thread William J. Schmidt
On Wed, 2011-07-06 at 15:16 +0200, Richard Guenther wrote:
 On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
 wschm...@linux.vnet.ibm.com wrote:
  (Sorry for the late response; yesterday was a holiday here.)
 
  On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
  On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
  wschm...@linux.vnet.ibm.com wrote:
   This is the first of three patches related to lowering addressing
   expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
   contains the new pass together with supporting changes in existing
   modules.  The second patch contains an independent change to the RTL
   forward propagator to keep it from undoing an optimization made in the
   first patch.  The third patch contains new test cases and changes to
   existing test cases.
  
   Although I've broken it up into three patches to make the review easier,
   it would be best to commit at least the first and third together to
   avoid regressions.  The second can stand alone.
  
   I've done regression tests on powerpc64 and x86_64, and have asked
   Andreas Krebbel to test against the IBM z (390) platform.  I've done
   performance regression testing on powerpc64.  The only performance
   regression of note is the 2% degradation to 188.ammp due to loss of
   field disambiguation information.  As discussed in another thread,
   fixing this introduces more complexity than it's worth.
 
  Are there also performance improvements?  What about code size?
 
  Yes, there are performance improvements.  I've been running cpu2000 on
  32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
  between 1% and 9%, with 187.facerec showing the largest improvements for
  both 32 and 64.
 
  I don't have formal code size results, but anecdotally from code
  crawling, I have seen code size either neutral or slightly improved.
  The largest code size improvements I've seen were on 32-bit code where
  the commoning allowed removal of a number of sign-extend and zero-extend
  instructions that were otherwise not seen to be redundant.
 
  I tried to get an understanding of what kind of optimizations this patch
  produces based on the set of testcases you added, but I have a hard
  time here.  Can you outline some please?
 
 
  The primary goal is to clean up code such as is shown in the original
  post of PR46556.  In late 2007 there were some changes made to address
  canonicalization that caused the code gen to be suboptimal on powerpc64.
  At that time you and others suggested a pattern recognizer prior to
  expand as probably the best solution, similar to what IVopts is doing.
 
 The PR46556 case looks quite simple.

It certainly is.  I was personally curious whether there were other
suboptimal sequences that might be hiding out there, that a more general
approach might expose.  There was a comment at the end of the bugzilla
about a pass to expose target addressing modes in gimple for this
purpose.  When I first started looking at this, I looked for some
feedback from the community about whether that should be done, and got a
few favorable comments along with one negative one.  So that's how we
got on this road...

 
  By using the same mem_ref generation machinery used by IVopts, together
  with local CSE, the goal was to ensure base registers are properly
  shared so that optimal code is generated, particularly for cases of
  shared addressability to structures and arrays.  I also observed cases
  where it was useful to extend the sharing across the dominator tree.
 
 As you are doing IV selection per individual statement only, using
 the affine combination machinery looks quite a big hammer to me.
 Especially as it is hard to imagine what the side-effects are, apart
 from re-associating dependencies that do not fit the MEM-REF and
 making the MEM-REF as complicated as permitted by the target.
 
 What I thought originally when suggesting to do something similar
 to IVOPTs was to build a list of candidates and uses and optimize
 that set using a cost function similar to how IVOPTs does.
 
OK, reading back I can see that now...

 Doing addressing-mode selection locally per statement seems like
 more a task for a few pattern matchers, for example in tree-ssa-forwprop.c
 (for its last invocation).  One pattern would be that of PR46556,
 MEM[(p + ((n + 16)*4))] which we can transform to
 TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
 a single-use.  The TARGET_MEM_REF generation can easily
 re-use the address-description and target-availability checks from
 tree-ssa-address.c.  I would be at least interested in whether
 handling the pattern from PR46556 alone (or maybe with a few
 similar other cases) is responsible for the performance improvements.
 
Hm, but I don't think forwprop sees the code in this form.  At the time
the last pass of forwprop runs, the gimple for the original problem is:

  D.1997_3 = p_1(D)->a[n_2(D)];
  D.1998_4 = p_1(D)->c[n_2(D)];
  D.1999_5 = 

Re: [PATCH] Address lowering [1/3] Main patch

2011-07-06 Thread Richard Guenther
On Wed, Jul 6, 2011 at 4:28 PM, William J. Schmidt
wschm...@linux.vnet.ibm.com wrote:
 On Wed, 2011-07-06 at 15:16 +0200, Richard Guenther wrote:
 On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
 wschm...@linux.vnet.ibm.com wrote:
  (Sorry for the late response; yesterday was a holiday here.)
 
  On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
  On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
  wschm...@linux.vnet.ibm.com wrote:
   This is the first of three patches related to lowering addressing
   expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
   contains the new pass together with supporting changes in existing
   modules.  The second patch contains an independent change to the RTL
   forward propagator to keep it from undoing an optimization made in the
   first patch.  The third patch contains new test cases and changes to
   existing test cases.
  
   Although I've broken it up into three patches to make the review easier,
   it would be best to commit at least the first and third together to
   avoid regressions.  The second can stand alone.
  
   I've done regression tests on powerpc64 and x86_64, and have asked
   Andreas Krebbel to test against the IBM z (390) platform.  I've done
   performance regression testing on powerpc64.  The only performance
   regression of note is the 2% degradation to 188.ammp due to loss of
   field disambiguation information.  As discussed in another thread,
   fixing this introduces more complexity than it's worth.
 
  Are there also performance improvements?  What about code size?
 
  Yes, there are performance improvements.  I've been running cpu2000 on
  32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
  between 1% and 9%, with 187.facerec showing the largest improvements for
  both 32 and 64.
 
  I don't have formal code size results, but anecdotally from code
  crawling, I have seen code size either neutral or slightly improved.
  The largest code size improvements I've seen were on 32-bit code where
  the commoning allowed removal of a number of sign-extend and zero-extend
  instructions that were otherwise not seen to be redundant.
 
  I tried to get an understanding of what kind of optimizations this patch
  produces based on the set of testcases you added, but I have a hard
  time here.  Can you outline some please?
 
 
  The primary goal is to clean up code such as is shown in the original
  post of PR46556.  In late 2007 there were some changes made to address
  canonicalization that caused the code gen to be suboptimal on powerpc64.
  At that time you and others suggested a pattern recognizer prior to
  expand as probably the best solution, similar to what IVopts is doing.

 The PR46556 case looks quite simple.

 It certainly is.  I was personally curious whether there were other
 suboptimal sequences that might be hiding out there, that a more general
 approach might expose.  There was a comment at the end of the bugzilla
 about a pass to expose target addressing modes in gimple for this
 purpose.  When I first started looking at this, I looked for some
 feedback from the community about whether that should be done, and got a
 few favorable comments along with one negative one.  So that's how we
 got on this road...


  By using the same mem_ref generation machinery used by IVopts, together
  with local CSE, the goal was to ensure base registers are properly
  shared so that optimal code is generated, particularly for cases of
  shared addressability to structures and arrays.  I also observed cases
  where it was useful to extend the sharing across the dominator tree.

 As you are doing IV selection per individual statement only, using
 the affine combination machinery looks quite a big hammer to me.
 Especially as it is hard to imagine what the side-effects are, apart
 from re-associating dependencies that do not fit the MEM-REF and
 making the MEM-REF as complicated as permitted by the target.

 What I thought originally when suggesting to do something similar
 to IVOPTs was to build a list of candidates and uses and optimize
 that set using a cost function similar to how IVOPTs does.

 OK, reading back I can see that now...

 Doing addressing-mode selection locally per statement seems like
 more a task for a few pattern matchers, for example in tree-ssa-forwprop.c
 (for its last invocation).  One pattern would be that of PR46556,
 MEM[(p + ((n + 16)*4))] which we can transform to
 TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
 a single-use.  The TARGET_MEM_REF generation can easily
 re-use the address-description and target-availability checks from
 tree-ssa-address.c.  I would be at least interested in whether
 handling the pattern from PR46556 alone (or maybe with a few
 similar other cases) is responsible for the performance improvements.

 Hm, but I don't think forwprop sees the code in this form.  At the time
 the last pass of forwprop runs, the gimple for the 

Re: [PATCH] Address lowering [1/3] Main patch

2011-07-05 Thread William J. Schmidt
(Sorry for the late response; yesterday was a holiday here.)

On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
 On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
 wschm...@linux.vnet.ibm.com wrote:
  This is the first of three patches related to lowering addressing
  expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
  contains the new pass together with supporting changes in existing
  modules.  The second patch contains an independent change to the RTL
  forward propagator to keep it from undoing an optimization made in the
  first patch.  The third patch contains new test cases and changes to
  existing test cases.
 
  Although I've broken it up into three patches to make the review easier,
  it would be best to commit at least the first and third together to
  avoid regressions.  The second can stand alone.
 
  I've done regression tests on powerpc64 and x86_64, and have asked
  Andreas Krebbel to test against the IBM z (390) platform.  I've done
  performance regression testing on powerpc64.  The only performance
  regression of note is the 2% degradation to 188.ammp due to loss of
  field disambiguation information.  As discussed in another thread,
  fixing this introduces more complexity than it's worth.
 
 Are there also performance improvements?  What about code size?

Yes, there are performance improvements.  I've been running cpu2000 on
32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
between 1% and 9%, with 187.facerec showing the largest improvements for
both 32 and 64.

I don't have formal code size results, but anecdotally from code
crawling, I have seen code size either neutral or slightly improved.
The largest code size improvements I've seen were on 32-bit code where
the commoning allowed removal of a number of sign-extend and zero-extend
instructions that were otherwise not seen to be redundant.

 
 I tried to get an understanding of what kind of optimizations this patch
 produces based on the set of testcases you added, but I have a hard
 time here.  Can you outline some please?
 

The primary goal is to clean up code such as is shown in the original
post of PR46556.  In late 2007 there were some changes made to address
canonicalization that caused the code gen to be suboptimal on powerpc64.
At that time you and others suggested a pattern recognizer prior to
expand as probably the best solution, similar to what IVopts is doing.

By using the same mem_ref generation machinery used by IVopts, together
with local CSE, the goal was to ensure base registers are properly
shared so that optimal code is generated, particularly for cases of
shared addressability to structures and arrays.  I also observed cases
where it was useful to extend the sharing across the dominator tree.
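
For illustration (the SSA names, offsets and dump syntax are made up),
this is the kind of sharing the pass is aiming for.  Before lowering,
each reference keeps its own addressing expression:

  _1 = p_2(D)->a[n_3(D)];
  _4 = p_2(D)->c[n_3(D)];

After lowering to mem-refs plus the local CSE, the variable part of the
address is computed once and the constant field offsets stay in the
references:

  off_5  = n_3(D) * 4;
  base_6 = p_2(D) + off_5;
  _1 = MEM[(int *) base_6 + 0B];
  _4 = MEM[(int *) base_6 + 64B];

so a single base register suffices at expansion time.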

Secondarily, I noticed that once this was cleaned up we still had the
suboptimal code generation for the zero-offset mem refs scenario
outlined in the code commentary.  The extra logic to clean this up helps
keep register pressure down.  I've observed some spill code reduction
when this is in place.  Again, using expression availability from
dominating blocks helps here in some cases as well.

 I still do not like the implementation of yet another CSE machinery
 given that we already have two.  I think most of the need for CSE
 comes from the use of the affine combination framework and
 force_gimple_operand.  In fact I'd be interested to see cases that
 are optimized that could not be handled by a combine-like
 pattern matcher?
 

I'm somewhat puzzled by this comment.  Back on 2/4, I posted a skeletal
outline for this pass and asked for comments.  You indicated a concern
that the affine combination expansion would un-CSE a lot of expressions,
and that I should keep track of local candidates during this pass to
ensure that base registers, etc., would be properly shared.  It seemed
to me that a quick and dirty CSE of addressing expressions would be the
best way to handle that.  I posted another RFC on 2/24 with an early
implementation of CSE constrained to a single block, and indicated
shortly thereafter my intent to extend this to a dominator walk.  I
didn't receive any negative comments about using CSE at that time; but
then I didn't get much response at all.

I probably should have pushed harder to get comments at that point, but
I was very new to the community and was feeling my way through the
protocols; I didn't feel comfortable pushing.

In any case, yes, the primary need for CSE is as a result of the affine
combination framework, which creates a large amount of redundant code.
I can certainly take a look at Michael's suggestion to move the pass
earlier and allow pass-dominator to handle the cleanup, and add the
zero-offset mem-ref optimization to that or a subsequent pass.  Do you
agree that this would be a preferable approach?
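
For illustration of the redundancy in question (names made up):
expanding the affine combination of each reference independently
through force_gimple_operand yields something like

  off_1  = n_2(D) * 4;
  base_3 = p_4(D) + off_1;
  _5 = MEM[(int *) base_3 + 0B];
  off_6  = n_2(D) * 4;
  base_7 = p_4(D) + off_6;
  _8 = MEM[(int *) base_7 + 64B];

and the local CSE is there to collapse the duplicated off_*/base_*
computations back into one.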

 Thanks,
 Richard.




Re: [PATCH] Address lowering [1/3] Main patch

2011-07-05 Thread William J. Schmidt
On Mon, 2011-07-04 at 17:30 +0200, Michael Matz wrote:
 Hi,
 
 On Mon, 4 Jul 2011, Richard Guenther wrote:
 
  I still do not like the implementation of yet another CSE machinery
  given that we already have two.
 
 From reading it it really seems to be a normal block-local CSE, without 
 anything fancy.  Hence, moving the pass just a little earlier (before 
 pass_vrp/pass_dominator) should already provide for all optimizations.  If 
 not those should be improved.
 
 I see that it is used for also getting rid of the zero-offset statements 
 in case non-zero-offsets follow.  I think that's generally worthwhile so 
 probably should be done in one of the above optimizers.

OK, thanks for the suggestion.  I can certainly look into this.

 
 You handle NOP_EXPR differently from CONVERT_EXPR.  The middle-end doesn't 
 distinguish between them (yes, ignore the comment about one generating 
 code, the other not).
 

Ah, thanks, good to know.

 Your check for small types:
 
 + if (TYPE_MODE (TREE_TYPE (TREE_OPERAND (expr, 0))) == SImode)
 +   ref_found = true;
 
 You probably want != BLKmode .
 

OK.

 +  if (changed && is_zero_offset_ref (gimple_assign_lhs (stmt)))
 +VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
 +
 +  rhs1 = gimple_assign_rhs1_ptr (stmt);
 +  rhs_changed = tree_ssa_lower_addr_tree (rhs1, gsi, speed, false);
 +
 +  /* Record zero-offset mem_refs on the RHS.  */
 +  if (rhs_changed && is_zero_offset_ref (gimple_assign_rhs1 (stmt)))
 +VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
 
 This possibly adds stmt twice to zero_offset_refs.  Do you really want 
 this?
 

Hm, I didn't think it was (currently) possible for a gimple statement to
have a mem-ref on both RHS and LHS.  Is that incorrect?  This is easily
changed if so, or if the possibility should be left open for the future.

 
 Ciao,
 Michael.

Thanks,
Bill



Re: [PATCH] Address lowering [1/3] Main patch

2011-07-05 Thread Michael Matz
Hi,

On Tue, 5 Jul 2011, William J. Schmidt wrote:

 Hm, I didn't think it was (currently) possible for a gimple statement to 
 have a mem-ref on both RHS and LHS.  Is that incorrect?  This is easily 
 changed if so, or if the possibility should be left open for the future.

Think aggregate copies:

void bla (struct S *d, struct S *s) { *d = *s; }
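
In GIMPLE that aggregate copy is a single assignment with a memory
reference on both operands, roughly (names approximate):

  *d_1(D) = *s_2(D);

so one statement can indeed carry a (zero-offset) mem-ref on both its
LHS and its RHS.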


Ciao,
Michael.


Re: [PATCH] Address lowering [1/3] Main patch

2011-07-05 Thread William J. Schmidt
On Mon, 2011-07-04 at 17:30 +0200, Michael Matz wrote:

 From reading it it really seems to be a normal block-local CSE, without 
 anything fancy.  Hence, moving the pass just a little earlier (before 
 pass_vrp/pass_dominator) should already provide for all optimizations.  If 
 not those should be improved.
 
I did a quick hack-up, and this looks like it will work well, at least
on the few simple examples I threw at it.

 I see that it is used for also getting rid of the zero-offset statements 
 in case non-zero-offsets follow.  I think that's generally worthwhile so 
 probably should be done in one of the above optimizers.
 
I will experiment with adding it to pass_dominator.  Looks like
tree-ssa-dom.c:optimize_stmt is the place to start.

Thanks,
Bill





Re: [PATCH] Address lowering [1/3] Main patch

2011-07-04 Thread Richard Guenther
On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
wschm...@linux.vnet.ibm.com wrote:
 This is the first of three patches related to lowering addressing
 expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
 contains the new pass together with supporting changes in existing
 modules.  The second patch contains an independent change to the RTL
 forward propagator to keep it from undoing an optimization made in the
 first patch.  The third patch contains new test cases and changes to
 existing test cases.

 Although I've broken it up into three patches to make the review easier,
 it would be best to commit at least the first and third together to
 avoid regressions.  The second can stand alone.

 I've done regression tests on powerpc64 and x86_64, and have asked
 Andreas Krebbel to test against the IBM z (390) platform.  I've done
 performance regression testing on powerpc64.  The only performance
 regression of note is the 2% degradation to 188.ammp due to loss of
 field disambiguation information.  As discussed in another thread,
 fixing this introduces more complexity than it's worth.

Are there also performance improvements?  What about code size?

I tried to get an understanding of what kind of optimizations this patch
produces based on the set of testcases you added, but I have a hard
time here.  Can you outline some please?

I still do not like the implementation of yet another CSE machinery
given that we already have two.  I think most of the need for CSE
comes from the use of the affine combination framework and
force_gimple_operand.  In fact I'd be interested to see cases that
are optimized that could not be handled by a combine-like
pattern matcher?

Thanks,
Richard.


Re: [PATCH] Address lowering [1/3] Main patch

2011-07-04 Thread Michael Matz
Hi,

On Mon, 4 Jul 2011, Richard Guenther wrote:

 I still do not like the implementation of yet another CSE machinery
 given that we already have two.

From reading it it really seems to be a normal block-local CSE, without 
anything fancy.  Hence, moving the pass just a little earlier (before 
pass_vrp/pass_dominator) should already provide for all optimizations.  If 
not those should be improved.

I see that it is used for also getting rid of the zero-offset statements 
in case non-zero-offsets follow.  I think that's generally worthwhile so 
probably should be done in one of the above optimizers.

You handle NOP_EXPR differently from CONVERT_EXPR.  The middle-end doesn't 
distinguish between them (yes, ignore the comment about one generating 
code, the other not).

Your check for small types:

+ if (TYPE_MODE (TREE_TYPE (TREE_OPERAND (expr, 0))) == SImode)
+   ref_found = true;

You probably want != BLKmode .

+  if (changed && is_zero_offset_ref (gimple_assign_lhs (stmt)))
+VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
+
+  rhs1 = gimple_assign_rhs1_ptr (stmt);
+  rhs_changed = tree_ssa_lower_addr_tree (rhs1, gsi, speed, false);
+
+  /* Record zero-offset mem_refs on the RHS.  */
+  if (rhs_changed && is_zero_offset_ref (gimple_assign_rhs1 (stmt)))
+VEC_safe_push (gimple, heap, zero_offset_refs, stmt);

This possibly adds stmt twice to zero_offset_refs.  Do you really want 
this?


Ciao,
Michael.