(Sorry for the late response; yesterday was a holiday here.)

On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
> <wschm...@linux.vnet.ibm.com> wrote:
> > This is the first of three patches related to lowering addressing
> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
> > contains the new pass together with supporting changes in existing
> > modules.  The second patch contains an independent change to the RTL
> > forward propagator to keep it from undoing an optimization made in the
> > first patch.  The third patch contains new test cases and changes to
> > existing test cases.
> >
> > Although I've broken it up into three patches to make the review easier,
> > it would be best to commit at least the first and third together to
> > avoid regressions.  The second can stand alone.
> >
> > I've done regression tests on powerpc64 and x86_64, and have asked
> > Andreas Krebbel to test against the IBM z (390) platform.  I've done
> > performance regression testing on powerpc64.  The only performance
> > regression of note is the 2% degradation to 188.ammp due to loss of
> > field disambiguation information.  As discussed in another thread,
> > fixing this introduces more complexity than it's worth.
> 
> Are there also performance improvements?  What about code size?

Yes, there are performance improvements.  I've been running cpu2000 on
32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
between 1% and 9%, with 187.facerec showing the largest improvement for
both 32-bit and 64-bit.

I don't have formal code size results, but anecdotally from code
crawling, I have seen code size either neutral or slightly improved.
The largest code size improvements I've seen were in 32-bit code, where
the commoning allowed removal of a number of sign-extend and zero-extend
instructions that were not otherwise recognized as redundant.

> 
> I tried to get an understanding to what kind of optimizations this patch
> produces based on the set of testcases you added, but I have a hard
> time here.  Can you outline some please?
> 

The primary goal is to clean up code such as that shown in the original
post of PR46556.  In late 2007, some changes to address canonicalization
caused suboptimal code generation on powerpc64.
At that time you and others suggested a pattern recognizer prior to
expand as probably the best solution, similar to what IVopts is doing.

By reusing the mem_ref generation machinery from IVopts, together
with local CSE, the goal was to ensure base registers are properly
shared so that optimal code is generated, particularly for cases of
shared addressability to structures and arrays.  I also observed cases
where it was useful to extend the sharing across the dominator tree.

Secondarily, I noticed that once this was cleaned up, we still had
suboptimal code generation for the zero-offset mem refs scenario
outlined in the code commentary.  The extra logic to clean this up helps
keep register pressure down.  I've observed some spill code reduction
when this is in place.  Again, using expression availability from
dominating blocks helps here in some cases as well.

> I still do not like the implementation of yet another CSE machinery
> given that we already have two.  I think most of the need for CSE
> comes from the use of the affine combination framework and
> force_gimple_operand.  In fact I'd be interested to see cases that
> are optimized that could not be handled by a combine-like
> pattern matcher?
> 

I'm somewhat puzzled by this comment.  Back on 2/4, I posted a skeletal
outline for this pass and asked for comments.  You indicated a concern
that the affine combination expansion would un-CSE a lot of expressions,
and that I should keep track of local candidates during this pass to
ensure that base registers, etc., would be properly shared.  It seemed
to me that a quick and dirty CSE of addressing expressions would be the
best way to handle that.  I posted another RFC on 2/24 with an early
implementation of CSE constrained to a single block, and indicated
shortly thereafter my intent to extend this to a dominator walk.  I
didn't receive any negative comments about using CSE at that time, but
then I didn't get much response at all.

I probably should have pushed harder to get comments at that point, but
I was very new to the community and was feeling my way through the
protocols; I didn't feel comfortable pushing.

In any case, yes, the primary need for CSE is as a result of the affine
combination framework, which creates a large amount of redundant code.
I can certainly take a look at Michael's suggestion to move the pass
earlier and allow pass-dominator to handle the cleanup, and add the
zero-offset mem-ref optimization to that or a subsequent pass.  Do you
agree that this would be a preferable approach?

> Thanks,
> Richard.

