(Sorry for the late response; yesterday was a holiday here.)

On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
> <wschm...@linux.vnet.ibm.com> wrote:
> > This is the first of three patches related to lowering addressing
> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This
> > patch contains the new pass together with supporting changes in
> > existing modules.  The second patch contains an independent change to
> > the RTL forward propagator to keep it from undoing an optimization
> > made in the first patch.  The third patch contains new test cases and
> > changes to existing test cases.
> >
> > Although I've broken it up into three patches to make the review
> > easier, it would be best to commit at least the first and third
> > together to avoid regressions.  The second can stand alone.
> >
> > I've done regression tests on powerpc64 and x86_64, and have asked
> > Andreas Krebbel to test against the IBM z (390) platform.  I've done
> > performance regression testing on powerpc64.  The only performance
> > regression of note is the 2% degradation to 188.ammp due to loss of
> > field disambiguation information.  As discussed in another thread,
> > fixing this introduces more complexity than it's worth.
>
> Are there also performance improvements?  What about code size?
Yes, there are performance improvements.  I've been running cpu2000 on
32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
between 1% and 9%, with 187.facerec showing the largest improvement for
both 32- and 64-bit.

I don't have formal code size results, but anecdotally, from code
crawling I have seen code size be either neutral or slightly improved.
The largest code size improvements I've seen were in 32-bit code, where
the commoning allowed removal of a number of sign-extend and zero-extend
instructions that were not otherwise seen to be redundant.

> I tried to get an understanding of what kind of optimizations this
> patch produces based on the set of testcases you added, but I have a
> hard time here.  Can you outline some please?

The primary goal is to clean up code such as is shown in the original
post of PR46556.  In late 2007, some changes were made to address
canonicalization that caused the generated code to be suboptimal on
powerpc64.  At that time you and others suggested a pattern recognizer
prior to expand as probably the best solution, similar to what IVopts is
doing.  By using the same mem_ref generation machinery used by IVopts,
together with local CSE, the goal was to ensure that base registers are
properly shared so that optimal code is generated, particularly for
cases of shared addressability to structures and arrays.  I also
observed cases where it was useful to extend the sharing across the
dominator tree.

Secondarily, I noticed that once this was cleaned up we still had
suboptimal code generation for the zero-offset mem refs scenario
outlined in the code commentary.  The extra logic to clean this up helps
keep register pressure down; I've observed some spill code reduction
when it is in place.  Again, using expression availability from
dominating blocks helps here in some cases as well.

> I still do not like the implementation of yet another CSE machinery
> given that we already have two.
> I think most of the need for CSE comes from the use of the affine
> combination framework and force_gimple_operand.  In fact I'd be
> interested to see cases that are optimized that could not be handled
> by a combine-like pattern matcher?

I'm somewhat puzzled by this comment.  Back on 2/4, I posted a skeletal
outline for this pass and asked for comments.  You indicated a concern
that the affine combination expansion would un-CSE a lot of expressions,
and that I should keep track of local candidates during this pass to
ensure that base registers, etc., would be properly shared.  It seemed
to me that a quick-and-dirty CSE of addressing expressions would be the
best way to handle that.  I posted another RFC on 2/24 with an early
implementation of CSE constrained to a single block, and indicated
shortly thereafter my intent to extend this to a dominator walk.  I
didn't receive any negative comments about using CSE at that time, but
then I didn't get much response at all.  I probably should have pushed
harder for comments at that point, but I was very new to the community
and still feeling my way through the protocols, so I didn't feel
comfortable pushing.

In any case, yes, the primary need for CSE is a result of the affine
combination framework, which creates a large amount of redundant code.
I can certainly look at Michael's suggestion to move the pass earlier
and allow pass_dominator to handle the cleanup, and add the zero-offset
mem-ref optimization to that or a subsequent pass.  Do you agree that
this would be a preferable approach?

> Thanks,
> Richard.