Re: [PATCH] Address lowering [1/3] Main patch
On Tue, 19 Jul 2011, William J. Schmidt wrote:

> I've been distracted by other things, but got back to this today...
>
> On Wed, 2011-07-06 at 16:58 +0200, Richard Guenther wrote:
> > Ah, so we still have the ARRAY_REFs here.  Yeah, well - then the
> > issue boils down to get_inner_reference canonicalizing the offset
> > according to what fold-const.c implements, and we simply emit the
> > offset in that form (if you look at the normal_inner_ref: case in
> > expand_expr_real*).  Thus with the above form at expansion time
> > the matching would be applied there (or during our RTL
> > address-generation in fwprop.c ...).
>
> I put together a simple patch to match the specific pattern from the
> original problem report in expand_expr_real_1.  This returns the code
> generation for that specific pattern to what it was a few years ago,
> but has little effect on SPEC cpu2000 results.  A small handful of
> benchmarks are improved, but most results are in the noise range.  So
> solving the original problem alone doesn't account for very much of
> the improvement we're getting with the full address lowering.
>
> Note that this only converted the basic pattern; I did not attempt to
> do the extra zero-offset mem-ref optimization, which I would have had
> to rewrite for an RTL phase.  I don't see much point in pursuing
> that, given these results.
>
> > Another idea is of course to lower all handled_component_p memory
> > accesses - something I did on the old mem-ref branch and I believe
> > I also suggested at some point.  Without such lowering all the
> > address-generation isn't exposed to the GIMPLE optimizers.  The
> > simplest way of doing the lowering is to do something like
> >
> >   base = get_inner_reference (rhs, ..., bitpos, offset, ...);
> >   base = fold_build2 (POINTER_PLUS_EXPR, ..., base, offset);
> >   base = force_gimple_operand (base, ... is_gimple_mem_ref_addr);
> >   tem = fold_build2 (MEM_REF, TREE_TYPE (rhs), base,
> >                      build_int_cst (get_alias_ptr_type (rhs), bitpos));
> >
> > at some point.  I didn't end up doing that because of the loss of
> > alias information.
>
> My next experiment will be to try something simple along these lines.
> I'll look at the old mem-ref branch to see what you were doing there.
> It will be interesting to see whether the gains generally outweigh
> the small loss of TBAA information, as we saw using the
> affine-combination approach.
>
> As an FYI, this is the patch I used for my experiment.  Let me know
> if you see anything horrifically wrong that might have impacted the
> results.  I didn't make any attempt to figure out whether the pattern
> rewrite is appropriate for the target machine, since this was just an
> experiment.

I wonder if the code below triggered at all: since we expand from SSA,
we no longer see the larger trees in-place; you have to look them up
via SSA defs using get_gimple_for_ssa_name (or the helper
get_def_for_expr).  So I expect that the check for a MULT_EXPR offset
will only trigger if the memory reference was still in the form of a
variable ARRAY_REF (then get_inner_reference will produce an offset of
the form i * sizeof (element)).

I think what we want to catch here is also the case of *(ptr + i),
which will not show up as an ARRAY_REF but as MEM[ptr + 0], where ptr
is an SSA name with a defining statement doing ptr' + i * sizeof
(element).

Richard.

> Thanks, Bill
>
> [patch snipped; it appears in full in the message below]
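(For concreteness, a hypothetical GIMPLE sketch of the *(ptr + i) case
just described, assuming 4-byte elements; this is illustrative, not an
actual dump:

  _1 = i_2 * 4;                   /* i * sizeof (element) */
  ptr_3 = ptr_4(D) + _1;          /* defining statement of the address */
  x_5 = MEM[(int *)ptr_3];        /* zero-offset MEM_REF */

A matcher that only inspects the offset returned by get_inner_reference
for a MULT_EXPR misses this form; the multiply is only reachable through
the SSA definition of ptr_3, e.g. via get_def_for_expr.)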
Re: [PATCH] Address lowering [1/3] Main patch
I've been distracted by other things, but got back to this today...

On Wed, 2011-07-06 at 16:58 +0200, Richard Guenther wrote:
> Ah, so we still have the ARRAY_REFs here.  Yeah, well - then the
> issue boils down to get_inner_reference canonicalizing the offset
> according to what fold-const.c implements, and we simply emit the
> offset in that form (if you look at the normal_inner_ref: case in
> expand_expr_real*).  Thus with the above form at expansion time the
> matching would be applied there (or during our RTL address-generation
> in fwprop.c ...).

I put together a simple patch to match the specific pattern from the
original problem report in expand_expr_real_1.  This returns the code
generation for that specific pattern to what it was a few years ago,
but has little effect on SPEC cpu2000 results.  A small handful of
benchmarks are improved, but most results are in the noise range.  So
solving the original problem alone doesn't account for very much of
the improvement we're getting with the full address lowering.

Note that this only converted the basic pattern; I did not attempt to
do the extra zero-offset mem-ref optimization, which I would have had
to rewrite for an RTL phase.  I don't see much point in pursuing that,
given these results.

> Another idea is of course to lower all handled_component_p memory
> accesses - something I did on the old mem-ref branch and I believe I
> also suggested at some point.  Without such lowering all the
> address-generation isn't exposed to the GIMPLE optimizers.  The
> simplest way of doing the lowering is to do something like
>
>   base = get_inner_reference (rhs, ..., bitpos, offset, ...);
>   base = fold_build2 (POINTER_PLUS_EXPR, ..., base, offset);
>   base = force_gimple_operand (base, ... is_gimple_mem_ref_addr);
>   tem = fold_build2 (MEM_REF, TREE_TYPE (rhs), base,
>                      build_int_cst (get_alias_ptr_type (rhs), bitpos));
>
> at some point.  I didn't end up doing that because of the loss of
> alias information.

My next experiment will be to try something simple along these lines.
I'll look at the old mem-ref branch to see what you were doing there.
It will be interesting to see whether the gains generally outweigh the
small loss of TBAA information, as we saw using the affine-combination
approach.

As an FYI, this is the patch I used for my experiment.  Let me know if
you see anything horrifically wrong that might have impacted the
results.  I didn't make any attempt to figure out whether the pattern
rewrite is appropriate for the target machine, since this was just an
experiment.

Thanks, Bill

Index: gcc/expr.c
===================================================================
--- gcc/expr.c	(revision 176422)
+++ gcc/expr.c	(working copy)
@@ -7152,7 +7152,64 @@ expand_constructor (tree exp, rtx target, enum exp
   return target;
 }
 
+/* Look for specific patterns in the inner reference BASE and its
+   OFFSET expression.  If found, return a single tree of type EXPTYPE
+   that reassociates the addressability to combine constants.  */
+
+static tree
+reassociate_base_and_offset (tree base, tree offset, tree exptype)
+{
+  tree mult_op0, mult_op1, t1, t2;
+  tree mult_expr, add_expr, mem_ref;
+  tree offset_type;
+  HOST_WIDE_INT c1, c2, c3, c;
+
+  /* Currently we look for the following pattern, where Tj is an
+     arbitrary tree, and Ck is an integer constant:
+
+       base:   MEM_REF (T1, C1)
+       offset: MULT_EXPR (PLUS_EXPR (T2, C2), C3)
+
+     This is converted to:
+
+       MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
+                C1 + C2*C3)  */
+
+  if (!base
+      || !offset
+      || TREE_CODE (base) != MEM_REF
+      || TREE_CODE (TREE_OPERAND (base, 1)) != INTEGER_CST
+      || TREE_CODE (offset) != MULT_EXPR)
+    return NULL_TREE;
+
+  mult_op0 = TREE_OPERAND (offset, 0);
+  mult_op1 = TREE_OPERAND (offset, 1);
+
+  if (TREE_CODE (mult_op0) != PLUS_EXPR
+      || TREE_CODE (mult_op1) != INTEGER_CST
+      || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
+    return NULL_TREE;
+
+  t1 = TREE_OPERAND (base, 0);
+  t2 = TREE_OPERAND (mult_op0, 0);
+  c1 = TREE_INT_CST_LOW (TREE_OPERAND (base, 1));
+  c2 = TREE_INT_CST_LOW (TREE_OPERAND (mult_op0, 1));
+  c3 = TREE_INT_CST_LOW (mult_op1);
+  c = c1 + c2 * c3;
+
+  offset_type = TREE_TYPE (TREE_OPERAND (base, 1));
+
+  mult_expr = build2 (MULT_EXPR, sizetype, t2,
+                      build_int_cst (offset_type, c3));
+
+  add_expr = build2 (POINTER_PLUS_EXPR, TREE_TYPE (t1), t1, mult_expr);
+
+  mem_ref = build2 (MEM_REF, exptype, add_expr,
+                    build_int_cst (offset_type, c));
+
+  return mem_ref;
+}
+
+
 /* expand_expr: generate code for computing expression EXP.
    An rtx for the computed value is returned.  The value is never null.
    In the case of a void EXP, const0_rtx is returned.  */
@@ -9034,7 +9091,7 @@ expand_expr_real_1 (tree exp, rtx target, enum mac
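(A worked illustration of the pattern above, using hypothetical source;
this is a sketch of what get_inner_reference would hand back, not
compiler output.  Assuming 4-byte int and an access that is still a
variable ARRAY_REF:

  struct s { int d; int a[64]; };                  /* layout assumed */
  int get (struct s *p, long n) { return p->a[n + 16]; }

  base:   MEM_REF (p, 4)                      /* T1 = p, C1 = 4       */
  offset: MULT_EXPR (PLUS_EXPR (n, 16), 4)    /* T2 = n, C2 = 16,
                                                 C3 = 4               */

reassociate_base_and_offset would rewrite this to

  MEM_REF (POINTER_PLUS_EXPR (p, MULT_EXPR (n, 4)), 68)   /* 4 + 16*4 */

so the constant parts of the address fold into the MEM_REF offset.)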
Re: [PATCH] Address lowering [1/3] Main patch
On Mon, 2011-07-04 at 17:30 +0200, Michael Matz wrote:
> Hi,
>
> On Mon, 4 Jul 2011, Richard Guenther wrote:
> > I still do not like the implementation of yet another CSE machinery
> > given that we already have two.
>
> From reading it it really seems to be a normal block-local CSE,
> without anything fancy.  Hence, moving the pass just a little earlier
> (before pass_vrp/pass_dominator) should already provide for all
> optimizations.  If not, those should be improved.
>
> I see that it is used for also getting rid of the zero-offset
> statements in case non-zero-offsets follow.  I think that's generally
> worthwhile, so it probably should be done in one of the above
> optimizers.

Just FYI, I've verified that this works as expected; the zero-offset
optimization can be moved into the dom2 pass without too much
difficulty, and the CSE in the dom2 pass is sufficient for what I've
seen in my limited testing.  This reduces the size and complexity of
the patch considerably -- thanks!

If it turns out that we end up going this route, it will mean
modifying at least 11 more scan tests whose expected output out of
dom2 changes.  However, I'll be turning my attention now to some of
the alternate implementations that Richard suggested, to see how much
of the gains can be obtained by other means.

Thanks,
Bill
Re: [PATCH] Address lowering [1/3] Main patch
On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
<wschm...@linux.vnet.ibm.com> wrote:

> [...]
>
> > I tried to get an understanding to what kind of optimizations this
> > patch produces based on the set of testcases you added, but I have
> > a hard time here.  Can you outline some please?
>
> The primary goal is to clean up code such as is shown in the original
> post of PR46556.  In late 2007 there were some changes made to
> address canonicalization that caused the code gen to be suboptimal on
> powerpc64.  At that time you and others suggested a pattern
> recognizer prior to expand as probably the best solution, similar to
> what IVopts is doing.

The PR46556 case looks quite simple.

> By using the same mem_ref generation machinery used by IVopts,
> together with local CSE, the goal was to ensure base registers are
> properly shared so that optimal code is generated, particularly for
> cases of shared addressability to structures and arrays.  I also
> observed cases where it was useful to extend the sharing across the
> dominator tree.

As you are doing IV selection per individual statement only, using the
affine combination machinery looks quite a big hammer to me.
Especially as it is hard to imagine what the side-effects are, apart
from re-associating dependencies that do not fit the MEM-REF and
making the MEM-REF as complicated as permitted by the target.  What I
thought originally when suggesting to do something similar to IVOPTs
was to build a list of candidates and uses and optimize that set using
a cost function, similar to how IVOPTs does.

Doing addressing-mode selection locally per statement seems like more
a task for a few pattern matchers, for example in tree-ssa-forwprop.c
(for its last invocation).  One pattern would be that of PR46556,
MEM[(p + ((n + 16)*4))], which we can transform to
TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was a
single-use.  The TARGET_MEM_REF generation can easily re-use the
address-description and target-availability checks from
tree-ssa-address.c.  I would be at least interested in whether
handling the pattern from PR46556 alone (or maybe with a few similar
other cases) is responsible for the performance improvements.  Ideally
we'd of course have a cost driven machinery that considers (part of)
the whole function.

> Secondarily, I noticed that once this was cleaned up we still had the
> suboptimal code generation for the zero-offset mem refs scenario
> outlined in the code commentary.  The extra logic to clean this up
> helps keep register pressure down.  I've observed some spill code
> reduction when this is in place.  Again, using expression
> availability from dominating blocks helps here in some cases as well.

Yeah, the case is quite odd and doesn't really fit existing optimizers
given that the CSE opportunity is hidden within the TARGET_MEM_REF ...

I still do not like the implementation of yet another CSE machinery
given that we already have two.  I think most of the need for CSE
comes from the use of the affine combination framework and
force_gimple_operand.  In fact I'd be interested to see cases that are
optimized that could not be handled by a combine-like pattern matcher?
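(A sketch of that transform on GIMPLE, assuming the offset computation
is single-use and the target supports base + index + displacement
addressing; the TARGET_MEM_REF dump syntax below is approximate:

  /* Before:  MEM[(p + ((n + 16) * 4))]  */
  _1 = n_2 + 16;
  _3 = _1 * 4;
  _4 = p_5 + _3;
  x_6 = MEM[(int *)_4];

  /* After: x = p + n*4, with the constant 16*4 folded into the
     displacement.  */
  _7 = n_2 * 4;
  x_8 = MEM[base: p_5, index: _7, offset: 64];
)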
Re: [PATCH] Address lowering [1/3] Main patch
On Wed, 2011-07-06 at 15:16 +0200, Richard Guenther wrote:

> [...]
>
> The PR46556 case looks quite simple.

It certainly is.  I was personally curious whether there were other
suboptimal sequences that might be hiding out there that a more
general approach might expose.  There was a comment at the end of the
bugzilla about a pass to expose target addressing modes in gimple for
this purpose.  When I first started looking at this, I looked for some
feedback from the community about whether that should be done, and got
a few favorable comments along with one negative one.  So that's how
we got on this road...

> As you are doing IV selection per individual statement only, using
> the affine combination machinery looks quite a big hammer to me.
> Especially as it is hard to imagine what the side-effects are, apart
> from re-associating dependencies that do not fit the MEM-REF and
> making the MEM-REF as complicated as permitted by the target.  What I
> thought originally when suggesting to do something similar to IVOPTs
> was to build a list of candidates and uses and optimize that set
> using a cost function, similar to how IVOPTs does.

OK, reading back I can see that now...

> Doing addressing-mode selection locally per statement seems like more
> a task for a few pattern matchers, for example in tree-ssa-forwprop.c
> (for its last invocation).  One pattern would be that of PR46556,
> MEM[(p + ((n + 16)*4))], which we can transform to
> TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was a
> single-use.  The TARGET_MEM_REF generation can easily re-use the
> address-description and target-availability checks from
> tree-ssa-address.c.  I would be at least interested in whether
> handling the pattern from PR46556 alone (or maybe with a few similar
> other cases) is responsible for the performance improvements.

Hm, but I don't think forwprop sees the code in this form.  At the
time the last pass of forwprop runs, the gimple for the original
problem is:

  D.1997_3 = p_1(D)->a[n_2(D)];
  D.1998_4 = p_1(D)->c[n_2(D)];
  D.1999_5 =
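(For reference, a hypothetical reconstruction of source that produces a
dump of that shape; the field names follow the dump above, everything
else is guesswork:

  struct x { int a[64]; int b[64]; int c[64]; };

  int
  f (struct x *p, long n)
  {
    return p->a[n] + p->c[n];
  }

The member accesses are still ARRAY_REFs when the last forwprop
invocation runs, so a matcher there never sees the
MEM[(p + ((n + 16)*4))] form.)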
Re: [PATCH] Address lowering [1/3] Main patch
On Wed, Jul 6, 2011 at 4:28 PM, William J. Schmidt
<wschm...@linux.vnet.ibm.com> wrote:

> [...]
>
> Hm, but I don't think forwprop sees the code in this form.  At the
> time the last pass of forwprop runs, the gimple for the
Re: [PATCH] Address lowering [1/3] Main patch
(Sorry for the late response; yesterday was a holiday here.)

On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:

> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
> <wschm...@linux.vnet.ibm.com> wrote:
> > [...]
>
> Are there also performance improvements?  What about code size?

Yes, there are performance improvements.  I've been running cpu2000 on
32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
between 1% and 9%, with 187.facerec showing the largest improvements
for both 32 and 64.

I don't have formal code size results, but anecdotally from code
crawling, I have seen code size either neutral or slightly improved.
The largest code size improvements I've seen were on 32-bit code,
where the commoning allowed removal of a number of sign-extend and
zero-extend instructions that were otherwise not seen to be redundant.

> I tried to get an understanding to what kind of optimizations this
> patch produces based on the set of testcases you added, but I have a
> hard time here.  Can you outline some please?

The primary goal is to clean up code such as is shown in the original
post of PR46556.  In late 2007 there were some changes made to address
canonicalization that caused the code gen to be suboptimal on
powerpc64.  At that time you and others suggested a pattern recognizer
prior to expand as probably the best solution, similar to what IVopts
is doing.

By using the same mem_ref generation machinery used by IVopts,
together with local CSE, the goal was to ensure base registers are
properly shared so that optimal code is generated, particularly for
cases of shared addressability to structures and arrays.  I also
observed cases where it was useful to extend the sharing across the
dominator tree.

Secondarily, I noticed that once this was cleaned up we still had the
suboptimal code generation for the zero-offset mem refs scenario
outlined in the code commentary.  The extra logic to clean this up
helps keep register pressure down.  I've observed some spill code
reduction when this is in place.  Again, using expression availability
from dominating blocks helps here in some cases as well.

> I still do not like the implementation of yet another CSE machinery
> given that we already have two.  I think most of the need for CSE
> comes from the use of the affine combination framework and
> force_gimple_operand.  In fact I'd be interested to see cases that
> are optimized that could not be handled by a combine-like pattern
> matcher?

I'm somewhat puzzled by this comment.  Back on 2/4, I posted a
skeletal outline for this pass and asked for comments.  You indicated
a concern that the affine combination expansion would un-CSE a lot of
expressions, and that I should keep track of local candidates during
this pass to ensure that base registers, etc., would be properly
shared.  It seemed to me that a quick and dirty CSE of addressing
expressions would be the best way to handle that.

I posted another RFC on 2/24 with an early implementation of CSE
constrained to a single block, and indicated shortly thereafter my
intent to extend this to a dominator walk.  I didn't receive any
negative comments about using CSE at that time; but then I didn't get
much response at all.  I probably should have pushed harder to get
comments at that point, but I was very new to the community and was
feeling my way through the protocols; I didn't feel comfortable
pushing.

In any case, yes, the primary need for CSE comes from the affine
combination framework, which creates a large amount of redundant code.
I can certainly take a look at Michael's suggestion to move the pass
earlier and allow pass_dominator to handle the cleanup, and add the
zero-offset mem-ref optimization to that or a subsequent pass.  Do you
agree that this would be a preferable approach?

> Thanks,
> Richard.
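(To illustrate the redundancy in question, a hypothetical sketch of two
lowered references before any CSE; the names, offsets, and dump syntax
are made up:

  /* p->a[i] and p->b[i], each expanded from its own affine
     combination.  */
  _1 = i_2 * 4;
  _3 = p_4 + _1;
  x_5 = MEM[(int *)_3];
  _6 = i_2 * 4;                 /* duplicates _1 */
  _7 = p_4 + _6;                /* duplicates _3 */
  y_8 = MEM[(int *)_7 + 256B];

A local CSE, or letting pass_dominator run after an earlier placement
of the pass, removes the duplicated address computations.)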
Re: [PATCH] Address lowering [1/3] Main patch
On Mon, 2011-07-04 at 17:30 +0200, Michael Matz wrote:

> Hi,
>
> On Mon, 4 Jul 2011, Richard Guenther wrote:
> > I still do not like the implementation of yet another CSE machinery
> > given that we already have two.
>
> From reading it it really seems to be a normal block-local CSE,
> without anything fancy.  Hence, moving the pass just a little earlier
> (before pass_vrp/pass_dominator) should already provide for all
> optimizations.  If not, those should be improved.
>
> I see that it is used for also getting rid of the zero-offset
> statements in case non-zero-offsets follow.  I think that's generally
> worthwhile, so it probably should be done in one of the above
> optimizers.

OK, thanks for the suggestion.  I can certainly look into this.

> You handle NOP_EXPR different from CONVERT_EXPR.  The middle-end
> doesn't distinguish between them (yes, ignore the comment about one
> generating code, the other not).

Ah, thanks, good to know.

> Your check for small types:
>
>   +  if (TYPE_MODE (TREE_TYPE (TREE_OPERAND (expr, 0))) == SImode)
>   +    ref_found = true;
>
> You probably want != BLKmode.

OK.

>   +  if (changed && is_zero_offset_ref (gimple_assign_lhs (stmt)))
>   +    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
>   +
>   +  rhs1 = gimple_assign_rhs1_ptr (stmt);
>   +  rhs_changed = tree_ssa_lower_addr_tree (rhs1, gsi, speed, false);
>   +
>   +  /* Record zero-offset mem_refs on the RHS.  */
>   +  if (rhs_changed && is_zero_offset_ref (gimple_assign_rhs1 (stmt)))
>   +    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
>
> This possibly adds stmt twice to zero_offset_refs.  Do you really
> want this?

Hm, I didn't think it was (currently) possible for a gimple statement
to have a mem-ref on both RHS and LHS.  Is that incorrect?  This is
easily changed if so, or if the possibility should be left open for
the future.

> Ciao,
> Michael.

Thanks,
Bill
Re: [PATCH] Address lowering [1/3] Main patch
Hi,

On Tue, 5 Jul 2011, William J. Schmidt wrote:

> Hm, I didn't think it was (currently) possible for a gimple statement
> to have a mem-ref on both RHS and LHS.  Is that incorrect?  This is
> easily changed if so, or if the possibility should be left open for
> the future.

Think aggregate copies:

  void bla (struct S *d, struct S *s)
  {
    *d = *s;
  }

Ciao,
Michael.
Re: [PATCH] Address lowering [1/3] Main patch
On Mon, 2011-07-04 at 17:30 +0200, Michael Matz wrote:

> From reading it it really seems to be a normal block-local CSE,
> without anything fancy.  Hence, moving the pass just a little earlier
> (before pass_vrp/pass_dominator) should already provide for all
> optimizations.  If not, those should be improved.

I did a quick hack-up, and this looks like it will work well, at least
on the few simple examples I threw at it.

> I see that it is used for also getting rid of the zero-offset
> statements in case non-zero-offsets follow.  I think that's generally
> worthwhile, so it probably should be done in one of the above
> optimizers.

I will experiment with adding it to pass_dominator.  Looks like
tree-ssa-dom.c:optimize_stmt is the place to start.

Thanks,
Bill
Re: [PATCH] Address lowering [1/3] Main patch
On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
<wschm...@linux.vnet.ibm.com> wrote:

> This is the first of three patches related to lowering addressing
> expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This
> patch contains the new pass together with supporting changes in
> existing modules.  The second patch contains an independent change to
> the RTL forward propagator to keep it from undoing an optimization
> made in the first patch.  The third patch contains new test cases and
> changes to existing test cases.
>
> Although I've broken it up into three patches to make the review
> easier, it would be best to commit at least the first and third
> together to avoid regressions.  The second can stand alone.
>
> I've done regression tests on powerpc64 and x86_64, and have asked
> Andreas Krebbel to test against the IBM z (390) platform.  I've done
> performance regression testing on powerpc64.  The only performance
> regression of note is the 2% degradation to 188.ammp due to loss of
> field disambiguation information.  As discussed in another thread,
> fixing this introduces more complexity than it's worth.

Are there also performance improvements?  What about code size?

I tried to get an understanding to what kind of optimizations this
patch produces based on the set of testcases you added, but I have a
hard time here.  Can you outline some please?

I still do not like the implementation of yet another CSE machinery
given that we already have two.  I think most of the need for CSE
comes from the use of the affine combination framework and
force_gimple_operand.  In fact I'd be interested to see cases that are
optimized that could not be handled by a combine-like pattern matcher?

Thanks,
Richard.
Re: [PATCH] Address lowering [1/3] Main patch
Hi,

On Mon, 4 Jul 2011, Richard Guenther wrote:

> I still do not like the implementation of yet another CSE machinery
> given that we already have two.

From reading it it really seems to be a normal block-local CSE,
without anything fancy.  Hence, moving the pass just a little earlier
(before pass_vrp/pass_dominator) should already provide for all
optimizations.  If not, those should be improved.

I see that it is used for also getting rid of the zero-offset
statements in case non-zero-offsets follow.  I think that's generally
worthwhile, so it probably should be done in one of the above
optimizers.

You handle NOP_EXPR different from CONVERT_EXPR.  The middle-end
doesn't distinguish between them (yes, ignore the comment about one
generating code, the other not).

Your check for small types:

  +  if (TYPE_MODE (TREE_TYPE (TREE_OPERAND (expr, 0))) == SImode)
  +    ref_found = true;

You probably want != BLKmode.

  +  if (changed && is_zero_offset_ref (gimple_assign_lhs (stmt)))
  +    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
  +
  +  rhs1 = gimple_assign_rhs1_ptr (stmt);
  +  rhs_changed = tree_ssa_lower_addr_tree (rhs1, gsi, speed, false);
  +
  +  /* Record zero-offset mem_refs on the RHS.  */
  +  if (rhs_changed && is_zero_offset_ref (gimple_assign_rhs1 (stmt)))
  +    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);

This possibly adds stmt twice to zero_offset_refs.  Do you really want
this?

Ciao,
Michael.
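(For clarity, Michael's suggestion applied to that check would read
roughly as follows; a sketch against the posted patch, not tested:

  /* Accept any reference whose type has a non-BLK machine mode,
     instead of only SImode.  */
  if (TYPE_MODE (TREE_TYPE (TREE_OPERAND (expr, 0))) != BLKmode)
    ref_found = true;
)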