[PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
This is a new version of the patch on "nested FMA". Sorry for updating this
after so long; I've been studying and writing micro cases to sort out the
cause of the regression.

First, following the previous discussion
(https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html):

1. From testing more altered cases, I don't think the problem is that
   reassociation works locally. In that:

   1) On the example with multiplications:

	tmp1 = a + c * c + d * d + x * y;
	tmp2 = x * tmp1;
	result += (a + c + d + tmp2);

      Given "result" rewritten by width=2, the performance is worse if we
      rewrite "tmp1" with width=2. In contrast, if we remove the
      multiplications from the example (and make "tmp1" not single-use),
      and still rewrite "result" by width=2, then rewriting "tmp1" with
      width=2 is better. (This makes sense, because the tree's depth at
      "result" is still smaller if we rewrite "tmp1".)

   2) I tried to modify the assembly code of the example without FMA, so
      the width of "result" is 4. On Ampere1 there's no obvious
      improvement. So although this is an interesting problem, it doesn't
      seem to be the cause of the regression.

2. From the assembly code of the case with FMA, one problem is that
   rewriting "tmp1" to parallel didn't decrease the minimum CPU cycles
   (taking MULT_EXPRs into account), but increased code size, so the
   overhead increased.

   a) When "tmp1" is not rewritten to parallel:

	fmadd d31, d2, d2, d30
	fmadd d31, d3, d3, d31
	fmadd d31, d4, d5, d31  //"tmp1"
	fmadd d31, d31, d4, d3

   b) When "tmp1" is rewritten to parallel:

	fmul  d31, d4, d5
	fmadd d27, d2, d2, d30
	fmadd d31, d3, d3, d31
	fadd  d31, d31, d27     //"tmp1"
	fmadd d31, d31, d4, d3

   For version a), there are 3 dependent FMAs to calculate "tmp1". For
   version b), there are also 3 dependent instructions on the longer path:
   the 1st, 3rd and 4th. So it seems to me the current
   get_reassociation_width algorithm isn't optimal in the presence of FMA.
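For reference, the micro case above can be written as a compilable kernel. This is a minimal sketch only: the function name, the iteration scheme, and the update of "a" are illustrative additions, not taken from the actual testcase.

```c
#include <assert.h>

/* A sketch of the kernel shape discussed above: "tmp1" feeds a
   multiplication ("tmp2"), so breaking its op list into parallel form
   adds an FADD without shortening the chain of dependent FMAs.  */
double
kernel (double a, double c, double d, double x, double y, int n)
{
  double result = 0.0;
  for (int i = 0; i < n; i++)
    {
      double tmp1 = a + c * c + d * d + x * y;
      double tmp2 = x * tmp1;
      result += a + c + d + tmp2;
      a += 0.5;	/* keep the iterations from being constant-folded */
    }
  return result;
}
```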
So I modified the patch to improve get_reassociation_width, rather than
check for code patterns. (Although there could be some other complicated
factors that make the regression more obvious when there's "nested FMA",
with this patch that should be avoided or reduced.)

With this patch the 508.namd_r 1-copy run has a 7% improvement on Ampere1,
and about 3% on Intel Xeon. While I'm still collecting data on other CPUs,
I'd like to know what you think of this.

About the changes in the patch:

1. When the op list forms a complete FMA chain, try to search for a
   smaller width, considering the benefit of using FMA. With a smaller
   width, the increase in code size is smaller when breaking the chain.

2. To avoid regressions, included the other patch
   (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
   on this tracker again. This is because more FMAs will be kept with 1.,
   so we need to rule out the loop-dependent FMA chains when
   param_avoid_fma_max_bits is set.

Thanks,
Di Zhao

	PR tree-optimization/110279

gcc/ChangeLog:

	* tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p): New
	function to check whether ranking the ops results in better
	parallelism.
	(get_reassociation_width): Add new parameters. Search for
	smaller width considering the benefit of FMA.
	(rank_ops_for_fma): Change return value to be number of
	MULT_EXPRs.
	(reassociate_bb): For 3 ops, refine the condition to call
	swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

	* gcc.dg/pr110279.c: New test.

Attachment: 0001-Consider-FMA-in-get_reassociation_width.patch
RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
> -----Original Message-----
> From: Richard Biener
> Sent: Thursday, August 31, 2023 8:23 PM
> To: Di Zhao OS
> Cc: Jeff Law ; Martin Jambor ; gcc-patc...@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
>
> On Wed, Aug 30, 2023 at 11:33 AM Di Zhao OS wrote:
> >
> > Hello Richard,
> >
> > > -----Original Message-----
> > > From: Richard Biener
> > > Sent: Tuesday, August 29, 2023 7:11 PM
> > > To: Di Zhao OS
> > > Cc: Jeff Law ; Martin Jambor ; gcc-patc...@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> > > reassoc to reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener
> > > > > Sent: Tuesday, August 29, 2023 4:09 PM
> > > > > To: Di Zhao OS
> > > > > Cc: Jeff Law ; Martin Jambor ; gcc-patc...@gcc.gnu.org
> > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> > > > > reassoc to reduce cross backedge FMA
> > > > >
> > > > > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Richard Biener
> > > > > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > > > > To: Jeff Law ; Martin Jambor
> > > > > > > Cc: Di Zhao OS ; gcc-patc...@gcc.gnu.org
> > > > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands
> > > > > > > in reassoc to reduce cross backedge FMA
> > > > > > >
> > > > > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches wrote:
> > > > > > > >
> > > > > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > > > > >
> > > > > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > > > > swap the operands so that more FMAs can be generated.
> > > > > > > > > After widening_mul the result looks like:
> > > > > > > > >
> > > > > > > > >    _1 = .FMA(a, b, acc_0);
> > > > > > > > >    acc_1 = .FMA(c, d, _1);
> > > > > > > > >
> > > > > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > > > > widening_mul's result was like:
> > > > > > > > >
> > > > > > > > >    _1 = a * b;
> > > > > > > > >    _2 = .FMA (c, d, _1);
> > > > > > > > >    acc_1 = acc_0 + _2;
> > > > > > >
> > > > > > > How can we execute the multiply and the FMA in parallel? They
> > > > > > > depend on each other. Or is it the uarch can handle dependence
> > > > > > > on the add operand but only when it is with a multiplication and
> > > > > > > not a FMA in some better ways? (I'd doubt so much complexity)
> > > > > > >
> > > > > > > Can you explain in more detail how the uarch executes one vs. the
> > > > > > > other case?
> >
> > Here's my understanding after consulting our hardware team. For the
> > second case, the uarch of some out-of-order processors can calculate
> > "_2" of several loops at the same time, since there's no dependency
> > among different iterations. While for the first case the next iteration
> > has to wait for the current iteration to finish, so "acc_0"'s value is
> > known. I assume it is also the case in some i386 processors, since I
> > saw the patch "Deferring FMA transformations in tight loops" also
> > changed corresponding files.
RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
Hello Richard,

> -----Original Message-----
> From: Richard Biener
> Sent: Tuesday, August 29, 2023 7:11 PM
> To: Di Zhao OS
> Cc: Jeff Law ; Martin Jambor ; gcc-patc...@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
>
> On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Richard Biener
> > > Sent: Tuesday, August 29, 2023 4:09 PM
> > > To: Di Zhao OS
> > > Cc: Jeff Law ; Martin Jambor ; gcc-patc...@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> > > reassoc to reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener
> > > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > > To: Jeff Law ; Martin Jambor
> > > > > Cc: Di Zhao OS ; gcc-patc...@gcc.gnu.org
> > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> > > > > reassoc to reduce cross backedge FMA
> > > > >
> > > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches wrote:
> > > > > >
> > > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > > >
> > > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > > swap the operands so that more FMAs can be generated.
> > > > > > > After widening_mul the result looks like:
> > > > > > >
> > > > > > >    _1 = .FMA(a, b, acc_0);
> > > > > > >    acc_1 = .FMA(c, d, _1);
> > > > > > >
> > > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > > widening_mul's result was like:
> > > > > > >
> > > > > > >    _1 = a * b;
> > > > > > >    _2 = .FMA (c, d, _1);
> > > > > > >    acc_1 = acc_0 + _2;
> > > > >
> > > > > How can we execute the multiply and the FMA in parallel? They
> > > > > depend on each other. Or is it the uarch can handle dependence
> > > > > on the add operand but only when it is with a multiplication and
> > > > > not a FMA in some better ways? (I'd doubt so much complexity)
> > > > >
> > > > > Can you explain in more detail how the uarch executes one vs. the
> > > > > other case?
> >
> > Here's my understanding after consulting our hardware team. For the
> > second case, the uarch of some out-of-order processors can calculate
> > "_2" of several loops at the same time, since there's no dependency
> > among different iterations. While for the first case the next iteration
> > has to wait for the current iteration to finish, so "acc_0"'s value is
> > known. I assume it is also the case in some i386 processors, since I
> > saw the patch "Deferring FMA transformations in tight loops" also
> > changed corresponding files.
>
> That should be true for all kind of operations, no? Thus it means
> reassoc should in general associate cross-iteration accumulation

Yes I think both are true.

> last? Historically we associated those first because that's how the
> vectorizer liked to see them, but I think that's no longer necessary.
>
> It should be achievable by properly biasing the operand during
> rank computation (don't we already do that?).

The issue is related to the following code (handling cases with three
operands left):

  /* When there are three operands left, we want to make sure the ones
     that get the double binary op are chosen wisely.  */
  int len = ops.length ();
  if (len >= 3 && !has_fma)
    swap_ops_for_binary_stmt (ops, len - 3);
RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
Hi,

> -----Original Message-----
> From: Richard Biener
> Sent: Tuesday, August 29, 2023 4:09 PM
> To: Di Zhao OS
> Cc: Jeff Law ; Martin Jambor ; gcc-patc...@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
>
> On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Richard Biener
> > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > To: Jeff Law ; Martin Jambor
> > > Cc: Di Zhao OS ; gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> > > reassoc to reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches wrote:
> > > >
> > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > >
> > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > swap the operands so that more FMAs can be generated.
> > > > > After widening_mul the result looks like:
> > > > >
> > > > >    _1 = .FMA(a, b, acc_0);
> > > > >    acc_1 = .FMA(c, d, _1);
> > > > >
> > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > widening_mul's result was like:
> > > > >
> > > > >    _1 = a * b;
> > > > >    _2 = .FMA (c, d, _1);
> > > > >    acc_1 = acc_0 + _2;
> > >
> > > How can we execute the multiply and the FMA in parallel? They
> > > depend on each other. Or is it the uarch can handle dependence
> > > on the add operand but only when it is with a multiplication and
> > > not a FMA in some better ways? (I'd doubt so much complexity)
> > >
> > > Can you explain in more detail how the uarch executes one vs. the
> > > other case?

Here's my understanding after consulting our hardware team. For the
second case, the uarch of some out-of-order processors can calculate
"_2" of several loops at the same time, since there's no dependency
among different iterations. While for the first case the next iteration
has to wait for the current iteration to finish, so "acc_0"'s value is
known. I assume it is also the case in some i386 processors, since I
saw the patch "Deferring FMA transformations in tight loops" also
changed corresponding files.

> > > > > If the code fragment is in a loop, some architectures can execute
> > > > > the latter in parallel, so the performance can be much faster than
> > > > > the former. For the small testcase, the performance gap is over
> > > > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > > > turning the last statement into an FMA, and keep it a PLUS_EXPR as
> > > > > much as possible. (If we are rewriting the op list into parallel,
> > > > > no special treatment is needed, since the last statement after
> > > > > rewrite_expr_tree_parallel will be a PLUS_EXPR anyway.)
> > > > >
> > > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > state of FMA candidates.
> > > > >
> > > > > With this patch, there's a 3% improvement in the 510.parest_r
> > > > > 1-copy run on ampere1. The compile options are:
> > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > >
> > > > > Best regards,
> > > > > Di Zhao
> > > > >
> > > > > PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > 	* tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > > 	describe the state of FMA candidates for an op list.
> > > > > 	(rewrite_expr_tree_parallel): Changed boolean
> > > > > 	parameter to enum type.
RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
Hi,

> -----Original Message-----
> From: Richard Biener
> Sent: Tuesday, August 29, 2023 3:41 PM
> To: Jeff Law ; Martin Jambor
> Cc: Di Zhao OS ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
>
> On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches wrote:
> >
> > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > This patch tries to fix the 2% regression in 510.parest_r on
> > > ampere1 in the tracker. (Previous discussion is here:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > >
> > > 1. Add testcases for the problem. For an op list in the form of
> > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > swap the operands so that more FMAs can be generated.
> > > After widening_mul the result looks like:
> > >
> > >    _1 = .FMA(a, b, acc_0);
> > >    acc_1 = .FMA(c, d, _1);
> > >
> > > While previously (before the "Handle FMA friendly..." patch),
> > > widening_mul's result was like:
> > >
> > >    _1 = a * b;
> > >    _2 = .FMA (c, d, _1);
> > >    acc_1 = acc_0 + _2;
>
> How can we execute the multiply and the FMA in parallel? They
> depend on each other. Or is it the uarch can handle dependence
> on the add operand but only when it is with a multiplication and
> not a FMA in some better ways? (I'd doubt so much complexity)
>
> Can you explain in more detail how the uarch executes one vs. the
> other case?
>
> > > If the code fragment is in a loop, some architectures can execute
> > > the latter in parallel, so the performance can be much faster than
> > > the former. For the small testcase, the performance gap is over
> > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > turning the last statement into an FMA, and keep it a PLUS_EXPR as
> > > much as possible. (If we are rewriting the op list into parallel,
> > > no special treatment is needed, since the last statement after
> > > rewrite_expr_tree_parallel will be a PLUS_EXPR anyway.)
> > >
> > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > backedge dependency. Added new enum fma_state to describe the
> > > state of FMA candidates.
> > >
> > > With this patch, there's a 3% improvement in the 510.parest_r
> > > 1-copy run on ampere1. The compile options are:
> > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > >
> > > Best regards,
> > > Di Zhao
> > >
> > > PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > 	describe the state of FMA candidates for an op list.
> > > 	(rewrite_expr_tree_parallel): Changed boolean
> > > 	parameter to enum type.
> > > 	(result_feeds_back_from_phi_p): New function to check
> > > 	for cross backedge dependency.
> > > 	(rank_ops_for_fma): Return enum fma_state. Added new
> > > 	parameter.
> > > 	(reassociate_bb): If there's backedge dependency in an
> > > 	op list, swap the operands before rewrite_expr_tree.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > 	* gcc.dg/pr110279.c: New test.
> >
> > Not a review, but more of a question -- isn't this transformation's
> > profitability uarch sensitive. ie, just because it's bad for a set of
> > aarch64 uarches, doesn't mean it's bad everywhere.
> >
> > And in general we shy away from trying to adjust gimple code based on
> > uarch preferences.
> >
> > It seems the right place to do this is gimple->rtl expansion.
>
> Another comment is that FMA forming has this deferring code which I
> think deals exactly with this kind of thing? CCing Martin who did this
> work based on AMD uarchs also not wanting cross-loop dependences
> on FMAs (or so). In particular I see
>
>   if (fma_state.m_deferring_p
>       && fma_state.m_initial_phi)
>     {
>       gcc_checking_assert (fma_state.m_last_result);
>       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
>                                                  &m_last_result_set))
>         cancel_fma_deferring (&fma_state);
>
> and I think code to avoid FMAs in other/related cases should be here
> as well, like avoid forming back-to-back FMAs.

The changes in this patch are controlled by "param_avoid_fma_max_bits",
so I think it should only affect architectures with similar behavior.
(The parameter was added in a previous patch "Deferring FMA
transformations in tight loops", which seems to be dealing with the
same issue.)

> Richard.
>
> > Jeff

Thanks,
Di Zhao
[PATCH] alias-analyis: try to find ADDR_EXPR for SSA_NAME ptr
This patch tries to improve alias analysis between an SSA_NAME and a
declaration a little. For a case like:

  int array1[10], array2[10];
  ptr1 = array1 + x;
  ptr2 = ptr1 + y;

, *ptr2 should not alias with array2. If we can't disambiguate from
points-to information, this patch tries to find a determined ADDR_EXPR
following the defining statements and then disambiguate by
compare_base_decls.

On spec2017 502.gcc, several thousand new non-aliasing relations are
found. (No obvious improvements or regressions found, though.)

Bootstrapped and tested on aarch64-unknown-linux-gnu.

Thanks,
Di Zhao

gcc/ChangeLog:

	* tree-ssa-alias.cc (ptr_deref_may_alias_decl_p): Try to find
	ADDR_EXPR for SSA_NAME ptrs.

---
 gcc/tree-ssa-alias.cc | 42 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 34 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index cf38fe506a8..a6fe1e7b227 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -271,6 +271,38 @@ ptr_deref_may_alias_decl_p (tree ptr, tree decl)
       return ptr_deref_may_alias_decl_p (ptr, decl);
     }
 
+  if (TREE_CODE (ptr) == SSA_NAME)
+    {
+      /* First disambiguate from points-to information.  */
+      pi = SSA_NAME_PTR_INFO (ptr);
+      if (pi && !pt_solution_includes (&pi->pt, decl))
+	return false;
+
+      /* Try to find an ADDR_EXPR by walking the defining statements, so we can
+	 probably disambiguate from compare_base_decls.  */
+      gimple *def_stmt;
+      while ((def_stmt = SSA_NAME_DEF_STMT (ptr)))
+	{
+	  if (is_gimple_assign (def_stmt)
+	      && gimple_assign_rhs_code (def_stmt) == POINTER_PLUS_EXPR)
+	    {
+	      ptr = gimple_assign_rhs1 (def_stmt);
+	      if (TREE_CODE (ptr) != SSA_NAME)
+		break;
+	    }
+	  /* See if we can find a certain defining source.  */
+	  else if (gimple_code (def_stmt) == GIMPLE_PHI
+		   && gimple_phi_num_args (def_stmt) == 1)
+	    {
+	      ptr = PHI_ARG_DEF (def_stmt, 0);
+	      if (TREE_CODE (ptr) != SSA_NAME)
+		break;
+	    }
+	  else
+	    break;
+	}
+    }
+
   /* ADDR_EXPR pointers either just offset another pointer or directly
      specify the pointed-to set.  */
   if (TREE_CODE (ptr) == ADDR_EXPR)
@@ -279,7 +311,7 @@ ptr_deref_may_alias_decl_p (tree ptr, tree decl)
       if (base
	   && (TREE_CODE (base) == MEM_REF
	       || TREE_CODE (base) == TARGET_MEM_REF))
-	ptr = TREE_OPERAND (base, 0);
+	return ptr_deref_may_alias_decl_p (TREE_OPERAND (base, 0), decl);
       else if (base
	       && DECL_P (base))
	return compare_base_decls (base, decl) != 0;
@@ -294,13 +326,7 @@ ptr_deref_may_alias_decl_p (tree ptr, tree decl)
   if (!may_be_aliased (decl))
     return false;
 
-  /* If we do not have useful points-to information for this pointer
-     we cannot disambiguate anything else.  */
-  pi = SSA_NAME_PTR_INFO (ptr);
-  if (!pi)
-    return true;
-
-  return pt_solution_includes (&pi->pt, decl);
+  return true;
 }
 
 /* Return true if dereferenced PTR1 and PTR2 may alias.
-- 
2.25.1
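As a compilable illustration of the case the patch description mentions: ptr2 is derived from array1 through POINTER_PLUS_EXPRs only, so a store through it cannot touch array2. The store/reload pattern and function name are illustrative additions (whether GCC actually removes the reload depends on the pass pipeline, not just on this disambiguator).

```c
#include <assert.h>

int array1[10], array2[10];

/* ptr2 can be traced back to &array1 through pointer arithmetic, so
   "*ptr2 = 7" cannot clobber array2, and the final load of array2[0]
   could in principle be reused from the earlier store of 42.  */
int
store_and_load (int x, int y)
{
  int *ptr1 = array1 + x;
  int *ptr2 = ptr1 + y;
  array2[0] = 42;
  *ptr2 = 7;
  return array2[0];
}
```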
[PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
This patch tries to fix the 2% regression in 510.parest_r on
ampere1 in the tracker. (Previous discussion is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)

1. Add testcases for the problem. For an op list in the form of
"acc = a * b + c * d + acc", currently reassociation doesn't
swap the operands so that more FMAs can be generated.
After widening_mul the result looks like:

   _1 = .FMA(a, b, acc_0);
   acc_1 = .FMA(c, d, _1);

While previously (before the "Handle FMA friendly..." patch),
widening_mul's result was like:

   _1 = a * b;
   _2 = .FMA (c, d, _1);
   acc_1 = acc_0 + _2;

If the code fragment is in a loop, some architectures can execute
the latter in parallel, so the performance can be much faster than
the former. For the small testcase, the performance gap is over
10% on both ampere1 and neoverse-n1. So the point here is to avoid
turning the last statement into an FMA, and keep it a PLUS_EXPR as
much as possible. (If we are rewriting the op list into parallel,
no special treatment is needed, since the last statement after
rewrite_expr_tree_parallel will be a PLUS_EXPR anyway.)

2. Function result_feeds_back_from_phi_p is to check for cross
backedge dependency. Added new enum fma_state to describe the
state of FMA candidates.

With this patch, there's a 3% improvement in the 510.parest_r 1-copy
run on ampere1. The compile options are:
"-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".

Best regards,
Di Zhao

	PR tree-optimization/110279

gcc/ChangeLog:

	* tree-ssa-reassoc.cc (enum fma_state): New enum to
	describe the state of FMA candidates for an op list.
	(rewrite_expr_tree_parallel): Changed boolean
	parameter to enum type.
	(result_feeds_back_from_phi_p): New function to check
	for cross backedge dependency.
	(rank_ops_for_fma): Return enum fma_state. Added new
	parameter.
	(reassociate_bb): If there's backedge dependency in an
	op list, swap the operands before rewrite_expr_tree.

gcc/testsuite/ChangeLog:

	* gcc.dg/pr110279.c: New test.
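The accumulation pattern above can be written as a compilable sketch (the array operands and function name are illustrative; the patch concerns the scalar op list "acc = a * b + c * d + acc"). Per iteration, "a[i] * b[i]" and "c[i] * d[i]" are independent; only the final add depends on the previous iteration's "acc", which is why mul+FMA+add can overlap across iterations where two chained FMAs cannot.

```c
#include <assert.h>

/* Each iteration's multiplications are independent of "acc"; only the
   last addition carries the loop-carried dependence.  */
double
dot2 (const double *a, const double *b, const double *c, const double *d,
      int n)
{
  double acc = 0.0;
  for (int i = 0; i < n; i++)
    acc = a[i] * b[i] + c[i] * d[i] + acc;
  return acc;
}
```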
Attachment: 0001-swap-operands-in-reassoc-to-reduce-cross-backedge-FM.patch
RE: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc
Hi,

A few updates to the patch:

1. rank_ops_for_fma: return FMA_STATE_NESTED only for a complete FMA
   chain, since the regression is obvious only in this case.

2. Added a new testcase.

Thanks,
Di Zhao

	PR tree-optimization/110279

gcc/ChangeLog:

	* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added new
	parameter collect_lhs.
	(struct fma_transformation_info): Moved to header.
	(class fma_deferring_state): Moved to header.
	(convert_mult_to_fma): Added new parameter collect_lhs.
	* tree-ssa-math-opts.h (struct fma_transformation_info):
	(class fma_deferring_state): Moved from .cc.
	(convert_mult_to_fma): Moved from .cc.
	* tree-ssa-reassoc.cc (enum fma_state): Defined enum to
	describe the state of FMA candidates for a list of operands.
	(rewrite_expr_tree_parallel): Changed boolean parameter to
	enum type.
	(has_nested_fma_p): New function to check for nested FMA on
	given multiplication statement.
	(rank_ops_for_fma): Return enum fma_state.
	(reassociate_bb): Avoid rewriting to parallel if nested FMAs
	are found.

gcc/testsuite/ChangeLog:

	* gcc.dg/pr110279-1.c: New test.
	* gcc.dg/pr110279-2.c: New test.
	* gcc.dg/pr110279-3.c: New test.

> -----Original Message-----
> From: Di Zhao OS
> Sent: Thursday, August 10, 2023 12:53 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Biener
> Subject: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc
>
> Hi,
>
> The previous version of this patch tries to solve two problems
> at the same time. For better clarity, I'll separate them and
> only deal with the "nested" FMA in this version. I plan to
> propose another patch on avoiding badly shaped FMA (deferring FMA).
>
> Other changes:
>
> 1. Added new testcases for the "nested" FMA issue. For the
>    following code:
>
>      tmp1 = a + c * c + d * d + x * y;
>      tmp2 = x * tmp1;
>      result += (a + c + d + tmp2);
>
>    , when "tmp1 = ..." is not rewritten, tmp1 will be the result of
>    an FMA, and there will be a list of consecutive FMAs:
>
>      _1 = .FMA (c, c, a_39);
>      _2 = .FMA (d, d, _1);
>      tmp1 = .FMA (x, y, _2);
>      _3 = .FMA (tmp1, x, d);
>      ...
>
>    If "tmp1 = ..." is rewritten to parallel, tmp1 will be the result
>    of a PLUS_EXPR between FMAs:
>
>      _1 = .FMA (c, c, a_39);
>      _2 = x * y;
>      _3 = .FMA (d, d, _2);
>      tmp1 = _3 + _1;
>      _4 = .FMA (tmp1, x, d);
>      ...
>
>    It seems the register pressure of the latter is higher than that
>    of the former. On the test machines we have (including Ampere1,
>    Neoverse-n1 and Intel Xeon), with "tmp1 = ..." rewritten to
>    parallel, the run time all increased significantly. In contrast,
>    when "tmp1" is not the 1st or 2nd operand of another FMA
>    (pr110279-1.c), rewriting it results in better performance.
>    (I'll also append the testcases in the bug tracker.)
>
> 2. Enhanced checking for nested FMA by: 1) Modified
>    convert_mult_to_fma so it can return multiple LHS. 2) Check
>    NEGATE_EXPRs for nested FMA.
>
> (I think maybe this can be further refined by enabling rewriting
> to parallel for very long op lists.)
>
> Bootstrapped and regression tested on x86_64-linux-gnu.
>
> Thanks,
> Di Zhao

Attachment: 0001-PATCH-tree-optimization-110279-Check-for-nested-FMA-.patch
[PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc
Hi,

The previous version of this patch tries to solve two problems
at the same time. For better clarity, I'll separate them and
only deal with the "nested" FMA in this version. I plan to
propose another patch on avoiding badly shaped FMA (deferring FMA).

Other changes:

1. Added new testcases for the "nested" FMA issue. For the
   following code:

     tmp1 = a + c * c + d * d + x * y;
     tmp2 = x * tmp1;
     result += (a + c + d + tmp2);

   , when "tmp1 = ..." is not rewritten, tmp1 will be the result of
   an FMA, and there will be a list of consecutive FMAs:

     _1 = .FMA (c, c, a_39);
     _2 = .FMA (d, d, _1);
     tmp1 = .FMA (x, y, _2);
     _3 = .FMA (tmp1, x, d);
     ...

   If "tmp1 = ..." is rewritten to parallel, tmp1 will be the result
   of a PLUS_EXPR between FMAs:

     _1 = .FMA (c, c, a_39);
     _2 = x * y;
     _3 = .FMA (d, d, _2);
     tmp1 = _3 + _1;
     _4 = .FMA (tmp1, x, d);
     ...

   It seems the register pressure of the latter is higher than that
   of the former. On the test machines we have (including Ampere1,
   Neoverse-n1 and Intel Xeon), with "tmp1 = ..." rewritten to
   parallel, the run time all increased significantly. In contrast,
   when "tmp1" is not the 1st or 2nd operand of another FMA
   (pr110279-1.c), rewriting it results in better performance.
   (I'll also append the testcases in the bug tracker.)

2. Enhanced checking for nested FMA by: 1) Modified
   convert_mult_to_fma so it can return multiple LHS. 2) Check
   NEGATE_EXPRs for nested FMA.

(I think maybe this can be further refined by enabling rewriting
to parallel for very long op lists.)

Bootstrapped and regression tested on x86_64-linux-gnu.

Thanks,
Di Zhao

	PR tree-optimization/110279

gcc/ChangeLog:

	* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added new
	parameter collect_lhs.
	(struct fma_transformation_info): Moved to header.
	(class fma_deferring_state): Moved to header.
	(convert_mult_to_fma): Added new parameter collect_lhs.
	* tree-ssa-math-opts.h (struct fma_transformation_info):
	(class fma_deferring_state): Moved from .cc.
	(convert_mult_to_fma): Moved from .cc.
	* tree-ssa-reassoc.cc (enum fma_state): Defined enum to
	describe the state of FMA candidates for a list of operands.
	(rewrite_expr_tree_parallel): Changed boolean parameter to
	enum type.
	(has_nested_fma_p): New function to check for nested FMA on
	given multiplication statement.
	(rank_ops_for_fma): Return enum fma_state.
	(reassociate_bb): Avoid rewriting to parallel if nested FMAs
	are found.

gcc/testsuite/ChangeLog:

	* gcc.dg/pr110279-1.c: New test.
	* gcc.dg/pr110279-2.c: New test.

Attachment: tree-optimization-110279-Check-for-nested-FMA-in-rea.patch
RE: [PATCH] Change fma_reassoc_width tuning for ampere1
Cherry-picked this to gcc-13.

Thanks,
Di Zhao

> -----Original Message-----
> From: Richard Sandiford
> Sent: Monday, June 26, 2023 10:28 PM
> To: Philipp Tomsich
> Cc: Di Zhao OS via Gcc-patches ; Di Zhao OS
> Subject: Re: [PATCH] Change fma_reassoc_width tuning for ampere1
>
> Philipp Tomsich writes:
> > Richard,
> >
> > OK for backport to GCC-13?
>
> Yeah, OK for GCC 13 too.
>
> Thanks,
> Richard
>
> > Thanks,
> > Philipp.
> >
> > On Thu, 22 Jun 2023 at 16:18, Richard Sandiford via Gcc-patches
> > wrote:
> >>
> >> Di Zhao OS via Gcc-patches writes:
> >> > This patch enables reassociation of floating-point additions on ampere1.
> >> > This brings about 1% overall benefit on spec2017 fprate cases. (There
> >> > are minor regressions in 510.parest_r and 508.namd_r, analyzed here:
> >> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110279 .)
> >> >
> >> > Bootstrapped and tested on aarch64-unknown-linux-gnu. Is this OK for
> >> > trunk?
> >> >
> >> > Thanks,
> >> > Di Zhao
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> > 	* config/aarch64/aarch64.cc: Change fma_reassoc_width for ampere1
> >>
> >> Thanks, pushed to trunk.
> >>
> >> Richard
> >>
> >> > ---
> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> > index d16565b5581..301c9f6c0cd 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -1927,7 +1927,7 @@ static const struct tune_params ampere1_tunings =
> >> >    "32:12",	/* loop_align.  */
> >> >    2,	/* int_reassoc_width.  */
> >> >    4,	/* fp_reassoc_width.  */
> >> > -  1,	/* fma_reassoc_width.  */
> >> > +  4,	/* fma_reassoc_width.  */
> >> >    2,	/* vec_reassoc_width.  */
> >> >    2,	/* min_div_recip_mul_sf.  */
> >> >    2,	/* min_div_recip_mul_df.  */
[PATCH v2] tree-optimization/110279 - Check for nested FMA chains in reassoc
Attached is an updated version of the patch. Based on Philipp's review, some changes:

1. Defined new enum fma_state to describe the state of FMA candidates
   for a list of operands. (Since the tests seem simple after the
   change, I didn't add predicates on it.)
2. Changed return type of convert_mult_to_fma_1 and convert_mult_to_fma
   to tree, to remove the in/out parameter.
3. Added description of the return values of rank_ops_for_fma.

---
gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added new parameter
check_only_p. Changed return type to tree.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Added new parameter check_only_p. Changed
return type to tree.
* tree-ssa-math-opts.h (struct fma_transformation_info): Moved from .cc.
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Add function decl.
* tree-ssa-reassoc.cc (enum fma_state): Defined new enum to describe
the state of FMA candidates for a list of operands.
(rewrite_expr_tree_parallel): Changed boolean parameter to enum type.
(rank_ops_for_fma): Return enum fma_state.
(reassociate_bb): Avoid rewriting to parallel if nested FMAs are found.

Thanks,
Di Zhao

0001-Check-for-nested-FMA-chains-in-reassoc.patch
Description: 0001-Check-for-nested-FMA-chains-in-reassoc.patch
[PING][PATCH] tree-optimization/110279 - Check for nested FMA chains in reassoc
Update the patch so it can apply. Tested on spec2017 fprate cases again.
With option "-funroll-loops -Ofast -flto", the improvements of the 1-copy
run are:

Ampere1:
  508.namd_r    4.26%
  510.parest_r  2.55%
  Overall       0.54%

Intel Xeon:
  503.bwaves_r  1.3%
  508.namd_r    1.58%
  Overall       0.42%

Thanks,
Di Zhao

> -----Original Message-----
> From: Di Zhao OS
> Sent: Friday, June 16, 2023 4:51 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] tree-optimization/110279 - Check for nested FMA chains in
> reassoc
>
> This patch is to fix the regressions found in SPEC2017 fprate cases
> on aarch64.
>
> 1. Reused code in pass widening_mul to check for nested FMA chains
> (those connected by MULT_EXPRs), since re-writing to parallel
> generates worse code.
>
> 2. Avoid re-arrangements that produce fewer FMA chains, which can
> be slow.
>
> Tested on ampere1 and neoverse-n1, this fixed the regressions in
> 508.namd_r and 510.parest_r 1-copy run. While I'm still collecting data
> on the x86 machines we have, I'd like to know what you think of this.
>
> (Previously I tried to improve things with FMA by adding a widening_mul
> pass before reassoc2, since it's easier to recognize different patterns
> of FMA chains and decide whether to split them. But I suppose handling
> them all in reassoc pass is more efficient.)
>
> Thanks,
> Di Zhao
>
> ---
> gcc/ChangeLog:
>
> * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
> Support new mode that merely does the checking.
> (struct fma_transformation_info): Moved to header.
> (class fma_deferring_state): Moved to header.
> (convert_mult_to_fma): Add new parameter.
> * tree-ssa-math-opts.h (struct fma_transformation_info):
> (class fma_deferring_state): Moved from .cc.
> (convert_mult_to_fma): Add function decl.
> * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
> (rank_ops_for_fma): Return -1 if nested FMAs are found.
> (reassociate_bb): Avoid rewriting to parallel if nested FMAs are
> found.
0001-Check-for-nested-FMA-chains-in-reassoc.patch Description: 0001-Check-for-nested-FMA-chains-in-reassoc.patch
[PATCH] Change fma_reassoc_width tuning for ampere1
This patch enables reassociation of floating-point additions on ampere1.
This brings about 1% overall benefit on spec2017 fprate cases. (There
are minor regressions in 510.parest_r and 508.namd_r, analyzed here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110279 .)

Bootstrapped and tested on aarch64-unknown-linux-gnu. Is this OK for trunk?

Thanks,
Di Zhao

gcc/ChangeLog:

* config/aarch64/aarch64.cc: Change fma_reassoc_width for ampere1

---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d16565b5581..301c9f6c0cd 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1927,7 +1927,7 @@ static const struct tune_params ampere1_tunings =
   "32:12",  /* loop_align.  */
   2,        /* int_reassoc_width.  */
   4,        /* fp_reassoc_width.  */
-  1,        /* fma_reassoc_width.  */
+  4,        /* fma_reassoc_width.  */
   2,        /* vec_reassoc_width.  */
   2,        /* min_div_recip_mul_sf.  */
   2,        /* min_div_recip_mul_df.  */
--
2.25.1
[PATCH] tree-optimization/110279 - Check for nested FMA chains in reassoc
This patch is to fix the regressions found in SPEC2017 fprate cases
on aarch64.

1. Reused code in pass widening_mul to check for nested FMA chains
(those connected by MULT_EXPRs), since re-writing to parallel
generates worse code.

2. Avoid re-arrangements that produce fewer FMA chains, which can
be slow.

Tested on ampere1 and neoverse-n1, this fixed the regressions in
508.namd_r and 510.parest_r 1-copy run. While I'm still collecting data
on the x86 machines we have, I'd like to know what you think of this.

(Previously I tried to improve things with FMA by adding a widening_mul
pass before reassoc2, since it's easier to recognize different patterns
of FMA chains and decide whether to split them. But I suppose handling
them all in reassoc pass is more efficient.)

Thanks,
Di Zhao

---
gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
Support new mode that merely does the checking.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Add new parameter.
* tree-ssa-math-opts.h (struct fma_transformation_info):
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Add function decl.
* tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
(rank_ops_for_fma): Return -1 if nested FMAs are found.
(reassociate_bb): Avoid rewriting to parallel if nested FMAs are
found.

pr110279-Check-for-nested-FMA-chains-in-reassoc.diff
Description: pr110279-Check-for-nested-FMA-chains-in-reassoc.diff
RE: [PATCH] Handle FMA friendly in reassoc pass
Hello Lili Cui,

Since I'm also trying to improve this lately, I've tested your patch on
several aarch64 machines we have, including neoverse-n1 and ampere1
architectures. However, I haven't reproduced the 6.00% improvement of
503.bwaves_r single copy run you mentioned. Could you share more
information about the aarch64 CPU and compile options you tested? The
option I'm using is "-Ofast", with or without "--param
avoid-fma-max-bits=512".

Additionally, we found some spec2017 cases with regressions, including
4% on 527.cam4_r (neoverse-n1).

> -----Original Message-----
> From: Gcc-patches bounces+dizhao=os.amperecomputing@gcc.gnu.org> On
> Behalf Of Cui, Lili via Gcc-patches
> Sent: Thursday, May 25, 2023 7:30 AM
> To: gcc-patches@gcc.gnu.org
> Cc: richard.guent...@gmail.com; li...@linux.ibm.com; Lili Cui
> Subject: [PATCH] Handle FMA friendly in reassoc pass
>
> From: Lili Cui
>
> Make some changes in reassoc pass to make it more friendly to the fma
> pass later. Using FMA instead of mult + add reduces register pressure
> and instructions retired.
>
> There are mainly two changes:
> 1. Put no-mult ops and mult ops alternately at the end of the queue,
> which is conducive to generating more fma and reducing the loss of FMA
> when breaking the chain.
> 2. Rewrite the rewrite_expr_tree_parallel function to try to build
> parallel chains according to the given correlation width, keeping the
> FMA chance as much as possible.
>
> With the patch applied
>
> On ICX:
> 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> 503.bwaves_r   : Improved by 0.60% for single copy.
> 507.cactuBSSN_r: Improved by 1.10% for single copy.
> 519.lbm_r      : Improved by 2.21% for single copy.
> No measurable changes for other benchmarks.
>
> On aarch64:
> 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> 503.bwaves_r   : Improved by 6.00% for single-copy.
> No measurable changes for other benchmarks.
>
> TEST1:
>
> float
> foo (float a, float b, float c, float d, float *e)
> {
>   return *e + a * b + c * d;
> }
>
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
>         vmulss      %xmm3, %xmm2, %xmm2
>         vfmadd132ss %xmm1, %xmm2, %xmm0
>         vaddss      (%rdi), %xmm0, %xmm0
>         ret
>
> With this patch GCC generates:
>         vfmadd213ss (%rdi), %xmm1, %xmm0
>         vfmadd231ss %xmm2, %xmm3, %xmm0
>         ret
>
> TEST2:
>
> for (int i = 0; i < N; i++)
> {
>   a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i]
>           + k[i] * l[i] + m[i] * o[i] + p[i];
> }
>
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
>         vmovapd     e(%rax), %ymm4
>         vmulpd      d(%rax), %ymm4, %ymm3
>         addq        $32, %rax
>         vmovapd     c-32(%rax), %ymm5
>         vmovapd     j-32(%rax), %ymm6
>         vmulpd      h-32(%rax), %ymm6, %ymm2
>         vmovapd     a-32(%rax), %ymm6
>         vaddpd      p-32(%rax), %ymm6, %ymm0
>         vmovapd     g-32(%rax), %ymm7
>         vfmadd231pd b-32(%rax), %ymm5, %ymm3
>         vmovapd     o-32(%rax), %ymm4
>         vmulpd      m-32(%rax), %ymm4, %ymm1
>         vmovapd     l-32(%rax), %ymm5
>         vfmadd231pd f-32(%rax), %ymm7, %ymm2
>         vfmadd231pd k-32(%rax), %ymm5, %ymm1
>         vaddpd      %ymm3, %ymm0, %ymm0
>         vaddpd      %ymm2, %ymm0, %ymm0
>         vaddpd      %ymm1, %ymm0, %ymm0
>         vmovapd     %ymm0, a-32(%rax)
>         cmpq        $8192, %rax
>         jne         .L4
>         vzeroupper
>         ret
>
> with this patch applied GCC breaks the chain with width = 2 and generates 6 fma:
>
>         vmovapd     a(%rax), %ymm2
>         vmovapd     c(%rax), %ymm0
>         addq        $32, %rax
>         vmovapd     e-32(%rax), %ymm1
>         vmovapd     p-32(%rax), %ymm5
>         vmovapd     g-32(%rax), %ymm3
>         vmovapd     j-32(%rax), %ymm6
>         vmovapd     l-32(%rax), %ymm4
>         vmovapd     o-32(%rax), %ymm7
>         vfmadd132pd b-32(%rax), %ymm2, %ymm0
>         vfmadd132pd d-32(%rax), %ymm5, %ymm1
>         vfmadd231pd f-32(%rax), %ymm3, %ymm0
>         vfmadd231pd h-32(%rax), %ymm6, %ymm1
>         vfmadd231pd k-32(%rax), %ymm4, %ymm0
>         vfmadd231pd m-32(%rax), %ymm7, %ymm1
>         vaddpd      %ymm1, %ymm0, %ymm0
>         vmovapd     %ymm0, a-32(%rax)
>         cmpq        $8192, %rax
>         jne         .L2
>         vzeroupper
>         ret
>
> gcc/ChangeLog:
>
> PR gcc/98350
> * tree-ssa-reassoc.cc
> (rewrite_expr_tree_parallel): Rewrite this function.
> (rank_ops_for_fma): New.
> (reassociate_bb): Handle new function.
>
> gcc/testsuite/ChangeLog:
>
> PR gcc/98350
> * gcc.dg/pr98350-1.c: New test.
> * gcc.dg/pr98350-2.c: Ditto.
> ---
>  gcc/testsuite/gcc.dg/pr98350-1.c |  31
>  gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
>  gcc/tree-ssa-reassoc.cc          | 256 +--
>  3 files changed, 215 insertions(+), 83 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
>
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> new file mode 100644
>
RE: [RFC][PATCH] Improve generating FMA by adding a widening_mul pass
Sorry I've missed the recent updates on trunk regarding handling FMA.
I'll measure again if something in this still helps.

Thanks,
Di Zhao

> -----Original Message-----
> From: Di Zhao OS
> Sent: Friday, May 26, 2023 3:15 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [RFC][PATCH] Improve generating FMA by adding a widening_mul pass
>
> As GCC's reassociation pass does not have knowledge of FMA, when
> transforming expression lists to parallel, it reduces the
> opportunities to generate FMAs. Currently there's a workaround
> on AArch64 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114),
> that is, to disable the parallelization with floating-point additions.
> However, this approach may cause regressions. For example, in the
> code below there are only floating-point additions when calculating
> "result += array[j]", and rewriting to parallel is better:
>
> // Compile with -Ofast on aarch64
> float foo (int n, float in)
> {
>   float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
>   float result = 0.0;
>   for (int i = 0; i < n; i++)
>     {
>       if (i % 10)
>         for (unsigned j = 0; j < 8; j++)
>           array[j] *= in;
>
>       for (unsigned j = 0; j < 8; j++)
>         result += array[j];
>     }
>   return result;
> }
>
> To improve this, one option is to count the number of MUL_EXPRs in an
> operator list before rewriting to parallel, and allow the rewriting
> when there's none (or 1 MUL_EXPR). This is simple and unlikely to
> introduce regressions. However it lacks flexibility and cannot handle
> more general cases.
>
> Here's an attempt to address the issue more generally.
>
> 1. Added an additional widening_mul pass before the original reassoc2
> pass. The new pass is limited to only insert FMA, and leaves other
> operations like convert_mult_to_widen to the old late widening_mul pass,
> in case other optimizations between the two passes could be hindered.
>
> 2. On some platforms, for a very long FMA chain, rewriting to parallel
> can be faster. Extended the original "deferring" logic so that all
> conversions to FMA can be deferred. Introduced a new parameter
> op-count-prefer-reassoc to control this behavior.
>
> 3. Additionally, the new widening_mul pass calls execute_reassoc first,
> to avoid losing opportunities such as folding constants and
> undistributing.
>
> However, changing the sequence of generating FMA and reassociation may
> expose more FMA chains that are slow (see commit 4a0d0ed2).
> To reduce possible regressions, improved handling of the slow FMA
> chains by:
>
> 1. Modified result_of_phi to support checking an additional FADD/FMUL.
>
> 2. On some CPUs, rather than removing the whole FMA chain, only skipping
> a few candidates may generate faster code. Added new parameter
> fskip-fma-heuristic to control this behavior.
>
> This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.
>
> Thanks,
> Di Zhao
[RFC][PATCH] Improve generating FMA by adding a widening_mul pass
As GCC's reassociation pass does not have knowledge of FMA, when
transforming expression lists to parallel, it reduces the
opportunities to generate FMAs. Currently there's a workaround
on AArch64 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114),
that is, to disable the parallelization with floating-point additions.
However, this approach may cause regressions. For example, in the
code below there are only floating-point additions when calculating
"result += array[j]", and rewriting to parallel is better:

// Compile with -Ofast on aarch64
float foo (int n, float in)
{
  float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
  float result = 0.0;
  for (int i = 0; i < n; i++)
    {
      if (i % 10)
        for (unsigned j = 0; j < 8; j++)
          array[j] *= in;

      for (unsigned j = 0; j < 8; j++)
        result += array[j];
    }
  return result;
}

To improve this, one option is to count the number of MUL_EXPRs in an
operator list before rewriting to parallel, and allow the rewriting
when there's none (or 1 MUL_EXPR). This is simple and unlikely to
introduce regressions. However it lacks flexibility and cannot handle
more general cases.

Here's an attempt to address the issue more generally.

1. Added an additional widening_mul pass before the original reassoc2
pass. The new pass is limited to only insert FMA, and leaves other
operations like convert_mult_to_widen to the old late widening_mul pass,
in case other optimizations between the two passes could be hindered.

2. On some platforms, for a very long FMA chain, rewriting to parallel
can be faster. Extended the original "deferring" logic so that all
conversions to FMA can be deferred. Introduced a new parameter
op-count-prefer-reassoc to control this behavior.

3. Additionally, the new widening_mul pass calls execute_reassoc first,
to avoid losing opportunities such as folding constants and
undistributing.

However, changing the sequence of generating FMA and reassociation may
expose more FMA chains that are slow (see commit 4a0d0ed2).
To reduce possible regressions, improved handling of the slow FMA
chains by:

1. Modified result_of_phi to support checking an additional FADD/FMUL.

2. On some CPUs, rather than removing the whole FMA chain, only skipping
a few candidates may generate faster code. Added new parameter
fskip-fma-heuristic to control this behavior.

This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.

Thanks,
Di Zhao

0001-Improve-generating-FMA-by-adding-a-widening_mul-pass.patch
Description: 0001-Improve-generating-FMA-by-adding-a-widening_mul-pass.patch
PING: [PATCH v6] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Hi, I saw that Stage 1 of GCC 13 development just ended. So is this
considered? Or should I bring this up when general development is
reopened?

Thanks,
Di Zhao

> -----Original Message-----
> From: Di Zhao OS
> Sent: Tuesday, October 25, 2022 8:18 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Biener
> Subject: [PATCH v6] tree-optimization/101186 - extend FRE with "equivalence
> map" for condition prediction
>
> Sorry for the late update. I've been on a vacation and then I
> spent some time updating and verifying the patch.
>
> Attached is a new version of the patch. There are some changes:
>
> 1. Store equivalences in a vn_pval chain in vn_ssa_aux, rather than
>    in the expression hash table. (Following Richard's suggestion.)
> 2. Extracted insert_single_predicated_value function.
> 3. Simplified record_equiv_from_prev_phi a bit.
> 4. Changed some of the functions' names and tried to improve the
>    comments a little.
>
> Current status of the new testcases in the patch:
>
> ssa-fre-200.c   Can also be optimized by evrp
> ssa-fre-201.c   Not optimized in trunk.
> ssa-fre-202.c   foo() can be removed by evrp; while x + b is not folded.
> ssa-pre-34.c    Not optimized in trunk.
>
> Initially, this patch was motivated by removing the unreachable code
> in cases like ssa-pre-34.c, in which we need to use an equivalence
> relation produced from a preceding condition for another condition.
> VRP didn't optimize that because it needs jump threading to make
> the relation valid at the second condition.
>
> After browsing the mechanisms of VRP and FRE, it seems to me there
> are two options: 1) Teach VRP to identify related but not threaded
> conditions. That might require introducing value-numbering into VRP
> to detect common expressions, and I think is too much for this.
> 2) Introduce temporary equivalence in sccvn, which I thought would
> change less of the current code. (And along the reviews and updating
> the patch I see how ad-hoc it was.)
>
> I saw from the talk about VN there's a plan to replace predicated
> values by ranger. So how is that going? Is there something I can help
> with? (For the case ssa-pre-34.c, I think maybe it still needs the
> predicated-value support, to look up related conditional expressions.)
>
> Below are answers to questions in the last review:
>
> > > /* Valid hashtables storing information we have proven to be
> > >    correct.  */
> > > @@ -490,9 +492,9 @@ VN_INFO (tree name)
> > >   nary->predicated_values = 0;
> > >   nary->u.result = boolean_true_node;
> > >   vn_nary_op_insert_into (nary, valid_info->nary);
> > > - gcc_assert (nary->unwind_to == NULL);
> >
> > why's that? doesn't this mean unwinding will be broken?
>
> Previously, predicate "argument_x == NULL" or "argument_x != NULL"
> is always new here (because argument_x's VN is just inserted.)
> But with the patch, there can be a slot for "argument_x == NULL"
> or "argument_x != NULL" already. It won't break unwinding as the
> new value is not linked to the unwind-chain.
>
> > >   /* Also do not link it into the undo chain.  */
> > >   last_inserted_nary = nary->next;
> > > + /* There could be a predicate already.  */
> > >   nary->next = (vn_nary_op_t)(void *)-1;
> > >   nary = alloc_vn_nary_op_noinit (2, _tables_insert_obstack);
> > >   init_vn_nary_op_from_pieces (nary, 2, EQ_EXPR,
>
> > > /* Compute and return the hash value for nary operation VBO1.  */
> > >
> > > hashval_t
> > > @@ -4226,6 +4342,9 @@ init_vn_nary_op_from_stmt (vn_nary_op_t vno, gassign *stmt)
> > >    for (i = 0; i < vno->length; ++i)
> > >      vno->op[i] = gimple_op (stmt, i + 1);
> > >   }
> > > + /* Insert and lookup N-ary results by the operands' equivalence heads.  */
> > > + if (gimple_bb (stmt))
> > > +   lookup_equiv_heads (vno->length, vno->op, vno->op, gimple_bb (stmt));
> >
> > That seems like the wrong place, the function didn't even valueize before.
>
> To utilize temp-equivalences and get more simplified results, n-ary
> expressions should always be inserted and looked up by the operands'
> equivalence heads. So practically all the places
> init_vn_nary_op_from_stmt is used, lookup_equiv_heads (changed to
> get_equiv_heads) should be called. As I haven't found a better place
> to put that, I just left it here in the patch..
>
> > > visit_nary_op (tree lhs, gassign *stmt)
> > > {
> > >   vn_nary_op_t vnresult;
> > > - tree result = vn_nary_op_lookup_stmt (stmt, );
> > > - if (! result && vnresult)
> > > + unsigned length = vn_nary_length_from_stmt (stmt);
> > > + vn_nary_op_t vno
> > > +   = XALLOCAVAR (struct vn_nary_op_s, sizeof_vn_nary_op (length));
> > > + init_vn_nary_op_from_stmt (vno, stmt);
> > > + tree result = NULL_TREE;
> > > + /* Try to get a simplified result.  */
> > > + /* Do not simplify variable used in PHI at loop exit, or
> > > +    simplify_peeled_chrec/constant_after_peeling may miss the loop.  */
> > > + gimple *use_stmt;
> > > +
[PATCH v6] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Sorry for the late update. I've been on a vacation and then I
spent some time updating and verifying the patch.

Attached is a new version of the patch. There are some changes:

1. Store equivalences in a vn_pval chain in vn_ssa_aux, rather than
   in the expression hash table. (Following Richard's suggestion.)
2. Extracted insert_single_predicated_value function.
3. Simplified record_equiv_from_prev_phi a bit.
4. Changed some of the functions' names and tried to improve the
   comments a little.

Current status of the new testcases in the patch:

ssa-fre-200.c   Can also be optimized by evrp
ssa-fre-201.c   Not optimized in trunk.
ssa-fre-202.c   foo() can be removed by evrp; while x + b is not folded.
ssa-pre-34.c    Not optimized in trunk.

Initially, this patch was motivated by removing the unreachable code
in cases like ssa-pre-34.c, in which we need to use an equivalence
relation produced from a preceding condition for another condition.
VRP didn't optimize that because it needs jump threading to make
the relation valid at the second condition.

After browsing the mechanisms of VRP and FRE, it seems to me there
are two options: 1) Teach VRP to identify related but not threaded
conditions. That might require introducing value-numbering into VRP
to detect common expressions, and I think is too much for this.
2) Introduce temporary equivalence in sccvn, which I thought would
change less of the current code. (And along the reviews and updating
the patch I see how ad-hoc it was.)

I saw from the talk about VN there's a plan to replace predicated
values by ranger. So how is that going? Is there something I can help
with? (For the case ssa-pre-34.c, I think maybe it still needs the
predicated-value support, to look up related conditional expressions.)

Below are answers to questions in the last review:

> > /* Valid hashtables storing information we have proven to be
> >    correct.  */
> > @@ -490,9 +492,9 @@ VN_INFO (tree name)
> >   nary->predicated_values = 0;
> >   nary->u.result = boolean_true_node;
> >   vn_nary_op_insert_into (nary, valid_info->nary);
> > - gcc_assert (nary->unwind_to == NULL);
>
> why's that? doesn't this mean unwinding will be broken?

Previously, predicate "argument_x == NULL" or "argument_x != NULL"
is always new here (because argument_x's VN is just inserted.)
But with the patch, there can be a slot for "argument_x == NULL"
or "argument_x != NULL" already. It won't break unwinding as the
new value is not linked to the unwind-chain.

> >   /* Also do not link it into the undo chain.  */
> >   last_inserted_nary = nary->next;
> > + /* There could be a predicate already.  */
> >   nary->next = (vn_nary_op_t)(void *)-1;
> >   nary = alloc_vn_nary_op_noinit (2, _tables_insert_obstack);
> >   init_vn_nary_op_from_pieces (nary, 2, EQ_EXPR,

> > /* Compute and return the hash value for nary operation VBO1.  */
> >
> > hashval_t
> > @@ -4226,6 +4342,9 @@ init_vn_nary_op_from_stmt (vn_nary_op_t vno, gassign *stmt)
> >    for (i = 0; i < vno->length; ++i)
> >      vno->op[i] = gimple_op (stmt, i + 1);
> >   }
> > + /* Insert and lookup N-ary results by the operands' equivalence heads.  */
> > + if (gimple_bb (stmt))
> > +   lookup_equiv_heads (vno->length, vno->op, vno->op, gimple_bb (stmt));
>
> That seems like the wrong place, the function didn't even valueize before.

To utilize temp-equivalences and get more simplified results, n-ary
expressions should always be inserted and looked up by the operands'
equivalence heads. So practically all the places
init_vn_nary_op_from_stmt is used, lookup_equiv_heads (changed to
get_equiv_heads) should be called. As I haven't found a better place
to put that, I just left it here in the patch..

> > visit_nary_op (tree lhs, gassign *stmt)
> > {
> >   vn_nary_op_t vnresult;
> > - tree result = vn_nary_op_lookup_stmt (stmt, );
> > - if (! result && vnresult)
> > + unsigned length = vn_nary_length_from_stmt (stmt);
> > + vn_nary_op_t vno
> > +   = XALLOCAVAR (struct vn_nary_op_s, sizeof_vn_nary_op (length));
> > + init_vn_nary_op_from_stmt (vno, stmt);
> > + tree result = NULL_TREE;
> > + /* Try to get a simplified result.  */
> > + /* Do not simplify variable used in PHI at loop exit, or
> > +    simplify_peeled_chrec/constant_after_peeling may miss the loop.  */
> > + gimple *use_stmt;
> > + use_operand_p use_p;
> > + if (!(single_imm_use (lhs, _p, _stmt)
> > +       && gimple_code (use_stmt) == GIMPLE_PHI
> > +       && single_succ_p (gimple_bb (use_stmt))
> > +       && (single_succ_edge (gimple_bb (use_stmt))->flags & EDGE_DFS_BACK)))
> > +   result = fold_const_from_equiv_heads (vno->length, vno->opcode, vno->op,
> > +                                         vno->type);
>
> copy propagating conditional equivalences has proved problematic, why
> do this at all when there's no actual simplification? It's a bit odd that
> we need a special fold_const_from_equiv_heads
PING^2: [PATCH v5] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Gentle ping again.

Thanks,
Di Zhao

> -----Original Message-----
> From: Di Zhao OS
> Sent: Tuesday, July 12, 2022 2:08 AM
> To: 'gcc-patches@gcc.gnu.org'
> Cc: 'Richard Biener'
> Subject: PING: [PATCH v5] tree-optimization/101186 - extend FRE with
> "equivalence map" for condition prediction
>
> Updated the patch in the attachment, so it can apply.
>
> Thanks,
> Di Zhao
>
> > -----Original Message-----
> > From: Di Zhao OS
> > Sent: Sunday, May 29, 2022 11:59 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Richard Biener
> > Subject: [PATCH v5] tree-optimization/101186 - extend FRE with "equivalence
> > map" for condition prediction
> >
> > Hi, attached is a new version of the patch. The changes are:
> > - Skip using temporary equivalences for floating-point values, because
> >   folding expressions can generate incorrect values. For example,
> >   operations on 0.0 and -0.0 may have different results.
> > - Avoid inserting duplicated back-refs from value-number to predicates.
> > - Disable fre in testsuite/g++.dg/pr83541.C .
> >
> > Summary of the previous versions:
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-December/587346.html
> >
> > Is the patch still considered?
> >
> > Thanks,
> > Di Zhao
> >
> > ---
> >
> > Extend FRE with temporary equivalences.
> >
> > 2022-05-29  Di Zhao
> >
> > gcc/ChangeLog:
> >     PR tree-optimization/101186
> >     * tree-ssa-sccvn.c (VN_INFO): remove assertions (there could be a
> >     predicate already).
> >     (dominated_by_p_w_unex): Moved upward.
> >     (vn_nary_op_get_predicated_value): Moved upward.
> >     (is_vn_valid_at_bb): Check if vn_pval is valid at BB.
> >     (lookup_equiv_head): Lookup the "equivalence head" of given node.
> >     (lookup_equiv_heads): Lookup the "equivalence head"s of given nodes.
> >     (vn_tracking_edge): Extracted utility function.
> >     (init_vn_nary_op_from_stmt): Insert and lookup by "equivalence head"s.
> >     (vn_nary_op_insert_into): Insert new value at the front.
> >     (vn_nary_op_insert_pieces_predicated_1): Insert as predicated values
> >     from pieces.
> >     (fold_const_from_equiv_heads): Fold N-ary expression of equiv-heads.
> >     (push_new_nary_ref): Insert a back-reference to vn_nary_op_t.
> >     (val_equiv_insert): Record temporary equivalence.
> >     (vn_nary_op_insert_pieces_predicated): Record equivalences instead of
> >     some predicates; insert back-refs.
> >     (record_equiv_from_prev_phi_1): Record temporary equivalences
> >     generated by PHI nodes.
> >     (record_equiv_from_prev_phi): Given an outgoing edge of a conditional
> >     expression taken, record equivalences generated by PHI nodes.
> >     (visit_nary_op): Add lookup previous results of N-ary operations by
> >     equivalences.
> >     (insert_related_predicates_on_edge): Some predicates can be computed
> >     from equivalences, no need to insert them.
> >     (process_bb): Add lookup predicated values by equivalences.
> >     (struct unwind_state): Unwind state of back-refs to vn_nary_op_t.
> >     (do_unwind): Unwind the back-refs to vn_nary_op_t.
> >     (do_rpo_vn): Update back-reference unwind state.
> >     * tree-ssa-sccvn.h (struct nary_ref): hold a list of references to
> >     the nary map entries.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     * g++.dg/pr83541.C: Disable fre.
> >     * gcc.dg/tree-ssa/pr68619-2.c: Disable fre.
> >     * gcc.dg/tree-ssa/pr71947-1.c: Disable fre.
> >     * gcc.dg/tree-ssa/pr71947-2.c: Disable fre.
> >     * gcc.dg/tree-ssa/pr71947-3.c: Disable fre.
> >     * gcc.dg/tree-ssa/pr71947-5.c: Disable fre.
> >     * gcc.dg/tree-ssa/pr71947-7.c: Disable fre.
> >     * gcc.dg/tree-ssa/pr71947-8.c: Disable fre.
> >     * gcc.dg/tree-ssa/pr71947-9.c: Disable fre.
> >     * gcc.dg/tree-ssa/vrp03.c: Disable fre.
> >     * gcc.dg/tree-ssa/ssa-fre-100.c: New test.
> >     * gcc.dg/tree-ssa/ssa-fre-101.c: New test.
> >     * gcc.dg/tree-ssa/ssa-fre-102.c: New test.
> >     * gcc.dg/tree-ssa/ssa-pre-34.c: New test.
PING: [PATCH v5] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Updated the patch in the attachment, so it can apply.

Thanks,
Di Zhao

> -----Original Message-----
> From: Di Zhao OS
> Sent: Sunday, May 29, 2022 11:59 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Biener
> Subject: [PATCH v5] tree-optimization/101186 - extend FRE with "equivalence
> map" for condition prediction
>
> Hi, attached is a new version of the patch. The changes are:
> - Skip using temporary equivalences for floating-point values, because
>   folding expressions can generate incorrect values. For example,
>   operations on 0.0 and -0.0 may have different results.
> - Avoid inserting duplicated back-refs from value-number to predicates.
> - Disable fre in testsuite/g++.dg/pr83541.C .
>
> Summary of the previous versions:
> https://gcc.gnu.org/pipermail/gcc-patches/2021-December/587346.html
>
> Is the patch still considered?
>
> Thanks,
> Di Zhao
>
> ---
>
> Extend FRE with temporary equivalences.
>
> 2022-05-29  Di Zhao
>
> gcc/ChangeLog:
>     PR tree-optimization/101186
>     * tree-ssa-sccvn.c (VN_INFO): remove assertions (there could be a
>     predicate already).
>     (dominated_by_p_w_unex): Moved upward.
>     (vn_nary_op_get_predicated_value): Moved upward.
>     (is_vn_valid_at_bb): Check if vn_pval is valid at BB.
>     (lookup_equiv_head): Lookup the "equivalence head" of given node.
>     (lookup_equiv_heads): Lookup the "equivalence head"s of given nodes.
>     (vn_tracking_edge): Extracted utility function.
>     (init_vn_nary_op_from_stmt): Insert and lookup by "equivalence head"s.
>     (vn_nary_op_insert_into): Insert new value at the front.
>     (vn_nary_op_insert_pieces_predicated_1): Insert as predicated values
>     from pieces.
>     (fold_const_from_equiv_heads): Fold N-ary expression of equiv-heads.
>     (push_new_nary_ref): Insert a back-reference to vn_nary_op_t.
>     (val_equiv_insert): Record temporary equivalence.
>     (vn_nary_op_insert_pieces_predicated): Record equivalences instead of
>     some predicates; insert back-refs.
>     (record_equiv_from_prev_phi_1): Record temporary equivalences
>     generated by PHI nodes.
>     (record_equiv_from_prev_phi): Given an outgoing edge of a conditional
>     expression taken, record equivalences generated by PHI nodes.
>     (visit_nary_op): Add lookup previous results of N-ary operations by
>     equivalences.
>     (insert_related_predicates_on_edge): Some predicates can be computed
>     from equivalences, no need to insert them.
>     (process_bb): Add lookup predicated values by equivalences.
>     (struct unwind_state): Unwind state of back-refs to vn_nary_op_t.
>     (do_unwind): Unwind the back-refs to vn_nary_op_t.
>     (do_rpo_vn): Update back-reference unwind state.
>     * tree-ssa-sccvn.h (struct nary_ref): hold a list of references to
>     the nary map entries.
>
> gcc/testsuite/ChangeLog:
>
>     * g++.dg/pr83541.C: Disable fre.
>     * gcc.dg/tree-ssa/pr68619-2.c: Disable fre.
>     * gcc.dg/tree-ssa/pr71947-1.c: Disable fre.
>     * gcc.dg/tree-ssa/pr71947-2.c: Disable fre.
>     * gcc.dg/tree-ssa/pr71947-3.c: Disable fre.
>     * gcc.dg/tree-ssa/pr71947-5.c: Disable fre.
>     * gcc.dg/tree-ssa/pr71947-7.c: Disable fre.
>     * gcc.dg/tree-ssa/pr71947-8.c: Disable fre.
>     * gcc.dg/tree-ssa/pr71947-9.c: Disable fre.
>     * gcc.dg/tree-ssa/vrp03.c: Disable fre.
>     * gcc.dg/tree-ssa/ssa-fre-100.c: New test.
>     * gcc.dg/tree-ssa/ssa-fre-101.c: New test.
>     * gcc.dg/tree-ssa/ssa-fre-102.c: New test.
>     * gcc.dg/tree-ssa/ssa-pre-34.c: New test.

v5-tree-optimization-101186.patch
Description: v5-tree-optimization-101186.patch
[PATCH v5] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Hi, attached is a new version of the patch. The changes are:

- Skip using temporary equivalences for floating-point values, because folding expressions can generate incorrect values. For example, operations on 0.0 and -0.0 may have different results.
- Avoid inserting duplicated back-refs from value-number to predicates.
- Disable fre in testsuite/g++.dg/pr83541.C.

Summary of the previous versions:
https://gcc.gnu.org/pipermail/gcc-patches/2021-December/587346.html

Is the patch still being considered?

Thanks,
Di Zhao

---

Extend FRE with temporary equivalences.

2022-05-29  Di Zhao

gcc/ChangeLog:

	PR tree-optimization/101186
	* tree-ssa-sccvn.c (VN_INFO): Remove assertions (there could be a
	predicate already).
	(dominated_by_p_w_unex): Moved upward.
	(vn_nary_op_get_predicated_value): Moved upward.
	(is_vn_valid_at_bb): Check if vn_pval is valid at BB.
	(lookup_equiv_head): Lookup the "equivalence head" of a given node.
	(lookup_equiv_heads): Lookup the "equivalence heads" of given nodes.
	(vn_tracking_edge): Extracted utility function.
	(init_vn_nary_op_from_stmt): Insert and lookup by "equivalence heads".
	(vn_nary_op_insert_into): Insert the new value at the front.
	(vn_nary_op_insert_pieces_predicated_1): Insert as predicated values
	from pieces.
	(fold_const_from_equiv_heads): Fold an N-ary expression of equiv-heads.
	(push_new_nary_ref): Insert a back-reference to vn_nary_op_t.
	(val_equiv_insert): Record a temporary equivalence.
	(vn_nary_op_insert_pieces_predicated): Record equivalences instead of
	some predicates; insert back-refs.
	(record_equiv_from_prev_phi_1): Record temporary equivalences generated
	by PHI nodes.
	(record_equiv_from_prev_phi): Given a taken outgoing edge of a
	conditional expression, record equivalences generated by PHI nodes.
	(visit_nary_op): Add lookup of previous results of N-ary operations by
	equivalences.
	(insert_related_predicates_on_edge): Some predicates can be computed
	from equivalences, so there is no need to insert them.
	(process_bb): Add lookup of predicated values by equivalences.
	(struct unwind_state): Unwind state of back-refs to vn_nary_op_t.
	(do_unwind): Unwind the back-refs to vn_nary_op_t.
	(do_rpo_vn): Update back-reference unwind state.
	* tree-ssa-sccvn.h (struct nary_ref): Hold a list of references to the
	nary map entries.

gcc/testsuite/ChangeLog:

	* g++.dg/pr83541.C: Disable fre.
	* gcc.dg/tree-ssa/pr68619-2.c: Disable fre.
	* gcc.dg/tree-ssa/pr71947-1.c: Disable fre.
	* gcc.dg/tree-ssa/pr71947-2.c: Disable fre.
	* gcc.dg/tree-ssa/pr71947-3.c: Disable fre.
	* gcc.dg/tree-ssa/pr71947-5.c: Disable fre.
	* gcc.dg/tree-ssa/pr71947-7.c: Disable fre.
	* gcc.dg/tree-ssa/pr71947-8.c: Disable fre.
	* gcc.dg/tree-ssa/pr71947-9.c: Disable fre.
	* gcc.dg/tree-ssa/vrp03.c: Disable fre.
	* gcc.dg/tree-ssa/ssa-fre-100.c: New test.
	* gcc.dg/tree-ssa/ssa-fre-101.c: New test.
	* gcc.dg/tree-ssa/ssa-fre-102.c: New test.
	* gcc.dg/tree-ssa/ssa-pre-34.c: New test.

v5-tree-optimization-101186.patch
Description: v5-tree-optimization-101186.patch
[PATCH v4] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Here's a brief summary of the patch:

v4 (this version):
- In process_bb's condition-prediction code: update equivalence-heads if value-numbers have changed, otherwise some chances can be lost.

v3 (a few minor updates):
- Simplify function record_equiv_from_prev_phi_1 by removing an argument.
- Fixed two small bugs that could lead to losing optimization opportunities.
- Link: https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586687.html

v2:
- Use equivalence-heads (disjoint set) to represent temporary equivalences.
- Improved performance by reducing predicate recording.
- The last version and some discussion: https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586116.html
- Earlier discussion: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/580688.html
- The patch: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579607.html

v1:
- The initial patch (use a hashmap to store temporary equivalences): https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573603.html
- Updates (fixed some code-style issues) and some explanation: https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575531.html
- Reply from Richard: https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576185.html

If the patch is accepted, a following step I'm considering is to teach the PRE pass about the temporary equivalences. Currently regression test pr68619-4.c fails with the patch, because a variable's value number turns into a constant, and its equivalent's head is lost in the AVAIL set. I think with this change, more partial redundancies can be found.

Thanks,
Di Zhao

---

Extend FRE with temporary equivalences.

2021-12-24  Di Zhao

gcc/ChangeLog:
PR tree-optimization/101186 * tree-ssa-sccvn.c (VN_INFO): Remove assertions (there could be a predicate already). (dominated_by_p_w_unex): Moved upward. (vn_nary_op_get_predicated_value): Moved upward. (is_vn_valid_at_bb): Check if vn_pval is valid at BB. (lookup_equiv_head): Lookup the "equivalence head" of a given node.
(lookup_equiv_heads): Lookup the "equivalence head"s of given nodes. (vn_tracking_edge): Extracted utility function. (init_vn_nary_op_from_stmt): Insert and lookup by "equivalence head"s. (vn_nary_op_insert_into): Insert new value at the front. (vn_nary_op_insert_pieces_predicated_1): Insert as predicated values from pieces. (fold_const_from_equiv_heads): Fold N-ary expression of equiv-heads. (push_new_nary_ref): Insert a back-reference to vn_nary_op_t. (val_equiv_insert): Record temporary equivalence. (vn_nary_op_insert_pieces_predicated): Record equivalences instead of some predicates; insert back-refs. (record_equiv_from_prev_phi_1): Record temporary equivalences generated by PHI nodes. (record_equiv_from_prev_phi): Given an outgoing edge of a conditional expression taken, record equivalences generated by PHI nodes. (visit_nary_op): Add lookup previous results of N-ary operations by equivalences. (insert_related_predicates_on_edge): Some predicates can be computed from equivalences, no need to insert them. (process_bb): Add lookup predicated values by equivalences. (struct unwind_state): Unwind state of back-refs to vn_nary_op_t. (do_unwind): Unwind the back-refs to vn_nary_op_t. (do_rpo_vn): Update back-reference unwind state. * tree-ssa-sccvn.h (struct nary_ref): hold a lists of references to the nary map entries. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/pr68619-2.c: Disable fre. * gcc.dg/tree-ssa/pr71947-1.c: Disable fre. * gcc.dg/tree-ssa/pr71947-2.c: Disable fre. * gcc.dg/tree-ssa/pr71947-3.c: Disable fre. * gcc.dg/tree-ssa/pr71947-5.c: Disable fre. * gcc.dg/tree-ssa/pr71947-7.c: Disable fre. * gcc.dg/tree-ssa/pr71947-8.c: Disable fre. * gcc.dg/tree-ssa/pr71947-9.c: Disable fre. * gcc.dg/tree-ssa/vrp03.c: Disable fre. * gcc.dg/tree-ssa/ssa-fre-100.c: New test. * gcc.dg/tree-ssa/ssa-fre-101.c: New test. * gcc.dg/tree-ssa/ssa-fre-102.c: New test. * gcc.dg/tree-ssa/ssa-pre-34.c: New test. 
v4-tree-optimization-101186.patch Description: v4-tree-optimization-101186.patch
[PATCH v3] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
A few minor updates on the patch: - Simplify function record_equiv_from_prev_phi_1 by removing an argument. - Fixed two small bugs that could lead to losing optimization opportunities. Thanks, Di Zhao --- Extend FRE with temporary equivalences. 2021-12-13 Di Zhao gcc/ChangeLog: PR tree-optimization/101186 * tree-ssa-sccvn.c (VN_INFO): Remove assertions (there could be a predicate already). (dominated_by_p_w_unex): Moved upward. (vn_nary_op_get_predicated_value): Moved upward. (is_vn_valid_at_bb): Check if vn_pval is valid at BB. (lookup_equiv_head): Lookup the "equivalence head" of a given node. (lookup_equiv_heads): Lookup the "equivalence heads" of given nodes. (vn_tracking_edge): Extracted utility function. (init_vn_nary_op_from_stmt): Insert and lookup by "equivalence heads". (vn_nary_op_insert_into): Insert the new value at the front. (vn_nary_op_insert_pieces_predicated_1): Insert as predicated values from pieces. (fold_const_from_equiv_heads): Fold an N-ary expression of equiv-heads. (push_new_nary_ref): Insert a back-reference to vn_nary_op_t. (val_equiv_insert): Record a temporary equivalence. (vn_nary_op_insert_pieces_predicated): Record equivalences instead of some predicates; insert back-refs. (record_equiv_from_prev_phi_1): Record temporary equivalences generated by PHI nodes. (record_equiv_from_prev_phi): Given a taken outgoing edge of a conditional expression, record equivalences generated by PHI nodes. (visit_nary_op): Add lookup of previous results of N-ary operations by equivalences. (insert_related_predicates_on_edge): Some predicates can be computed from equivalences, no need to insert them. (process_bb): Add lookup of predicated values by equivalences. (struct unwind_state): Unwind state of back-refs to vn_nary_op_t. (do_unwind): Unwind the back-refs to vn_nary_op_t. (do_rpo_vn): Update back-reference unwind state. * tree-ssa-sccvn.h (struct nary_ref): Hold a list of references to the nary map entries.
gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/pr68619-2.c: Disable fre. * gcc.dg/tree-ssa/pr71947-1.c: Disable fre. * gcc.dg/tree-ssa/pr71947-2.c: Disable fre. * gcc.dg/tree-ssa/pr71947-3.c: Disable fre. * gcc.dg/tree-ssa/pr71947-5.c: Disable fre. * gcc.dg/tree-ssa/pr71947-7.c: Disable fre. * gcc.dg/tree-ssa/pr71947-8.c: Disable fre. * gcc.dg/tree-ssa/pr71947-9.c: Disable fre. * gcc.dg/tree-ssa/vrp03.c: Disable fre. * gcc.dg/tree-ssa/ssa-fre-100.c: New test. * gcc.dg/tree-ssa/ssa-fre-101.c: New test. * gcc.dg/tree-ssa/ssa-fre-102.c: New test. * gcc.dg/tree-ssa/ssa-pre-34.c: New test. v3-tree-optimization-101186.patch Description: v3-tree-optimization-101186.patch
RE: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
I'm very sorry there seems to be encoding issue in the attachment in my last email. Attached is the new patch. Thanks, Di Zhao > -Original Message- > From: Di Zhao OS > Sent: Tuesday, November 16, 2021 1:24 AM > To: 'Richard Biener' > Cc: gcc-patches@gcc.gnu.org > Subject: RE: [PATCH v2] tree-optimization/101186 - extend FRE with > "equivalence map" for condition prediction > > Attached is the updated patch. Fixed some errors in testcases. > > > -Original Message- > > From: Richard Biener > > Sent: Wednesday, November 10, 2021 5:44 PM > > To: Di Zhao OS > > Cc: gcc-patches@gcc.gnu.org; Andrew MacLeod > > Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with > > "equivalence map" for condition prediction > > > > On Sun, Oct 24, 2021 at 9:03 PM Di Zhao OS > > wrote: > > > > > > Hi, > > > > > > Attached is a new version of the patch, mainly for improving performance > > > and simplifying the code. > > > > The patch doesn't apply anymore, can you update it please? > > > > I see the new ssa-fre-101.c test already passing without the patch. > > It was a mistake in test ssa-fre-101.c::g to define the variables with > the unsigned integers, in this way "a >= 0" is always true. After modified > the case, now fre1 in the patch can remove unreachable call "foo ()", and > evrp on trunk does not. > > > Likewise ssa-fre-100.c and ssa-fre-102.c would PASS if you scan > > the pass dump after fre1 which is evrp so it seems that evrp already > > handles the equivalences (likely with the relation oracle) now? > > I'm sure there are second order effects when eliminating conditions > > in FRE but did you re-evaluate what made you improve VN to see > > if the cases are handled as expected now without this change? > > In case ssa-fre-102.c, the unreachable call "foo ()" can be removed by > evrp, but fre in the patch can additionally replace "g = x + b" with > "g = 99". (Again I'm sorry the regex to check this was wrong..) 
> > Test cases to simulate the original problem I found are moved into > gcc.dg/tree-ssa/ssa-pre-34.c. The unreachable calls to "foo ()" are > still removed by pre with the patch. (If compiled with -O3, the > "foo ()"s in test file can now be removed by thread2/threadfull2 and > dom3 on trunk. This relies on jump threading across the loops, so > even with -O3, similar cases may not get optimized say if there're > too many statements to copy.) > > Thanks, > Di Zhao > > > > > I will still look at and consider the change btw, but given the EVRP > > improvements I'm also considering to remove the predication > > support from VN alltogether. At least in the non-iterating mode > > it should be trivially easy to use rangers relation oracle to simplify > > predicates. For the iterating mode it might not be 100% effective > > since I'm not sure we can make it use the current SSA values and > > how it would behave with those eventually changing to worse. > > > > Andrew, how would one ask the relation oracle to simplify a > > condition? Do I have to do any bookkeeping to register > > predicates on edges for it? > > > > Thanks, > > Richard. > > > > > First, regarding the comments: > > > > > > > -Original Message- > > > > From: Richard Biener > > > > Sent: Friday, October 1, 2021 9:00 PM > > > > To: Di Zhao OS > > > > Cc: gcc-patches@gcc.gnu.org > > > > Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with > > > > "equivalence map" for condition prediction > > > > > > > > On Thu, Sep 16, 2021 at 8:13 PM Di Zhao OS > > > > wrote: > > > > > > > > > > Sorry about updating on this after so long. It took me much time to > > > > > work out a > > > > > new plan and pass the tests. > > > > > > > > > > The new idea is to use one variable to represent a set of equal > > > > > variables at > > > > > some basic-block. This variable is called a "equivalence head" or > > > > > "equiv-head" > > > > > in the code. (There's no-longer a "equivalence map".) 
> > > > > > > > > > - Initially an SSA_NAME's "equivalence head" is its value number. > > > > > Temporary > > > > > equivalence heads are recorded as unary NOP_EXPR results in the > > > > > vn_nary_op_t > > > > > map. Besides, when inserting into vn_nary_op_t map, make the new > > > > > result at > > > > > front of the vn_pval list, so that when searching for a variable's > > > > > equivalence head, the first result represents the largest > > > > > equivalence set at > > > > > current location. > > > > > - In vn_ssa_aux_t, maintain a list of references to valid_info->nary > > > > > entry. > > > > > For recorded equivalences, the reference is result->entry; for > > > > > normal N-ary > > > > > operations, the reference is operand->entry. > > > > > - When recording equivalences, if one side A is constant or has more > > > > > refs, make > > > > > it the new equivalence head of the other side B. Traverse B's ref- > > > > > list,if a > > > > > variable C's previous equiv-head is B, update to
RE: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Attached is the updated patch. Fixed some errors in the testcases. > -Original Message- > From: Richard Biener > Sent: Wednesday, November 10, 2021 5:44 PM > To: Di Zhao OS > Cc: gcc-patches@gcc.gnu.org; Andrew MacLeod > Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with > "equivalence map" for condition prediction > > On Sun, Oct 24, 2021 at 9:03 PM Di Zhao OS > wrote: > > > > Hi, > > > > Attached is a new version of the patch, mainly for improving performance > > and simplifying the code. > > The patch doesn't apply anymore, can you update it please? > > I see the new ssa-fre-101.c test already passing without the patch. It was a mistake in test ssa-fre-101.c::g to define the variables as unsigned integers; that way "a >= 0" is always true. After modifying the case, fre1 with the patch can now remove the unreachable call "foo ()", while evrp on trunk does not. > Likewise ssa-fre-100.c and ssa-fre-102.c would PASS if you scan > the pass dump after fre1 which is evrp so it seems that evrp already > handles the equivalences (likely with the relation oracle) now? > I'm sure there are second order effects when eliminating conditions > in FRE but did you re-evaluate what made you improve VN to see > if the cases are handled as expected now without this change? In case ssa-fre-102.c, the unreachable call "foo ()" can be removed by evrp, but fre in the patch can additionally replace "g = x + b" with "g = 99". (Again, I'm sorry the regex to check this was wrong.) Test cases simulating the original problem I found are moved into gcc.dg/tree-ssa/ssa-pre-34.c. The unreachable calls to "foo ()" are still removed by pre with the patch. (If compiled with -O3, the "foo ()"s in the test file can now be removed by thread2/threadfull2 and dom3 on trunk. This relies on jump threading across the loops, so even with -O3, similar cases may not get optimized, say if there are too many statements to copy.)
Thanks, Di Zhao > > I will still look at and consider the change btw, but given the EVRP > improvements I'm also considering to remove the predication > support from VN alltogether. At least in the non-iterating mode > it should be trivially easy to use rangers relation oracle to simplify > predicates. For the iterating mode it might not be 100% effective > since I'm not sure we can make it use the current SSA values and > how it would behave with those eventually changing to worse. > > Andrew, how would one ask the relation oracle to simplify a > condition? Do I have to do any bookkeeping to register > predicates on edges for it? > > Thanks, > Richard. > > > First, regarding the comments: > > > > > -Original Message- > > > From: Richard Biener > > > Sent: Friday, October 1, 2021 9:00 PM > > > To: Di Zhao OS > > > Cc: gcc-patches@gcc.gnu.org > > > Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with > > > "equivalence map" for condition prediction > > > > > > On Thu, Sep 16, 2021 at 8:13 PM Di Zhao OS > > > wrote: > > > > > > > > Sorry about updating on this after so long. It took me much time to > > > > work out a > > > > new plan and pass the tests. > > > > > > > > The new idea is to use one variable to represent a set of equal > > > > variables at > > > > some basic-block. This variable is called a "equivalence head" or > > > > "equiv-head" > > > > in the code. (There's no-longer a "equivalence map".) > > > > > > > > - Initially an SSA_NAME's "equivalence head" is its value number. > > > > Temporary > > > > equivalence heads are recorded as unary NOP_EXPR results in the > > > > vn_nary_op_t > > > > map. Besides, when inserting into vn_nary_op_t map, make the new > > > > result at > > > > front of the vn_pval list, so that when searching for a variable's > > > > equivalence head, the first result represents the largest equivalence > > > > set at > > > > current location. 
> > > > - In vn_ssa_aux_t, maintain a list of references to valid_info->nary > > > > entry. > > > > For recorded equivalences, the reference is result->entry; for normal > > > > N-ary > > > > operations, the reference is operand->entry. > > > > - When recording equivalences, if one side A is constant or has more > > > > refs, make > > > > it the new equivalence head of the other side B. Traverse B's > > > > ref-list,if a > > > > variable C's previous equiv-head is B, update to A. And re-insert B's > > > > n-ary > > > > operations by replacing B with A. > > > > - When inserting and looking for the results of n-ary operations, > > > > insert and > > > > lookup by the operands' equiv-heads. > > > > ... > > > > > > > > Thanks, > > > > Di Zhao > > > > > > > > > > > > Extend FRE with temporary equivalences. > > > > > > Comments on the patch: > > > > > > + /* nary_ref count. */ > > > + unsigned num_nary_ref; > > > + > > > > > > I think a unsigned short should be enough and that would nicely > > > pack after value_id together with the bitfield (maybe change
[PING] [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Hi, Gentle ping on this. Di Zhao -Original Message- From: Di Zhao OS Sent: Monday, October 25, 2021 3:03 AM To: Richard Biener Cc: gcc-patches@gcc.gnu.org Subject: RE: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction Hi, Attached is a new version of the patch, mainly for improving performance and simplifying the code. First, regarding the comments: > -Original Message- > From: Richard Biener > Sent: Friday, October 1, 2021 9:00 PM > To: Di Zhao OS > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with > "equivalence map" for condition prediction > > On Thu, Sep 16, 2021 at 8:13 PM Di Zhao OS > wrote: > > > > Sorry about updating on this after so long. It took me much time to work > > out a > > new plan and pass the tests. > > > > The new idea is to use one variable to represent a set of equal variables at > > some basic-block. This variable is called a "equivalence head" or > > "equiv-head" > > in the code. (There's no-longer a "equivalence map".) > > > > - Initially an SSA_NAME's "equivalence head" is its value number. Temporary > > equivalence heads are recorded as unary NOP_EXPR results in the > > vn_nary_op_t > > map. Besides, when inserting into vn_nary_op_t map, make the new result at > > front of the vn_pval list, so that when searching for a variable's > > equivalence head, the first result represents the largest equivalence set > > at > > current location. > > - In vn_ssa_aux_t, maintain a list of references to valid_info->nary entry. > > For recorded equivalences, the reference is result->entry; for normal > > N-ary > > operations, the reference is operand->entry. > > - When recording equivalences, if one side A is constant or has more refs, > > make > > it the new equivalence head of the other side B. Traverse B's ref-list, > > if a > > variable C's previous equiv-head is B, update to A. And re-insert B's > > n-ary > > operations by replacing B with A. 
> > - When inserting and looking for the results of n-ary operations, insert and > > lookup by the operands' equiv-heads. > > ... > > > > Thanks, > > Di Zhao > > > > > > Extend FRE with temporary equivalences. > > Comments on the patch: > > + /* nary_ref count. */ > + unsigned num_nary_ref; > + > > I think a unsigned short should be enough and that would nicely > pack after value_id together with the bitfield (maybe change that > to unsigned short :1 then). Changed num_nary_ref to unsigned short and moved after value_id. > @@ -7307,17 +7839,23 @@ process_bb (rpo_elim , basic_block bb, > tree val = gimple_simplify (gimple_cond_code (last), > boolean_type_node, lhs, rhs, > NULL, vn_valueize); > + vn_nary_op_t vnresult = NULL; > /* If the condition didn't simplfy see if we have recorded >an expression from sofar taken edges. */ > if (! val || TREE_CODE (val) != INTEGER_CST) > { > - vn_nary_op_t vnresult; > > looks like you don't need vnresult outside of the if()? vnresult is reused later to record equivalences generated by PHI nodes. > +/* Find predicated value of vn_nary_op by the operands' equivalences. Return > + * NULL_TREE if no known result is found. */ > + > +static tree > +find_predicated_value_by_equivs (vn_nary_op_t vno, basic_block bb, > +vn_nary_op_t *vnresult) > +{ > + lookup_equiv_heads (vno->length, vno->op, vno->op, bb); > + tree result > += simplify_nary_op (vno->length, vno->opcode, vno->op, vno->type); > > why is it necessary to simplify here? It looks like the caller > already does this. In the new patch, changed the code a little to remove redundant calculation. > I wonder whether it's valid to always perform find_predicated_value_by_equivs > from inside vn_nary_op_get_predicated_value instead of bolting it to only > a single place? Removed function find_predicated_value_by_equivs and inlined the code. Because lookup_equiv_head uses vn_nary_op_get_predicated_value, so I left vn_nary_op_get_predicated_value unchanged. 
Instead, operands are set to equiv-heads in init_vn_nary_op_from_stmt. So altogether, predicates are always inserted and searched by equiv-heads. > + > +static vn_nary_op_t > +val_equiv_insert (tree op1, tree op2, edge e) > +{ > > + if (is_gimple_min_invariant (lhs)) > +std::swap (lhs, rhs); > + if (is_gimple_min_invariant (lhs) || TREE_CODE (lhs) != SSA_NAME) > +/* Possible if invoked from record_equiv_from_previous_cond. */ > +return NULL; > > Better formulate all of the above in terms of only SSA_NAME checks since... > > + /* Make the hand-side with more recorded n-ary expressions new > + * equivalence-head, to make fewer re-insertions. */ > + if (TREE_CODE (rhs) == SSA_NAME > + && VN_INFO
RE: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Hi, Attached is a new version of the patch, mainly for improving performance and simplifying the code. First, regarding the comments: > -Original Message- > From: Richard Biener > Sent: Friday, October 1, 2021 9:00 PM > To: Di Zhao OS > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with > "equivalence map" for condition prediction > > On Thu, Sep 16, 2021 at 8:13 PM Di Zhao OS > wrote: > > > > Sorry about updating on this after so long. It took me much time to work > > out a > > new plan and pass the tests. > > > > The new idea is to use one variable to represent a set of equal variables at > > some basic-block. This variable is called a "equivalence head" or > > "equiv-head" > > in the code. (There's no-longer a "equivalence map".) > > > > - Initially an SSA_NAME's "equivalence head" is its value number. Temporary > > equivalence heads are recorded as unary NOP_EXPR results in the > > vn_nary_op_t > > map. Besides, when inserting into vn_nary_op_t map, make the new result at > > front of the vn_pval list, so that when searching for a variable's > > equivalence head, the first result represents the largest equivalence set > > at > > current location. > > - In vn_ssa_aux_t, maintain a list of references to valid_info->nary entry. > > For recorded equivalences, the reference is result->entry; for normal > > N-ary > > operations, the reference is operand->entry. > > - When recording equivalences, if one side A is constant or has more refs, > > make > > it the new equivalence head of the other side B. Traverse B's ref-list, > > if a > > variable C's previous equiv-head is B, update to A. And re-insert B's > > n-ary > > operations by replacing B with A. > > - When inserting and looking for the results of n-ary operations, insert and > > lookup by the operands' equiv-heads. > > ... > > > > Thanks, > > Di Zhao > > > > > > Extend FRE with temporary equivalences. > > Comments on the patch: > > + /* nary_ref count. 
*/ > + unsigned num_nary_ref; > + > > I think a unsigned short should be enough and that would nicely > pack after value_id together with the bitfield (maybe change that > to unsigned short :1 then). Changed num_nary_ref to unsigned short and moved after value_id. > @@ -7307,17 +7839,23 @@ process_bb (rpo_elim , basic_block bb, > tree val = gimple_simplify (gimple_cond_code (last), > boolean_type_node, lhs, rhs, > NULL, vn_valueize); > + vn_nary_op_t vnresult = NULL; > /* If the condition didn't simplfy see if we have recorded >an expression from sofar taken edges. */ > if (! val || TREE_CODE (val) != INTEGER_CST) > { > - vn_nary_op_t vnresult; > > looks like you don't need vnresult outside of the if()? vnresult is reused later to record equivalences generated by PHI nodes. > +/* Find predicated value of vn_nary_op by the operands' equivalences. Return > + * NULL_TREE if no known result is found. */ > + > +static tree > +find_predicated_value_by_equivs (vn_nary_op_t vno, basic_block bb, > +vn_nary_op_t *vnresult) > +{ > + lookup_equiv_heads (vno->length, vno->op, vno->op, bb); > + tree result > += simplify_nary_op (vno->length, vno->opcode, vno->op, vno->type); > > why is it necessary to simplify here? It looks like the caller > already does this. In the new patch, changed the code a little to remove redundant calculation. > I wonder whether it's valid to always perform find_predicated_value_by_equivs > from inside vn_nary_op_get_predicated_value instead of bolting it to only > a single place? Removed function find_predicated_value_by_equivs and inlined the code. Because lookup_equiv_head uses vn_nary_op_get_predicated_value, so I left vn_nary_op_get_predicated_value unchanged. Instead, operands are set to equiv-heads in init_vn_nary_op_from_stmt. So altogether, predicates are always inserted and searched by equiv-heads. 
> + > +static vn_nary_op_t > +val_equiv_insert (tree op1, tree op2, edge e) > +{ > > + if (is_gimple_min_invariant (lhs)) > +std::swap (lhs, rhs); > + if (is_gimple_min_invariant (lhs) || TREE_CODE (lhs) != SSA_NAME) > +/* Possible if invoked from record_equiv_from_previous_cond. */ > +return NULL; > > Better formulate all of the above in terms of only SSA_NAME checks since... > > + /* Make the hand-side with more recorded n-ary expressions new > + * equivalence-head, to make fewer re-insertions. */ > + if (TREE_CODE (rhs) == SSA_NAME > + && VN_INFO (rhs)->num_nary_ref < VN_INFO (lhs)->num_nary_ref) > +std::swap (lhs, rhs); > > here LHS needs to be an SSA_NAME. Tried to fix this in the new patch. > + /* Record equivalence as unary NOP_EXPR. */ > + vn_nary_op_t val > += vn_nary_op_insert_pieces_predicated_1 (1, NOP_EXPR,
PING: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Thanks, Di -Original Message- From: Gcc-patches On Behalf Of Di Zhao OS via Gcc-patches Sent: Friday, September 17, 2021 2:13 AM To: gcc-patches@gcc.gnu.org Subject: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction Sorry about updating on this after so long. It took me much time to work out a new plan and pass the tests. The new idea is to use one variable to represent a set of equal variables at some basic-block. This variable is called a "equivalence head" or "equiv-head" in the code. (There's no-longer a "equivalence map".) - Initially an SSA_NAME's "equivalence head" is its value number. Temporary equivalence heads are recorded as unary NOP_EXPR results in the vn_nary_op_t map. Besides, when inserting into vn_nary_op_t map, make the new result at front of the vn_pval list, so that when searching for a variable's equivalence head, the first result represents the largest equivalence set at current location. - In vn_ssa_aux_t, maintain a list of references to valid_info->nary entry. For recorded equivalences, the reference is result->entry; for normal N-ary operations, the reference is operand->entry. - When recording equivalences, if one side A is constant or has more refs, make it the new equivalence head of the other side B. Traverse B's ref-list, if a variable C's previous equiv-head is B, update to A. And re-insert B's n-ary operations by replacing B with A. - When inserting and looking for the results of n-ary operations, insert and lookup by the operands' equiv-heads. So except for the refs in vn_ssa_aux_t, this scheme uses the original infrastructure to its best. Quadric search time is avoided at the cost of some re-insertions. 
Test results on SPEC2017 intrate (counts and percentages):

                |more bb |more bb |more stmt|more stmt|more       |more
                |removed |removed |removed  |removed  |nv_nary_ops|nv_nary_ops
                |at fre1 |at fre1 |at fre1  |at fre1  |inserted   |inserted
-----------------------------------------------------------------------------
 500.perlbench_r|  64    |  1.98% |  103    |  0.19%  |  11260    |  12.16%
 502.gcc_r      |  671   |  4.80% |  317    |  0.23%  |  13964    |  6.09%
 505.mcf_r      |  5     | 35.71% |  9      |  1.40%  |  32       |  2.52%
 520.omnetpp    |  132   |  5.45% |  39     |  0.11%  |  1895     |  3.91%
 523.xalancbmk_r|  238   |  3.26% |  313    |  0.36%  |  1417     |  1.27%
 525.x264_r     |  4     |  1.36% |  27     |  0.11%  |  1752     |  6.78%
 531.deepsjeng_r|  1     |  3.45% |  2      |  0.14%  |  228      |  8.67%
 541.leela_r    |  2     |  0.63% |  0      |  0.00%  |  92       |  1.14%
 548.exchange2_r|  0     |  0.00% |  3      |  0.04%  |  49       |  1.03%
 557.xz_r       |  0     |  0.00% |  3      |  0.07%  |  272      |  7.55%

There are more basic blocks and statements removed compared with the last implementation; the reasons are:
1) "CONST op CONST" simplification is included. It was missed in the previous patch.
2) By inserting the RHS of statements on equiv-heads, more N-ary operations can be simplified. One example is in 'ssa-fre-97.c' in the patch file.

While jump-threading & vrp also utilize temporary equivalences (so some of the newly removed blocks and statements can also be covered by them), I think this patch is a supplement, in cases when jump threading cannot take place (the original example), or value number info needs to be involved (the 'ssa-fre-97.c' example). Fixed the former issue with non-iterate mode.

About recording the temporary equivalences generated by PHIs (i.e. the 'record_equiv_from_previous_cond' stuff), I have to admit it looks strange and the code size is large, but I haven't found a better way yet. Consider a piece of CFG like the one below: if we want to record x==x2 on the true edge when processing bb1, the location (following current practice) will be bb2. But that is not useful at bb5 or bb6, because bb2 doesn't dominate them. And I can't find a place to record x==x1 when processing bb1.
If we can record things on edges rather than blocks, say x==x1 on 1->3 and x==x2 on 1->2, then perhaps with an extra check for "a!=0", x2 can be a valid equiv-head for x since bb5. But I think it lacks efficiency and is not persuasive. It is more efficient to find a valid previous predicate when processing bb4, because the vn_nary_op_t will be fetched anyway. -- | if (a != 0) | bb1 -- f | \ t |--- || bb2 | |--- | / - | x = PHI | bb3 - | | -- | if (a != 0) | bb4 -- |f \t - --- bb7 | where | | bb5 | ==&g
[PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
Sorry about updating on this after so long. It took me much time to work out a new plan and pass the tests.

The new idea is to use one variable to represent a set of equal variables at some basic block. This variable is called an "equivalence head" or "equiv-head" in the code. (There is no longer an "equivalence map".)

- Initially, an SSA_NAME's equivalence head is its value number. Temporary equivalence heads are recorded as unary NOP_EXPR results in the vn_nary_op_t map. Besides, when inserting into the vn_nary_op_t map, place the new result at the front of the vn_pval list, so that when searching for a variable's equivalence head, the first result represents the largest equivalence set at the current location.

- In vn_ssa_aux_t, maintain a list of references to valid_info->nary entries. For recorded equivalences, the reference is result->entry; for normal N-ary operations, the reference is operand->entry.

- When recording equivalences, if one side A is constant or has more refs, make it the new equivalence head of the other side B. Traverse B's ref-list; if a variable C's previous equiv-head is B, update it to A. And re-insert B's N-ary operations, replacing B with A.

- When inserting and looking up the results of N-ary operations, insert and look up by the operands' equiv-heads.

So, except for the refs in vn_ssa_aux_t, this scheme makes the best use of the original infrastructure. Quadratic search time is avoided at the cost of some re-insertions.
Test results on SPEC2017 intrate (counts and percentages):

                 |more bb |more bb |more stmt|more stmt|more       |more
                 |removed |removed |removed  |removed  |vn_nary_ops|vn_nary_ops
                 |at fre1 |at fre1 |at fre1  |at fre1  |inserted   |inserted
  -----------------------------------------------------------------------------
  500.perlbench_r|   64   |  1.98% |   103   |  0.19%  |   11260   |  12.16%
  502.gcc_r      |  671   |  4.80% |   317   |  0.23%  |   13964   |   6.09%
  505.mcf_r      |    5   | 35.71% |     9   |  1.40%  |      32   |   2.52%
  520.omnetpp    |  132   |  5.45% |    39   |  0.11%  |    1895   |   3.91%
  523.xalancbmk_r|  238   |  3.26% |   313   |  0.36%  |    1417   |   1.27%
  525.x264_r     |    4   |  1.36% |    27   |  0.11%  |    1752   |   6.78%
  531.deepsjeng_r|    1   |  3.45% |     2   |  0.14%  |     228   |   8.67%
  541.leela_r    |    2   |  0.63% |     0   |  0.00%  |      92   |   1.14%
  548.exchange2_r|    0   |  0.00% |     3   |  0.04%  |      49   |   1.03%
  557.xz_r       |    0   |  0.00% |     3   |  0.07%  |     272   |   7.55%

More basic blocks and statements are removed compared with the last implementation, for two reasons: 1) "CONST op CONST" simplification is now included; it was missed in the previous patch. 2) By inserting the RHS of statements keyed on equiv-heads, more N-ary operations can be simplified. One example is in 'ssa-fre-97.c' in the patch file.

While jump threading & VRP also utilize temporary equivalences (so some of the newly removed blocks and statements can also be covered by them), I think this patch is a supplement, for cases where jump threading cannot take place (the original example) or value number info needs to be involved (the 'ssa-fre-97.c' example). Fixed the former issue with non-iterate mode.

About recording the temporary equivalences generated by PHIs (i.e. the 'record_equiv_from_previous_cond' stuff), I have to admit it looks strange and the code size is large, but I haven't found a better way yet. Consider a piece of CFG like the one below. If we want to record x==x2 on the true edge when processing bb1, the location (following current practice) will be bb2. But that is not useful at bb5 or bb6, because bb2 doesn't dominate them. And I can't find a place to record x==x1 when processing bb1.
If we could record things on edges rather than blocks, say x==x1 on 1->3 and x==x2 on 1->2, then perhaps with an extra check for "a!=0", x2 could be a valid equiv-head for x from bb5 onward. But I think that lacks efficiency and is not persuasive. It is more efficient to find a valid previous predicate when processing bb4, because the vn_nary_op_t will be fetched anyway.

        ---------------
        | if (a != 0) |  bb1
        ---------------
       f |         \ t
         |     ---------
         |     |  bb2  |
         |     ---------
         |    /
      -----------
      | x = PHI |  bb3
      -----------
           |
        ---------------
        | if (a != 0) |  bb4
        ---------------
        f |        \ t
   -----------   ---------
   |   bb7   |   |  bb5  |  ==> where "x==x2" is recorded now
   |  where  |   ---------
   | "x==x1" |      | \
   |   is    |      |  ...
   | recorded|      |
   |   now   |   ---------
   -----------   |  bb6  |  ==> where "x==x2" needs to be used
                 ---------

Although I think I can remove the 'dominator_to_phi_map' and generalize this a little, the major logic will be similar. So I
[PATCH] tree-optimization/102183 - sccvn: fix result compare in vn_nary_op_insert_into
If the first predicate value is different and copied, the comparison will then be between val->result and the copied one, which seems to be a bug. That can cause extra vn_pvals to be inserted.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Regards,
Di Zhao

gcc/ChangeLog:

	* tree-ssa-sccvn.c (vn_nary_op_insert_into): Fix result compare.

---
 gcc/tree-ssa-sccvn.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 2357bbdbf90..bfa516b6cf9 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -4105,7 +4105,7 @@ vn_nary_op_insert_into (vn_nary_op_t vno, vn_nary_op_table_type *table,
   bool found = false;
   for (vn_pval *val = (*slot)->u.values; val; val = val->next)
     {
-      if (expressions_equal_p (val->result, vno->u.values->result))
+      if (expressions_equal_p (val->result, nval->result))
 	{
 	  found = true;
 	  for (unsigned i = 0; i < val->n; ++i)
-- 
2.25.1
RE: [PATCH] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction
I tried to improve the patch following your advice and to catch more opportunities. Hope it'll be helpful.

On 6/24/21 8:29 AM, Richard Biener wrote:
> On Thu, Jun 24, 2021 at 11:55 AM Di Zhao via Gcc-patches patc...@gcc.gnu.org> wrote:
>
> I have some reservations about extending the ad-hoc "predicated value" code.
> Some comments on the patch:
>
> +/* hashtable & helpers to record equivalences at given bb.  */
> +
> +typedef struct val_equiv_s
> +{
> +  val_equiv_s *next;
> +  val_equiv_s *unwind_to;
> +  hashval_t hashcode;
> +  /* SSA name this val_equiv_s is associated with.  */
> +  tree name;
> +  /* RESULT in a vn_pval entry is SSA name of a equivalence.  */
> +  vn_pval *values;
> +} * val_equiv_t;
>
> all of this (and using a hashtable for recording) is IMHO a bit overkill.
> Since you only ever record equivalences for values the more natural place
> to hook those in is the vn_ssa_aux structure where we also record the
> availability chain.

I tried to store the equivalences in the vn_ssa_aux structure, but I didn't optimize the second case successfully: I need to record the equivalence of a PHI expression's result and arguments, but their value number results will become VARYING first, so they won't be changed. Maybe I'm missing something, or can I force-change a VARYING result? Besides, the way they are currently used, equivalences only need to be "predictable" rather than available; maybe availability chains do not represent them very well?

> There's little commentary in the new code, in particular function-level
> comments are missing everywhere.

Added more comments.

> There's complexity issues, like I see val_equiv_insert has a "recurse"
> feature but also find_predicated_value_by_equiv is quadratic in the number
> of equivalences of the lhs/rhs.  Without knowing what the recursion on the
> former is for - nothing tells me - I suspect either of both should be
> redundant.
The intention was, given {A==B, B==C, X==Y, Y==Z} and a previous result of "C opcode Z", to find the result of "A opcode Y". I removed the "recurse" feature and modified the searching logic to solve the issue. Now a temporary hash_set is used to record the equivalences that are visited during the search.

> You seem to record equivalences at possible use points which looks odd at
> best - I'd expected equivalences being recorded at the same point we record
> predicated values and for the current condition, not the one determining
> some other predication.
> What was the motivation to do it the way you do it?

The purpose is to "bring down" what can be known from a previous basic block that effectively dominates the current block, but does not actually do so (in the example it is because jump threading is hindered by a loop). For example, in this case:

  if (a != 0)
    // Nothing useful can be recorded here, because this BB doesn't dominate
    // the BB that we want to simplify.
    c = b;
  for (unsigned i = 0; i < c; i++)
    {
      if (a != 0)  // The recording is triggered here.
        {
          // c == b will be recorded here, so it can be used for simplification.
          // In gimple it is the equivalence of a PHI's result and argument.
          if (i >= b)
            foo ();

This requires finding a previous condition identical to the current one, so it is convenient to do this in FRE. Besides, as FRE records derived predicates, for related conditions there also might be opportunities for optimization. In the new patch this is included. Additionally, to find more opportunities, I added a hashmap to store mappings from immediate dominators to basic blocks with PHIs of interest.

> Why is the code conditional on 'iterate'?

I haven't worked it out to fit the non-iterate mode, so it currently breaks the if-conversion pass. I think this is because some of the equivalence recordings are too optimistic for non-iterate mode.
> You've probably realized that there's no "nice" way to handle temporary
> equivalences in a VN scheme using hashing for expressions (unless you
> degrade hashes a lot).

I modified the code to use TREE_HASH on SSA names. Would that be better?

> You quote opportunities that are catched with this like
>
> +  if (a != 0)
> +    {
> +      c = b;
> +    }
> +  for (unsigned i = 0; i < c; i++)
> +    {
> +      if (a != 0)
> +        {
> +          if (i >= b)
> +            /* Should be eliminated.  */
> +            foo ();
>
> but say other "obvious" cases like
>
> +  if (a != 0)
> +    {
> +      c = b;
> +    }
> +  for (unsigned i = 0; i < c; i++)
> +    {
> +      if (a != 0)
> +        {
> +          /* Should be zero.  */
> +          return b - c;
>
> are not handled.  That's in line with the current "value predication"
> which mainly aims at catching simple jump threading opportunities; you only
> simplify conditions with the recorded equivalences.  But then the complexity
> of handling equivalences does probably not outweight the opportunities
> catched -