On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS
<diz...@os.amperecomputing.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Richard Biener <richard.guent...@gmail.com>
> > Sent: Tuesday, August 29, 2023 4:09 PM
> > To: Di Zhao OS <diz...@os.amperecomputing.com>
> > Cc: Jeff Law <jeffreya...@gmail.com>; Martin Jambor <mjam...@suse.cz>; gcc-
> > patc...@gcc.gnu.org
> > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > reduce cross backedge FMA
> >
> > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> > <diz...@os.amperecomputing.com> wrote:
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener <richard.guent...@gmail.com>
> > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > To: Jeff Law <jeffreya...@gmail.com>; Martin Jambor <mjam...@suse.cz>
> > > > Cc: Di Zhao OS <diz...@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > > > reduce cross backedge FMA
> > > >
> > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > >
> > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > swap the operands so that more FMAs can be generated.
> > > > > > After widening_mul the result looks like:
> > > > > >
> > > > > > _1 = .FMA(a, b, acc_0);
> > > > > > acc_1 = .FMA(c, d, _1);
> > > > > >
> > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > widening_mul's result was like:
> > > > > >
> > > > > > _1 = a * b;
> > > > > > _2 = .FMA (c, d, _1);
> > > > > > acc_1 = acc_0 + _2;
> > > >
> > > > How can we execute the multiply and the FMA in parallel?  They
> > > > depend on each other.  Or is it that the uarch can handle dependence
> > > > on the add operand in some better way, but only when it comes from a
> > > > multiplication and not an FMA?  (I'd doubt so much complexity.)
> > > >
> > > > Can you explain in more detail how the uarch executes one vs. the
> > > > other case?
>
> Here's my understanding after consulting our hardware team. For the
> second case, the uarch of some out-of-order processors can calculate
> "_2" of several iterations at the same time, since there's no dependency
> among different iterations. For the first case, however, the next iteration
> has to wait for the current iteration to finish, so that "acc_0"'s value is
> known. I assume it is also the case in some i386 processors, since I
> saw that the patch "Deferring FMA transformations in tight loops" also
> changed corresponding files.
That should be true for all kinds of operations, no?  Thus it means
reassoc should in general associate cross-iteration accumulation
last?  Historically we associated those first because that's how the
vectorizer liked to see them, but I think that's no longer necessary.
It should be achievable by properly biasing the operand during rank
computation (don't we already do that?).

> > > > > > If the code fragment is in a loop, some architectures can execute
> > > > > > the latter in parallel, so the performance can be much faster than
> > > > > > the former. For the small testcase, the performance gap is over
> > > > > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > > > > turning the last statement into an FMA, and keep it a PLUS_EXPR as
> > > > > > much as possible. (If we are rewriting the op list into parallel
> > > > > > form, no special treatment is needed, since the last statement after
> > > > > > rewrite_expr_tree_parallel will be a PLUS_EXPR anyway.)
> > > > > >
> > > > > > 2. Add function result_feeds_back_from_phi_p to check for cross
> > > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > > state of FMA candidates.
> > > > > >
> > > > > > With this patch, there's a 3% improvement in the 510.parest_r 1-copy
> > > > > > run on ampere1. The compile options are:
> > > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > > >
> > > > > > Best regards,
> > > > > > Di Zhao
> > > > > >
> > > > > > ----
> > > > > >
> > > > > > PR tree-optimization/110279
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > > >         describe the state of FMA candidates for an op list.
> > > > > >         (rewrite_expr_tree_parallel): Changed boolean
> > > > > >         parameter to enum type.
> > > > > >         (result_feeds_back_from_phi_p): New function to check
> > > > > >         for cross backedge dependency.
> > > > > >         (rank_ops_for_fma): Return enum fma_state.  Added new
> > > > > >         parameter.
> > > > > >         (reassociate_bb): If there's a backedge dependency in an
> > > > > >         op list, swap the operands before rewrite_expr_tree.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         * gcc.dg/pr110279.c: New test.
> > > > >
> > > > > Not a review, but more of a question -- isn't this transformation's
> > > > > profitability uarch sensitive?  I.e., just because it's bad for a set of
> > > > > aarch64 uarches doesn't mean it's bad everywhere.
> > > > >
> > > > > And in general we shy away from trying to adjust gimple code based on
> > > > > uarch preferences.
> > > > >
> > > > > It seems the right place to do this is gimple->rtl expansion.
> > > >
> > > > Another comment is that FMA forming has this deferring code which I
> > > > think deals exactly with this kind of thing?  CCing Martin who did this
> > > > work based on AMD uarchs also not wanting cross-loop dependences
> > > > on FMAs (or so).  In particular I see
> > > >
> > > >   if (fma_state.m_deferring_p
> > > >       && fma_state.m_initial_phi)
> > > >     {
> > > >       gcc_checking_assert (fma_state.m_last_result);
> > > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > > >                                                  &m_last_result_set))
> > > >         cancel_fma_deferring (&fma_state);
> > > >
> > > > and I think code to avoid FMAs in other/related cases should be here
> > > > as well, like avoiding forming back-to-back FMAs.
> > >
> > > The changes in this patch are controlled by "param_avoid_fma_max_bits", so
> > > I think they should only affect architectures with similar behavior. (The
> > > parameter was added in a previous patch, "Deferring FMA transformations
> > > in tight loops", which seems to be dealing with the same issue.)
> >
> > That's what I said.  Is the pipeline behavior properly modeled by the
> > scheduler description?  Your patch seems to affect not only loops,
> > but the FMA forming case already present does - is the case you are
> > after not in a tight loop?
> > In principle, with a loop with enough scheduling
> > freedom this should be a non-issue - unless the pipeline model isn't
> > accurate and we put the FMAs too close together?
>
> The current deferring FMA scheme discards all the FMA candidates in an FMA
> chain that has a loop dependency. So for this case there will be too few (zero)
> FMAs generated; while before the "Handle FMA friendly..." patch, or after this
> patch, we can still have 1 FMA. I also tried changing the deferring FMA scheme
> so that only the last FMA is discarded, but it still can't solve the regression.
> I think that's because this deferring scheme is in the later pass widening_mul,
> which only decides whether to form FMAs, but doesn't manipulate the tree for
> better data parallelism. So it seems better to optimize for data parallelism
> (swapping the operands) in reassoc if we know there can be FMAs with a loop
> dependency, and leave the current deferring FMA scheme as the last resort?
> > > > Richard.
> > > >
> > Richard.
> >
> >
> > > > > Jeff
>
>
> Thanks,
> Di Zhao