> -----Original Message-----
> From: Richard Biener <richard.guent...@gmail.com>
> Sent: Thursday, August 31, 2023 8:23 PM
> To: Di Zhao OS <diz...@os.amperecomputing.com>
> Cc: Jeff Law <jeffreya...@gmail.com>; Martin Jambor <mjam...@suse.cz>; gcc-patc...@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
> 
> On Wed, Aug 30, 2023 at 11:33 AM Di Zhao OS
> <diz...@os.amperecomputing.com> wrote:
> >
> > Hello Richard,
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guent...@gmail.com>
> > > Sent: Tuesday, August 29, 2023 7:11 PM
> > > To: Di Zhao OS <diz...@os.amperecomputing.com>
> > > Cc: Jeff Law <jeffreya...@gmail.com>; Martin Jambor <mjam...@suse.cz>; gcc-patc...@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS
> > > <diz...@os.amperecomputing.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <richard.guent...@gmail.com>
> > > > > Sent: Tuesday, August 29, 2023 4:09 PM
> > > > > To: Di Zhao OS <diz...@os.amperecomputing.com>
> > > > > Cc: Jeff Law <jeffreya...@gmail.com>; Martin Jambor <mjam...@suse.cz>; gcc-patc...@gcc.gnu.org
> > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
> > > > >
> > > > > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> > > > > <diz...@os.amperecomputing.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Richard Biener <richard.guent...@gmail.com>
> > > > > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > > > > To: Jeff Law <jeffreya...@gmail.com>; Martin Jambor <mjam...@suse.cz>
> > > > > > > Cc: Di Zhao OS <diz...@os.amperecomputing.com>; gcc-patc...@gcc.gnu.org
> > > > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
> > > > > > >
> > > > > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > > > > >
> > > > > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > > > > swap the operands so that more FMAs can be generated.
> > > > > > > > > After widening_mul the result looks like:
> > > > > > > > >
> > > > > > > > >     _1 = .FMA(a, b, acc_0);
> > > > > > > > >     acc_1 = .FMA(c, d, _1);
> > > > > > > > >
> > > > > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > > > > widening_mul's result was like:
> > > > > > > > >
> > > > > > > > >     _1 = a * b;
> > > > > > > > >     _2 = .FMA (c, d, _1);
> > > > > > > > >     acc_1 = acc_0 + _2;
> > > > > > >
> > > > > > > How can we execute the multiply and the FMA in parallel?  They
> > > > > > > depend on each other.  Or is it that the uarch can handle the
> > > > > > > dependence on the add operand in some better way, but only when it
> > > > > > > comes from a multiplication and not an FMA?  (I'd doubt so much complexity.)
> > > > > > >
> > > > > > > Can you explain in more detail how the uarch executes one vs. the
> > > > > > > other case?
> > > >
> > > > Here's my understanding after consulting our hardware team. For the
> > > > second case, the uarch of some out-of-order processors can calculate
> > > > "_2" for several iterations at the same time, since there's no
> > > > dependency among different iterations. For the first case, on the
> > > > other hand, the next iteration has to wait for the current iteration
> > > > to finish so that "acc_0"'s value is known. I assume this is also the
> > > > case on some i386 processors, since I saw that the patch "Deferring
> > > > FMA transformations in tight loops" also changed the corresponding files.
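
To make the dependency concrete, here is a minimal C sketch of the kind of
accumulation loop being discussed (the function and array names are
illustrative, not taken from the benchmark):

    /* Each iteration computes acc = a * b + c * d + acc.  In the
       chained-FMA form above, the loop-carried path through acc is two
       FMA latencies per iteration; with the first multiply split out,
       the loop-carried path is only the final add, so an out-of-order
       core can start the multiplies and the FMA of later iterations
       early.  */
    double
    dot2 (const double *a, const double *b,
          const double *c, const double *d, int n)
    {
      double acc = 0.0;
      for (int i = 0; i < n; i++)
        acc = a[i] * b[i] + c[i] * d[i] + acc;
      return acc;
    }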
> > >
> > > That should be true for all kinds of operations, no?  Thus it means
> > > reassoc should in general associate cross-iteration accumulation
Yes, I think both are true.
> >
> > > last?  Historically we associated those first because that's how the
> > > vectorizer liked to see them, but I think that's no longer necessary.
> > >
> > > It should be achievable by properly biasing the operand during
> > > rank computation (don't we already do that?).
> >
> > The issue is related to the following code (handling cases with
> > three operands left):
> >       /* When there are three operands left, we want
> >          to make sure the ones that get the double
> >          binary op are chosen wisely.  */
> >       int len = ops.length ();
> >       if (len >= 3 && !has_fma)
> >         swap_ops_for_binary_stmt (ops, len - 3);
> >
> >       new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
> >                                    powi_result != NULL
> >                                    || negate_result,
> >                                    len != orig_len);
> >
> > Originally (before the "Handle FMA friendly..." patch), for the
> > tiny example, the two multiplications would be placed first by
> > swap_ops_for_binary_stmt and rewrite_expr_tree, according to
> > ranks. Currently, to preserve more FMAs,
> > swap_ops_for_binary_stmt isn't called, so the result is
> > MULT_EXPRs and PLUS_EXPRs interleaved with each other (which
> > is mostly fine if they are not in such tight loops).
> >
> > What this patch tries to do can be summarized as: when a cross
> > backedge dependency is detected (and the uarch doesn't like it),
> > fall back to the "old" way, since we don't want the
> > loop-carried FMA anyway. (I hope I'm understanding the question
> > correctly.)
> 
> Hmm, I'd argue this should be
> 
>  if (len >= 3 && (!has_fma || (width = get_reassociation_width (...)) > 1))
>   swap_ops_for_binary_stmt (ops, len - 3);
> 
> to cover the case of exactly 3 ops - this is for exactly 3 ops, right?

Yes, this is better.
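
Spelled out, as a sketch only (I'm assuming get_reassociation_width keeps
its current parameters - the op count, the rhs code and the mode of the
lhs - as in the existing call in reassociate_bb):

      /* Also swap when the target can execute ops in parallel,
         i.e. the reassociation width is greater than 1.  WIDTH is
         cached for later use, as in the snippet above.  */
      machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
      int width = 0;
      int len = ops.length ();
      if (len >= 3
          && (!has_fma
              || (width = get_reassociation_width (len, rhs_code, mode)) > 1))
        swap_ops_for_binary_stmt (ops, len - 3);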

> In principle we could, in swap_ops_for_binary_stmt, for has_fma
> avoid doing something when no PHI recurrence is involved.  But the point
> is the target can execute things in parallel, and the thing to check for
> this is the reassociation width (even if it's for extending the chain
> via speculative execution across a backedge).

Attached is a modified version of the patch. I used a new function
(swap_for_better_reassoc_width_p) to check for cross backedge
dependency, rather than extending get_reassociation_width, since we
only want a match when the width without swapping is worse than with
swapping. For a case like "acc = a * b + c * d + e" (without loop
dependency), these uarchs can execute the code in parallel whether
it's swapped or not; in that case it's better not to swap the
operands, so that more FMAs can be generated.
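
In rough outline, the cross backedge part of the check looks for an
operand defined by a loop-header PHI, along these lines (a hand-written
sketch of the idea; the helper name here is made up, and the real
function in the attached patch also verifies that the result feeds back
into that PHI and compares the reassociation widths with and without
swapping):

    /* Sketch: return true if some operand in OPS is defined by a PHI
       in a loop header, i.e. the value may come around the backedge
       from the previous iteration.  */
    static bool
    cross_backedge_dep_p (vec<operand_entry *> &ops)
    {
      operand_entry *oe;
      unsigned i;
      FOR_EACH_VEC_ELT (ops, i, oe)
        {
          if (TREE_CODE (oe->op) != SSA_NAME)
            continue;
          gimple *def = SSA_NAME_DEF_STMT (oe->op);
          basic_block bb;
          if (gimple_code (def) == GIMPLE_PHI
              && (bb = gimple_bb (def)) != NULL
              && bb->loop_father
              && bb == bb->loop_father->header)
            return true;
        }
      return false;
    }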

> 
> > >
> > > > > > >
> > > > > > > > > If the code fragment is in a loop, some architectures can execute
> > > > > > > > > the latter in parallel, so the performance can be much faster
> > > > > > > > > than the former. For the small testcase, the performance gap is
> > > > > > > > > over 10% on both ampere1 and neoverse-n1. So the point here is to
> > > > > > > > > avoid turning the last statement into an FMA, and keep it a
> > > > > > > > > PLUS_EXPR as much as possible. (If we are rewriting the op list
> > > > > > > > > into parallel form, no special treatment is needed, since the
> > > > > > > > > last statement after rewrite_expr_tree_parallel will be a
> > > > > > > > > PLUS_EXPR anyway.)
> > > > > > > > >
> > > > > > > > > 2. Function result_feeds_back_from_phi_p checks for cross
> > > > > > > > > backedge dependency. Added a new enum fma_state to describe the
> > > > > > > > > state of FMA candidates.
> > > > > > > > >
> > > > > > > > > With this patch, there's a 3% improvement in the 510.parest_r
> > > > > > > > > 1-copy run on ampere1. The compile options are:
> > > > > > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Di Zhao
> > > > > > > > >
> > > > > > > > > ----
> > > > > > > > >
> > > > > > > > >          PR tree-optimization/110279
> > > > > > > > >
> > > > > > > > > gcc/ChangeLog:
> > > > > > > > >
> > > > > > > > >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > > > > > >          describe the state of FMA candidates for an op list.
> > > > > > > > >          (rewrite_expr_tree_parallel): Change boolean
> > > > > > > > >          parameter to enum type.
> > > > > > > > >          (result_feeds_back_from_phi_p): New function to check
> > > > > > > > >          for cross backedge dependency.
> > > > > > > > >          (rank_ops_for_fma): Return enum fma_state. Add new
> > > > > > > > >          parameter.
> > > > > > > > >          (reassociate_bb): If there's a backedge dependency in an
> > > > > > > > >          op list, swap the operands before rewrite_expr_tree.
> > > > > > > > >
> > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > >
> > > > > > > > >          * gcc.dg/pr110279.c: New test.
> > > > > > > > Not a review, but more of a question -- isn't this transformation's
> > > > > > > > profitability uarch sensitive?  I.e., just because it's bad for a set
> > > > > > > > of aarch64 uarches doesn't mean it's bad everywhere.
> > > > > > > >
> > > > > > > > And in general we shy away from trying to adjust gimple code based
> > > > > > > > on uarch preferences.
> > > > > > > >
> > > > > > > > It seems the right place to do this is gimple->rtl expansion.
> > > > > > >
> > > > > > > Another comment is that FMA forming has this deferring code, which I
> > > > > > > think deals exactly with this kind of thing.  CCing Martin, who did
> > > > > > > this work based on AMD uarchs also not wanting cross-loop dependences
> > > > > > > on FMAs (or so).  In particular I see
> > > > > > >
> > > > > > >   if (fma_state.m_deferring_p
> > > > > > >       && fma_state.m_initial_phi)
> > > > > > >     {
> > > > > > >       gcc_checking_assert (fma_state.m_last_result);
> > > > > > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > > > > > >                                                   &m_last_result_set))
> > > > > > >         cancel_fma_deferring (&fma_state);
> > > > > > >
> > > > > > > and I think code to avoid FMAs in other/related cases should be here
> > > > > > > as well, like avoiding forming back-to-back FMAs.
> > > > > >
> > > > > > The changes in this patch are controlled by "param_avoid_fma_max_bits",
> > > > > > so I think they should only affect architectures with similar behavior.
> > > > > > (The parameter was added in a previous patch, "Deferring FMA
> > > > > > transformations in tight loops", which seems to deal with the same
> > > > > > issue.)
> > > > >
> > > > > That's what I said.  Is the pipeline behavior properly modeled by the
> > > > > scheduler description?  Your patch seems to not only affect loops,
> > > > > but the FMA forming case already present does - is the case you are
> > > > > after not in a tight loop?  In principle, with a loop with enough
> > > > > scheduling freedom this should be a non-issue - unless the pipeline
> > > > > model isn't accurate and we put the FMAs too close together?
> > > >
> > > > The current deferring FMA scheme discards all the FMA candidates in an
> > > > FMA chain that has a loop dependency. So for this case there would be
> > > > too few (zero) FMAs generated, while before the "Handle FMA friendly..."
> > > > patch, or after this patch, we can still have one FMA. I also tried
> > > > changing the deferring FMA scheme so that only the last FMA is
> > > > discarded, but it still couldn't solve the regression. I think that's
> > > > because this deferring scheme is in the later pass widening_mul, which
> > > > only decides whether to form FMAs but doesn't manipulate the tree for
> > > > better data parallelism. So it seems better to optimize for data
> > > > parallelism (swapping the operands) in reassoc if we know there can be
> > > > FMAs with a loop dependency, and leave the current deferring FMA scheme
> > > > as the last resort?
> > > >
> > > > >
> > > > > Richard.
> > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > Jeff
> > > > > >
> > > >
> > > >
> > > > Thanks,
> > > > Di Zhao
> >
> >
> > Thanks,
> > Di Zhao

Thanks,
Di Zhao


----
        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-reassoc.cc (swap_for_better_reassoc_width_p):
        New function to check whether ranking the ops results in
        better parallelism.
        (reassociate_bb): For 3 ops, refine the condition to call
        swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279.c: New test.
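
For reference, a guess at the shape of the new test - the real
gcc.dg/pr110279.c is in the attached patch, and the directives and the
expected FMA count below are illustrative only:

    /* { dg-do compile } */
    /* { dg-options "-Ofast --param avoid-fma-max-bits=512 -fdump-tree-widening_mul" } */

    double
    foo (double a, double b, double c, double d, double acc, int n)
    {
      for (int i = 0; i < n; i++)
        acc = a * b + c * d + acc;      /* acc = a * b + c * d + acc */
      return acc;
    }

    /* With the operands swapped, only the independent a*b / c*d pair
       should fuse, keeping the accumulation a plain add.  */
    /* { dg-final { scan-tree-dump-times "\\.FMA" 1 "widening_mul" } } */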

Attachment: 0001-PATCH-swap-operands-in-reassoc-to-reduce-cross-backe.patch
