On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
This patch tries to fix the 2% regression in 510.parest_r on
ampere1 in the tracker. (Previous discussion is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
1. Add testcases for the problem. For an op list in the form of
"acc = a * b + c * d + acc", currently reassociation doesn't
Swap the operands so that more FMAs can be generated.
After widening_mul the result looks like:
_1 = .FMA(a, b, acc_0);
acc_1 = .FMA(c, d, _1);
While previously (before the "Handle FMA friendly..." patch),
widening_mul's result was like:
_1 = a * b;
_2 = .FMA (c, d, _1);
acc_1 = acc_0 + _2;
If the code fragment is in a loop, some architecture can execute
the latter in parallel, so the performance can be much faster than
the former. For the small testcase, the performance gap is over
10% on both ampere1 and neoverse-n1. So the point here is to avoid
turning the last statement into FMA, and keep it a PLUS_EXPR as
much as possible. (If we are rewriting the op list into parallel,
no special treatment is needed, since the last statement after
rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
2. Function result_feeds_back_from_phi_p is to check for cross
backedge dependency. Added new enum fma_state to describe the
state of FMA candidates.
With this patch, there's a 3% improvement in 510.parest_r 1-copy
run on ampere1. The compile options are:
"-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
Best regards,
Di Zhao
----
PR tree-optimization/110279
gcc/ChangeLog:
* tree-ssa-reassoc.cc (enum fma_state): New enum to
describe the state of FMA candidates for an op list.
(rewrite_expr_tree_parallel): Changed boolean
parameter to enum type.
(result_feeds_back_from_phi_p): New function to check
for cross backedge dependency.
(rank_ops_for_fma): Return enum fma_state. Added new
parameter.
(reassociate_bb): If there's backedge dependency in an
op list, swap the operands before rewrite_expr_tree.
gcc/testsuite/ChangeLog:
* gcc.dg/pr110279.c: New test.
Not a review, but more of a question -- isn't this transformation's
profitability uarch sensitive. ie, just because it's bad for a set of
aarch64 uarches, doesn't mean it's bad everywhere.
And in general we shy away from trying to adjust gimple code based on
uarch preferences.
It seems the right place to do this is gimple->rtl expansion.
Jeff