Re: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc

2023-09-01 Thread Richard Biener via Gcc-patches
On Wed, Aug 9, 2023 at 6:53 PM Di Zhao OS  wrote:
>
> Hi,
>
> The previous version of this patch tries to solve two problems
> at the same time. For better clarity, I'll separate them and
> only deal with the "nested" FMA in this version. I plan to
> propose another patch in avoiding bad shaped FMA (deferring FMA).
>
> Other changes:
>
> 1. Added new testcases for the "nested" FMA issue. For the
>following code:
>
> tmp1 = a + c * c + d * d + x * y;
> tmp2 = x * tmp1;
> result += (a + c + d + tmp2);
>
>, when "tmp1 = ..." is not rewritten, tmp1 will be result of
>an FMA, and there will be a list of consecutive FMAs:
>
> _1 = .FMA (c, c, a_39);
> _2 = .FMA (d, d, _1);
> tmp1 = .FMA (x, y, _2);
> _3 = .FMA (tmp1, x, d);
> ...
>
>If "tmp1 = ..." is rewritten to parallel, tmp1 will be result
>of a PLUS_EXPR between FMAs:
>
> _1 = .FMA (c, c, a_39);
> _2 = x * y;
> _3 = .FMA (d, d, _2);
>  tmp1 = _3 + _1;
>  _4 = .FMA (tmp1, x, d);
> ...
>
>It seems the register pressure of the latter is higher than
>the former.

Yes, that's a natural consequence of rewriting to parallel.

>On the test machines we have (including Ampere1,
>Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to
>parallel, the run time all increased significantly. In
>contrast, when "tmp1" is not the 1st or 2nd operand of another
>FMA (pr110279-1.c), rewriting it results in better performance.
>(I'll also append the testcases in the bug tracker.)
>
> 2. Enhanced checking for nested FMA by: 1) Modified
>convert_mult_to_fma so it can return multiple LHS.  2) Check
>NEGATE_EXPRs for nested FMA.
>
> (I think maybe this can be further refined by enabling rewriting
> to parallel for very long op list. )

So again, what you do applies to all operations, not just FMA.
Consider

  tmp1 = a + c + d + y;
  tmp2 = x + tmp1;
  result += (a + c + d + tmp2);
  foo (tmp2);

where I just removed all multiplications.  Since re-assoc works
on isolated single-use chains it will rewrite the tmp2 chain
to parallel and it will rewrite the result chain to parallel, in
the end this results in reassoc-width for 'result' to not be honored
because we don't see that at 'tmp2' it will fork again.  OTOH
the other 'result' arms end, so eventually just two (for reassoc
width 2) arms are "active" at any point.

That said - isn't the issue that we "overcommit" reassoc-width
this way because we apply it locally instead of globally
(of course also ignoring every other chain of instructions
reassoc isn't interestedin)?

Unfortunately we work backwards when processing chains,
if we processed leaf chains first we could record the
association width applied to the chain at its new root and
honor that when such root ends up in the oplist of a consuming
chain.  But as we work backwards we'd have to record
the reassoc width used to in the leafs of the associated
chain.  So if those become roots of other chains we can
then still honor that.

Would it work to attack the problem this way?  For
parallel rewritten chains record the width used?
Similar to operand_rank we could use a hash-map
from SSA leaf to width it appears in.  When we rewrite
a chain with such leaf as root we can then subtract
the incoming chain width from reassoc-width to lower
the width its tail?

Richard.

> Bootstrapped and regression tested on x86_64-linux-gnu.
>
> Thanks,
> Di Zhao
>
> 
>
> PR tree-optimization/110279
>
> gcc/ChangeLog:
>
> * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
> new parameter collect_lhs.
> (struct fma_transformation_info): Moved to header.
> (class fma_deferring_state): Moved to header.
> (convert_mult_to_fma): Added new parameter collect_lhs.
> * tree-ssa-math-opts.h (struct fma_transformation_info):
> (class fma_deferring_state): Moved from .cc.
> (convert_mult_to_fma): Moved from .cc.
> * tree-ssa-reassoc.cc (enum fma_state): Defined enum to
> describe the state of FMA candidates for a list of
> operands.
> (rewrite_expr_tree_parallel): Changed boolean parameter
> to enum type.
> (has_nested_fma_p): New function to check for nested FMA
> on given multiplication statement.
> (rank_ops_for_fma): Return enum fma_state.
> (reassociate_bb): Avoid rewriting to parallel if nested
> FMAs are found.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/pr110279-1.c: New test.
> * gcc.dg/pr110279-2.c: New test.


RE: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc

2023-08-18 Thread Di Zhao OS via Gcc-patches
Hi,

A few updates to the patch:

1. rank_ops_for_fma: return FMA_STATE_NESTED only for complete
   FMA chain, since the regression is obvious only in this case.

2. Added new testcase.

Thanks,
Di Zhao



PR tree-optimization/110279

gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
new parameter collect_lhs.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Added new parameter collect_lhs.
* tree-ssa-math-opts.h (struct fma_transformation_info):
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Moved from .cc.
* tree-ssa-reassoc.cc (enum fma_state): Defined enum to
describe the state of FMA candidates for a list of
operands.
(rewrite_expr_tree_parallel): Changed boolean parameter
to enum type.
(has_nested_fma_p): New function to check for nested FMA
on given multiplication statement.
(rank_ops_for_fma): Return enum fma_state.
(reassociate_bb): Avoid rewriting to parallel if nested
FMAs are found.

gcc/testsuite/ChangeLog:

* gcc.dg/pr110279-1.c: New test.
* gcc.dg/pr110279-2.c: New test.
* gcc.dg/pr110279-3.c: New test.

> -Original Message-
> From: Di Zhao OS
> Sent: Thursday, August 10, 2023 12:53 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Biener 
> Subject: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc
> 
> Hi,
> 
> The previous version of this patch tries to solve two problems
> at the same time. For better clarity, I'll separate them and
> only deal with the "nested" FMA in this version. I plan to
> propose another patch in avoiding bad shaped FMA (deferring FMA).
> 
> Other changes:
> 
> 1. Added new testcases for the "nested" FMA issue. For the
>following code:
> 
>   tmp1 = a + c * c + d * d + x * y;
>   tmp2 = x * tmp1;
>   result += (a + c + d + tmp2);
> 
>, when "tmp1 = ..." is not rewritten, tmp1 will be result of
>an FMA, and there will be a list of consecutive FMAs:
> 
>   _1 = .FMA (c, c, a_39);
>   _2 = .FMA (d, d, _1);
>   tmp1 = .FMA (x, y, _2);
>   _3 = .FMA (tmp1, x, d);
>   ...
> 
>If "tmp1 = ..." is rewritten to parallel, tmp1 will be result
>of a PLUS_EXPR between FMAs:
> 
>   _1 = .FMA (c, c, a_39);
>   _2 = x * y;
>   _3 = .FMA (d, d, _2);
>tmp1 = _3 + _1;
>_4 = .FMA (tmp1, x, d);
>   ...
> 
>It seems the register pressure of the latter is higher than
>the former. On the test machines we have (including Ampere1,
>Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to
>parallel, the run time all increased significantly. In
>contrast, when "tmp1" is not the 1st or 2nd operand of another
>FMA (pr110279-1.c), rewriting it results in better performance.
>(I'll also append the testcases in the bug tracker.)
> 
> 2. Enhanced checking for nested FMA by: 1) Modified
>convert_mult_to_fma so it can return multiple LHS.  2) Check
>NEGATE_EXPRs for nested FMA.
> 
> (I think maybe this can be further refined by enabling rewriting
> to parallel for very long op list. )
> 
> Bootstrapped and regression tested on x86_64-linux-gnu.
> 
> Thanks,
> Di Zhao



0001-PATCH-tree-optimization-110279-Check-for-nested-FMA-.patch
Description: 0001-PATCH-tree-optimization-110279-Check-for-nested-FMA-.patch


[PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc

2023-08-09 Thread Di Zhao OS via Gcc-patches
Hi,

The previous version of this patch tries to solve two problems
at the same time. For better clarity, I'll separate them and 
only deal with the "nested" FMA in this version. I plan to
propose another patch in avoiding bad shaped FMA (deferring FMA).

Other changes:

1. Added new testcases for the "nested" FMA issue. For the
   following code:

tmp1 = a + c * c + d * d + x * y;
tmp2 = x * tmp1;
result += (a + c + d + tmp2);

   , when "tmp1 = ..." is not rewritten, tmp1 will be result of
   an FMA, and there will be a list of consecutive FMAs: 

_1 = .FMA (c, c, a_39);
_2 = .FMA (d, d, _1);
tmp1 = .FMA (x, y, _2);
_3 = .FMA (tmp1, x, d);
...
   
   If "tmp1 = ..." is rewritten to parallel, tmp1 will be result
   of a PLUS_EXPR between FMAs:

_1 = .FMA (c, c, a_39);
_2 = x * y;
_3 = .FMA (d, d, _2);
 tmp1 = _3 + _1;
 _4 = .FMA (tmp1, x, d);
...

   It seems the register pressure of the latter is higher than
   the former. On the test machines we have (including Ampere1,
   Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to
   parallel, the run time all increased significantly. In
   contrast, when "tmp1" is not the 1st or 2nd operand of another
   FMA (pr110279-1.c), rewriting it results in better performance.
   (I'll also append the testcases in the bug tracker.)

2. Enhanced checking for nested FMA by: 1) Modified
   convert_mult_to_fma so it can return multiple LHS.  2) Check
   NEGATE_EXPRs for nested FMA.

(I think maybe this can be further refined by enabling rewriting
to parallel for very long op list. )

Bootstrapped and regression tested on x86_64-linux-gnu.

Thanks,
Di Zhao



PR tree-optimization/110279

gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
new parameter collect_lhs.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Added new parameter collect_lhs.
* tree-ssa-math-opts.h (struct fma_transformation_info):
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Moved from .cc.
* tree-ssa-reassoc.cc (enum fma_state): Defined enum to
describe the state of FMA candidates for a list of
operands.
(rewrite_expr_tree_parallel): Changed boolean parameter
to enum type.
(has_nested_fma_p): New function to check for nested FMA
on given multiplication statement.
(rank_ops_for_fma): Return enum fma_state.
(reassociate_bb): Avoid rewriting to parallel if nested
FMAs are found.

gcc/testsuite/ChangeLog:

* gcc.dg/pr110279-1.c: New test.
* gcc.dg/pr110279-2.c: New test.


tree-optimization-110279-Check-for-nested-FMA-in-rea.patch
Description: tree-optimization-110279-Check-for-nested-FMA-in-rea.patch