[PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width

2023-09-14 Thread Di Zhao OS via Gcc-patches
This is a new version of the patch on "nested FMA".
Sorry for updating this after so long, I've been studying and
writing micro cases to sort out the cause of the regression.

First, following previous discussion:
(https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)

1. From testing more altered cases, I don't think the
problem is that reassociation works locally. In that:

  1) On the example with multiplications:

tmp1 = a + c * c + d * d + x * y;
tmp2 = x * tmp1;
result += (a + c + d + tmp2);

  Given "result" rewritten by width=2, the performance is
  worse if we rewrite "tmp1" with width=2. In contrast, if we
  remove the multiplications from the example (and make "tmp1"
  not singe used), and still rewrite "result" by width=2, then
  rewriting "tmp1" with width=2 is better. (Make sense because
  the tree's depth at "result" is still smaller if we rewrite
  "tmp1".)

  2) I tried to modify the assembly code of the example without
  FMA, so the width of "result" is 4. On Ampere1 there's no
  obvious improvement. So although this is an interesting
  problem, it doesn't seem like the cause of the regression.

2. From assembly code of the case with FMA, one problem is
that, rewriting "tmp1" to parallel didn't decrease the
minimum CPU cycles (taking MULT_EXPRs into account), but
increased code size, so the overhead is increased.

   a) When "tmp1" is not re-written to parallel:
fmadd d31, d2, d2, d30
fmadd d31, d3, d3, d31
fmadd d31, d4, d5, d31  //"tmp1"
fmadd d31, d31, d4, d3

   b) When "tmp1" is re-written to parallel:
fmul  d31, d4, d5  
fmadd d27, d2, d2, d30 
fmadd d31, d3, d3, d31 
fadd  d31, d31, d27 //"tmp1"
fmadd d31, d31, d4, d3

For version a), there are 3 dependent FMAs to calculate "tmp1".
For version b), there are also 3 dependent instructions in the
longer path: the 1st, 3rd and 4th.

So it seems to me the current get_reassociation_width algorithm
isn't optimal in the presence of FMA. So I modified the patch to
improve get_reassociation_width, rather than check for code
patterns. (Although there could be some other complicated
factors so the regression is more obvious when there's "nested
FMA". But with this patch that should be avoided or reduced.)

With this patch 508.namd_r 1-copy run has 7% improvement on
Ampere1, on Intel Xeon there's about 3%. While I'm still
collecting data on other CPUs, I'd like to know how do you
think of this.

About changes in the patch:

1. When the op list forms a complete FMA chain, try to search
for a smaller width considering the benefit of using FMA. With
a smaller width, the increment of code size is smaller when
breaking the chain.

2. To avoid regressions, included the other patch
(https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
on this tracker again. This is because more FMA will be kept
with 1., so we need to rule out the loop dependent
FMA chains when param_avoid_fma_max_bits is set.

Thanks,
Di Zhao



PR tree-optimization/110279

gcc/ChangeLog:

* tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
New function to check whether ranking the ops results in
better parallelism.
(get_reassociation_width): Add new parameters. Search for
smaller width considering the benefit of FMA.
(rank_ops_for_fma): Change return value to be number of
MULT_EXPRs.
(reassociate_bb): For 3 ops, refine the condition to call
swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

* gcc.dg/pr110279.c: New test.


0001-Consider-FMA-in-get_reassociation_width.patch
Description: 0001-Consider-FMA-in-get_reassociation_width.patch


RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA

2023-09-04 Thread Di Zhao OS via Gcc-patches
> -Original Message-
> From: Richard Biener 
> Sent: Thursday, August 31, 2023 8:23 PM
> To: Di Zhao OS 
> Cc: Jeff Law ; Martin Jambor ; gcc-
> patc...@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
> 
> On Wed, Aug 30, 2023 at 11:33 AM Di Zhao OS
>  wrote:
> >
> > Hello Richard,
> >
> > > -Original Message-
> > > From: Richard Biener 
> > > Sent: Tuesday, August 29, 2023 7:11 PM
> > > To: Di Zhao OS 
> > > Cc: Jeff Law ; Martin Jambor ;
> gcc-
> > > patc...@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc
> to
> > > reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -Original Message-
> > > > > From: Richard Biener 
> > > > > Sent: Tuesday, August 29, 2023 4:09 PM
> > > > > To: Di Zhao OS 
> > > > > Cc: Jeff Law ; Martin Jambor ;
> > > gcc-
> > > > > patc...@gcc.gnu.org
> > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> reassoc
> > > to
> > > > > reduce cross backedge FMA
> > > > >
> > > > > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> > > > >  wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Richard Biener 
> > > > > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > > > > To: Jeff Law ; Martin Jambor
> 
> > > > > > > Cc: Di Zhao OS ; gcc-
> > > patc...@gcc.gnu.org
> > > > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> > > reassoc
> > > > > to
> > > > > > > reduce cross backedge FMA
> > > > > > >
> > > > > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-
> July/624893.html)
> > > > > > > > >
> > > > > > > > > 1. Add testcases for the problem. For an op list in the form
> of
> > > > > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > > > > Swap the operands so that more FMAs can be generated.
> > > > > > > > > After widening_mul the result looks like:
> > > > > > > > >
> > > > > > > > > _1 = .FMA(a, b, acc_0);
> > > > > > > > > acc_1 = .FMA(c, d, _1);
> > > > > > > > >
> > > > > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > > > > widening_mul's result was like:
> > > > > > > > >
> > > > > > > > > _1 = a * b;
> > > > > > > > > _2 = .FMA (c, d, _1);
> > > > > > > > > acc_1 = acc_0 + _2;
> > > > > > >
> > > > > > > How can we execute the multiply and the FMA in parallel?  They
> > > > > > > depend on each other.  Or is it the uarch can handle dependence
> > > > > > > on the add operand but only when it is with a multiplication and
> > > > > > > not a FMA in some better ways?  (I'd doubt so much complexity)
> > > > > > >
> > > > > > > Can you explain in more detail how the uarch executes one vs. the
> > > > > > > other case?
> > > >
> > > > Here's my understanding after consulted our hardware team. For the
> > > > second case, the uarch of some out-of-order processors can calculate
> > > > "_2" of several loops at the same time, since ther

RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA

2023-08-30 Thread Di Zhao OS via Gcc-patches
Hello Richard,

> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, August 29, 2023 7:11 PM
> To: Di Zhao OS 
> Cc: Jeff Law ; Martin Jambor ; gcc-
> patc...@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
> 
> On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS
>  wrote:
> >
> > Hi,
> >
> > > -Original Message-
> > > From: Richard Biener 
> > > Sent: Tuesday, August 29, 2023 4:09 PM
> > > To: Di Zhao OS 
> > > Cc: Jeff Law ; Martin Jambor ;
> gcc-
> > > patc...@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc
> to
> > > reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -Original Message-
> > > > > From: Richard Biener 
> > > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > > To: Jeff Law ; Martin Jambor 
> > > > > Cc: Di Zhao OS ; gcc-
> patc...@gcc.gnu.org
> > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> reassoc
> > > to
> > > > > reduce cross backedge FMA
> > > > >
> > > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > > >
> > > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > > Swap the operands so that more FMAs can be generated.
> > > > > > > After widening_mul the result looks like:
> > > > > > >
> > > > > > > _1 = .FMA(a, b, acc_0);
> > > > > > > acc_1 = .FMA(c, d, _1);
> > > > > > >
> > > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > > widening_mul's result was like:
> > > > > > >
> > > > > > > _1 = a * b;
> > > > > > > _2 = .FMA (c, d, _1);
> > > > > > > acc_1 = acc_0 + _2;
> > > > >
> > > > > How can we execute the multiply and the FMA in parallel?  They
> > > > > depend on each other.  Or is it the uarch can handle dependence
> > > > > on the add operand but only when it is with a multiplication and
> > > > > not a FMA in some better ways?  (I'd doubt so much complexity)
> > > > >
> > > > > Can you explain in more detail how the uarch executes one vs. the
> > > > > other case?
> >
> > Here's my understanding after consulted our hardware team. For the
> > second case, the uarch of some out-of-order processors can calculate
> > "_2" of several loops at the same time, since there's no dependency
> > among different iterations. While for the first case the next iteration
> > has to wait for the current iteration to finish, so "acc_0"'s value is
> > known. I assume it is also the case in some i386 processors, since I
> > saw the patch "Deferring FMA transformations in tight loops" also
> > changed corresponding files.
> 
> That should be true for all kind of operations, no?  Thus it means
> reassoc should in general associate cross-iteration accumulation
Yes I think both are true.

> last?  Historically we associated those first because that's how the
> vectorizer liked to see them, but I think that's no longer necessary.
> 
> It should be achievable by properly biasing the operand during
> rank computation (don't we already do that?).

The issue is related with the following codes (handling cases with
three operands left):
  /* When there are three operands left, we want
 to make sure the ones that get the double
 binary op are chosen wisely.  */
  int len = ops.length ();
  if (len >= 3 && !has_fma)
swap_ops_for_binary_stmt (ops, len - 3);

  

RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA

2023-08-29 Thread Di Zhao OS via Gcc-patches
Hi,

> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, August 29, 2023 4:09 PM
> To: Di Zhao OS 
> Cc: Jeff Law ; Martin Jambor ; gcc-
> patc...@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
> 
> On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
>  wrote:
> >
> > Hi,
> >
> > > -Original Message-
> > > From: Richard Biener 
> > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > To: Jeff Law ; Martin Jambor 
> > > Cc: Di Zhao OS ; gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc
> to
> > > reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > >  wrote:
> > > >
> > > >
> > > >
> > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > >
> > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > Swap the operands so that more FMAs can be generated.
> > > > > After widening_mul the result looks like:
> > > > >
> > > > > _1 = .FMA(a, b, acc_0);
> > > > > acc_1 = .FMA(c, d, _1);
> > > > >
> > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > widening_mul's result was like:
> > > > >
> > > > > _1 = a * b;
> > > > > _2 = .FMA (c, d, _1);
> > > > > acc_1 = acc_0 + _2;
> > >
> > > How can we execute the multiply and the FMA in parallel?  They
> > > depend on each other.  Or is it the uarch can handle dependence
> > > on the add operand but only when it is with a multiplication and
> > > not a FMA in some better ways?  (I'd doubt so much complexity)
> > >
> > > Can you explain in more detail how the uarch executes one vs. the
> > > other case?

Here's my understanding after consulted our hardware team. For the
second case, the uarch of some out-of-order processors can calculate
"_2" of several loops at the same time, since there's no dependency
among different iterations. While for the first case the next iteration
has to wait for the current iteration to finish, so "acc_0"'s value is
known. I assume it is also the case in some i386 processors, since I
saw the patch "Deferring FMA transformations in tight loops" also
changed corresponding files.

> > >
> > > > > If the code fragment is in a loop, some architecture can execute
> > > > > the latter in parallel, so the performance can be much faster than
> > > > > the former. For the small testcase, the performance gap is over
> > > > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > > > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > > > > much as possible. (If we are rewriting the op list into parallel,
> > > > > no special treatment is needed, since the last statement after
> > > > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > > > >
> > > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > state of FMA candidates.
> > > > >
> > > > > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > > > > run on ampere1. The compile options are:
> > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > >
> > > > > Best regards,
> > > > > Di Zhao
> > > > >
> > > > > 
> > > > >
> > > > >  PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >  * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > >  describe the state of FMA candidates for an op list.
> > > > >  (rewrite_expr_tree_parallel): Changed boolean
> > > > >  parameter to enum type.
&g

RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA

2023-08-29 Thread Di Zhao OS via Gcc-patches
Hi,

> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, August 29, 2023 3:41 PM
> To: Jeff Law ; Martin Jambor 
> Cc: Di Zhao OS ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
> 
> On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
>  wrote:
> >
> >
> >
> > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > This patch tries to fix the 2% regression in 510.parest_r on
> > > ampere1 in the tracker. (Previous discussion is here:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > >
> > > 1. Add testcases for the problem. For an op list in the form of
> > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > Swap the operands so that more FMAs can be generated.
> > > After widening_mul the result looks like:
> > >
> > > _1 = .FMA(a, b, acc_0);
> > > acc_1 = .FMA(c, d, _1);
> > >
> > > While previously (before the "Handle FMA friendly..." patch),
> > > widening_mul's result was like:
> > >
> > > _1 = a * b;
> > > _2 = .FMA (c, d, _1);
> > > acc_1 = acc_0 + _2;
> 
> How can we execute the multiply and the FMA in parallel?  They
> depend on each other.  Or is it the uarch can handle dependence
> on the add operand but only when it is with a multiplication and
> not a FMA in some better ways?  (I'd doubt so much complexity)
> 
> Can you explain in more detail how the uarch executes one vs. the
> other case?
> 
> > > If the code fragment is in a loop, some architecture can execute
> > > the latter in parallel, so the performance can be much faster than
> > > the former. For the small testcase, the performance gap is over
> > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > > much as possible. (If we are rewriting the op list into parallel,
> > > no special treatment is needed, since the last statement after
> > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > >
> > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > backedge dependency. Added new enum fma_state to describe the
> > > state of FMA candidates.
> > >
> > > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > > run on ampere1. The compile options are:
> > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > >
> > > Best regards,
> > > Di Zhao
> > >
> > > 
> > >
> > >  PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >  * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > >  describe the state of FMA candidates for an op list.
> > >  (rewrite_expr_tree_parallel): Changed boolean
> > >  parameter to enum type.
> > >  (result_feeds_back_from_phi_p): New function to check
> > >  for cross backedge dependency.
> > >  (rank_ops_for_fma): Return enum fma_state. Added new
> > >  parameter.
> > >  (reassociate_bb): If there's backedge dependency in an
> > >  op list, swap the operands before rewrite_expr_tree.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >  * gcc.dg/pr110279.c: New test.
> > Not a review, but more of a question -- isn't this transformation's
> > profitability uarch sensitive.  ie, just because it's bad for a set of
> > aarch64 uarches, doesn't mean it's bad everywhere.
> >
> > And in general we shy away from trying to adjust gimple code based on
> > uarch preferences.
> >
> > It seems the right place to do this is gimple->rtl expansion.
> 
> Another comment is that FMA forming has this deferring code which I
> think deals exactly with this kind of thing?  CCing Martin who did this
> work based on AMD uarchs also not wanting cross-loop dependences
> on FMAs (or so).  In particular I see
> 
>   if (fma_state.m_deferring_p
>   && fma_state.m_initial_phi)
> {
>   gcc_checking_assert (fma_state.m_last_result);
>   if (!last_fma_candidate_feeds_initial_phi (_state,
>  _last_result_set))
> cancel_fma_deferring (_state);
> 
> and I think code to avoid FMAs in other/related cases should be here
> as well, like avoid forming back-to-back FMAs.

The changes in this patch is controlled by "param_avoid_fma_max_bits", so
I think it should only affect architectures with similar behavior. (The
parameter was added in a previous patch "Deferring FMA transformations
in tight loops", which seems to be dealing with the same issue.)

> Richard.
> 
> > Jeff

Thanks,
Di Zhao


[PATCH] alias-analyis: try to find ADDR_EXPR for SSA_NAME ptr

2023-08-28 Thread Di Zhao OS via Gcc-patches
This patch tries to improve alias-analysis between an SSA_NAME and
a declaration a little. For a case like:

int array1[10], array2[10];
ptr1 = array1 + x;
ptr2 = ptr1 + y;

, *ptr2 should not alias with array2.

If we can't disambiguate from points-to information, this patch
tries to find a determined ADDR_EXPR following the defining
statements and then disambiguate by compare_base_decls. On spec2017
502.gcc, there are several thousands of new non-aliasing relation
found. (No obvious improvements or regressions found, though.) 

Bootstrapped and tested on aarch64-unknown-linux-gnu.

Thanks,
Di Zhao


gcc/ChangeLog:

* tree-ssa-alias.cc (ptr_deref_may_alias_decl_p): try
to find ADDR_EXPR for SSA_NAME ptrs.

---
 gcc/tree-ssa-alias.cc | 42 ++
 1 file changed, 34 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index cf38fe506a8..a6fe1e7b227 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -271,6 +271,38 @@ ptr_deref_may_alias_decl_p (tree ptr, tree decl)
   return ptr_deref_may_alias_decl_p (ptr, decl);
 }
 
+  if (TREE_CODE (ptr) == SSA_NAME)
+{
+  /* First disambiguate from points-to information.  */
+  pi = SSA_NAME_PTR_INFO (ptr);
+  if (pi && !pt_solution_includes (>pt, decl))
+   return false;
+
+  /* Try to find an ADDR_EXPR by walking the defining statements, so we can
+probably disambiguate from compare_base_decls.  */
+  gimple *def_stmt;
+  while ((def_stmt = SSA_NAME_DEF_STMT (ptr)))
+   {
+ if (is_gimple_assign (def_stmt)
+ && gimple_assign_rhs_code (def_stmt) == POINTER_PLUS_EXPR)
+   {
+ ptr = gimple_assign_rhs1 (def_stmt);
+ if (TREE_CODE (ptr) != SSA_NAME)
+   break;
+   }
+ /* See if we can find a certain defining source.  */
+ else if (gimple_code (def_stmt) == GIMPLE_PHI
+  && gimple_phi_num_args (def_stmt) == 1)
+   {
+ ptr = PHI_ARG_DEF (def_stmt, 0);
+ if (TREE_CODE (ptr) != SSA_NAME)
+   break;
+   }
+ else
+   break;
+   }
+}
+
   /* ADDR_EXPR pointers either just offset another pointer or directly
  specify the pointed-to set.  */
   if (TREE_CODE (ptr) == ADDR_EXPR)
@@ -279,7 +311,7 @@ ptr_deref_may_alias_decl_p (tree ptr, tree decl)
   if (base
  && (TREE_CODE (base) == MEM_REF
  || TREE_CODE (base) == TARGET_MEM_REF))
-   ptr = TREE_OPERAND (base, 0);
+   return ptr_deref_may_alias_decl_p (TREE_OPERAND (base, 0), decl);
   else if (base
   && DECL_P (base))
return compare_base_decls (base, decl) != 0;
@@ -294,13 +326,7 @@ ptr_deref_may_alias_decl_p (tree ptr, tree decl)
   if (!may_be_aliased (decl))
 return false;
 
-  /* If we do not have useful points-to information for this pointer
- we cannot disambiguate anything else.  */
-  pi = SSA_NAME_PTR_INFO (ptr);
-  if (!pi)
-return true;
-
-  return pt_solution_includes (>pt, decl);
+  return true;
 }
 
 /* Return true if dereferenced PTR1 and PTR2 may alias.
-- 
2.25.1


[PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA

2023-08-28 Thread Di Zhao OS via Gcc-patches
This patch tries to fix the 2% regression in 510.parest_r on
ampere1 in the tracker. (Previous discussion is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)

1. Add testcases for the problem. For an op list in the form of
"acc = a * b + c * d + acc", currently reassociation doesn't
Swap the operands so that more FMAs can be generated.
After widening_mul the result looks like:

   _1 = .FMA(a, b, acc_0);
   acc_1 = .FMA(c, d, _1);

While previously (before the "Handle FMA friendly..." patch),
widening_mul's result was like:

   _1 = a * b;
   _2 = .FMA (c, d, _1);
   acc_1 = acc_0 + _2;

If the code fragment is in a loop, some architecture can execute
the latter in parallel, so the performance can be much faster than
the former. For the small testcase, the performance gap is over
10% on both ampere1 and neoverse-n1. So the point here is to avoid
turning the last statement into FMA, and keep it a PLUS_EXPR as
much as possible. (If we are rewriting the op list into parallel,
no special treatment is needed, since the last statement after
rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)

2. Function result_feeds_back_from_phi_p is to check for cross
backedge dependency. Added new enum fma_state to describe the
state of FMA candidates.

With this patch, there's a 3% improvement in 510.parest_r 1-copy
run on ampere1. The compile options are: 
"-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".

Best regards,
Di Zhao



PR tree-optimization/110279

gcc/ChangeLog:

* tree-ssa-reassoc.cc (enum fma_state): New enum to
describe the state of FMA candidates for an op list.
(rewrite_expr_tree_parallel): Changed boolean 
parameter to enum type.
(result_feeds_back_from_phi_p): New function to check
for cross backedge dependency.
(rank_ops_for_fma): Return enum fma_state. Added new
parameter.
(reassociate_bb): If there's backedge dependency in an
op list, swap the operands before rewrite_expr_tree.

gcc/testsuite/ChangeLog:

* gcc.dg/pr110279.c: New test.


0001-swap-operands-in-reassoc-to-reduce-cross-backedge-FM.patch
Description: 0001-swap-operands-in-reassoc-to-reduce-cross-backedge-FM.patch


RE: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc

2023-08-18 Thread Di Zhao OS via Gcc-patches
Hi,

A few updates to the patch:

1. rank_ops_for_fma: return FMA_STATE_NESTED only for complete
   FMA chain, since the regression is obvious only in this case.

2. Added new testcase.

Thanks,
Di Zhao



PR tree-optimization/110279

gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
new parameter collect_lhs.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Added new parameter collect_lhs.
* tree-ssa-math-opts.h (struct fma_transformation_info):
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Moved from .cc.
* tree-ssa-reassoc.cc (enum fma_state): Defined enum to
describe the state of FMA candidates for a list of
operands.
(rewrite_expr_tree_parallel): Changed boolean parameter
to enum type.
(has_nested_fma_p): New function to check for nested FMA
on given multiplication statement.
(rank_ops_for_fma): Return enum fma_state.
(reassociate_bb): Avoid rewriting to parallel if nested
FMAs are found.

gcc/testsuite/ChangeLog:

* gcc.dg/pr110279-1.c: New test.
* gcc.dg/pr110279-2.c: New test.
* gcc.dg/pr110279-3.c: New test.

> -Original Message-
> From: Di Zhao OS
> Sent: Thursday, August 10, 2023 12:53 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Biener 
> Subject: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc
> 
> Hi,
> 
> The previous version of this patch tries to solve two problems
> at the same time. For better clarity, I'll separate them and
> only deal with the "nested" FMA in this version. I plan to
> propose another patch in avoiding bad shaped FMA (deferring FMA).
> 
> Other changes:
> 
> 1. Added new testcases for the "nested" FMA issue. For the
>following code:
> 
>   tmp1 = a + c * c + d * d + x * y;
>   tmp2 = x * tmp1;
>   result += (a + c + d + tmp2);
> 
>, when "tmp1 = ..." is not rewritten, tmp1 will be result of
>an FMA, and there will be a list of consecutive FMAs:
> 
>   _1 = .FMA (c, c, a_39);
>   _2 = .FMA (d, d, _1);
>   tmp1 = .FMA (x, y, _2);
>   _3 = .FMA (tmp1, x, d);
>   ...
> 
>If "tmp1 = ..." is rewritten to parallel, tmp1 will be result
>of a PLUS_EXPR between FMAs:
> 
>   _1 = .FMA (c, c, a_39);
>   _2 = x * y;
>   _3 = .FMA (d, d, _2);
>tmp1 = _3 + _1;
>_4 = .FMA (tmp1, x, d);
>   ...
> 
>It seems the register pressure of the latter is higher than
>the former. On the test machines we have (including Ampere1,
>Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to
>parallel, the run time all increased significantly. In
>contrast, when "tmp1" is not the 1st or 2nd operand of another
>FMA (pr110279-1.c), rewriting it results in better performance.
>(I'll also append the testcases in the bug tracker.)
> 
> 2. Enhanced checking for nested FMA by: 1) Modified
>convert_mult_to_fma so it can return multiple LHS.  2) Check
>NEGATE_EXPRs for nested FMA.
> 
> (I think maybe this can be further refined by enabling rewriting
> to parallel for very long op list. )
> 
> Bootstrapped and regression tested on x86_64-linux-gnu.
> 
> Thanks,
> Di Zhao



0001-PATCH-tree-optimization-110279-Check-for-nested-FMA-.patch
Description: 0001-PATCH-tree-optimization-110279-Check-for-nested-FMA-.patch


[PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc

2023-08-09 Thread Di Zhao OS via Gcc-patches
Hi,

The previous version of this patch tries to solve two problems
at the same time. For better clarity, I'll separate them and 
only deal with the "nested" FMA in this version. I plan to
propose another patch in avoiding bad shaped FMA (deferring FMA).

Other changes:

1. Added new testcases for the "nested" FMA issue. For the
   following code:

tmp1 = a + c * c + d * d + x * y;
tmp2 = x * tmp1;
result += (a + c + d + tmp2);

   , when "tmp1 = ..." is not rewritten, tmp1 will be result of
   an FMA, and there will be a list of consecutive FMAs: 

_1 = .FMA (c, c, a_39);
_2 = .FMA (d, d, _1);
tmp1 = .FMA (x, y, _2);
_3 = .FMA (tmp1, x, d);
...
   
   If "tmp1 = ..." is rewritten to parallel, tmp1 will be result
   of a PLUS_EXPR between FMAs:

_1 = .FMA (c, c, a_39);
_2 = x * y;
_3 = .FMA (d, d, _2);
 tmp1 = _3 + _1;
 _4 = .FMA (tmp1, x, d);
...

   It seems the register pressure of the latter is higher than
   the former. On the test machines we have (including Ampere1,
   Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to
   parallel, the run time all increased significantly. In
   contrast, when "tmp1" is not the 1st or 2nd operand of another
   FMA (pr110279-1.c), rewriting it results in better performance.
   (I'll also append the testcases in the bug tracker.)

2. Enhanced checking for nested FMA by: 1) Modified
   convert_mult_to_fma so it can return multiple LHS.  2) Check
   NEGATE_EXPRs for nested FMA.

(I think maybe this can be further refined by enabling rewriting
to parallel for very long op list. )

Bootstrapped and regression tested on x86_64-linux-gnu.

Thanks,
Di Zhao



PR tree-optimization/110279

gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
new parameter collect_lhs.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Added new parameter collect_lhs.
* tree-ssa-math-opts.h (struct fma_transformation_info):
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Moved from .cc.
* tree-ssa-reassoc.cc (enum fma_state): Defined enum to
describe the state of FMA candidates for a list of
operands.
(rewrite_expr_tree_parallel): Changed boolean parameter
to enum type.
(has_nested_fma_p): New function to check for nested FMA
on given multiplication statement.
(rank_ops_for_fma): Return enum fma_state.
(reassociate_bb): Avoid rewriting to parallel if nested
FMAs are found.

gcc/testsuite/ChangeLog:

* gcc.dg/pr110279-1.c: New test.
* gcc.dg/pr110279-2.c: New test.


tree-optimization-110279-Check-for-nested-FMA-in-rea.patch
Description: tree-optimization-110279-Check-for-nested-FMA-in-rea.patch


RE: [PATCH] Change fma_reassoc_width tuning for ampere1

2023-07-29 Thread Di Zhao OS via Gcc-patches
Cherry-picked this to gcc-13.

Thanks,
Di Zhao

> -Original Message-
> From: Richard Sandiford 
> Sent: Monday, June 26, 2023 10:28 PM
> To: Philipp Tomsich 
> Cc: Di Zhao OS via Gcc-patches ; Di Zhao OS
> 
> Subject: Re: [PATCH] Change fma_reassoc_width tuning for ampere1
> 
> Philipp Tomsich  writes:
> > Richard,
> >
> > OK for backport to GCC-13?
> 
> Yeah, OK for GCC 13 too.
> 
> Thanks,
> Richard
> 
> > Thanks,
> > Philipp.
> >
> > On Thu, 22 Jun 2023 at 16:18, Richard Sandiford via Gcc-patches
> >  wrote:
> >>
> >> Di Zhao OS via Gcc-patches  writes:
> >> > This patch enables reassociation of floating-point additions on ampere1.
> >> > This brings about 1% overall benefit on spec2017 fprate cases. (There
> >> > are minor regressions in 510.parest_r and 508.namd_r, analyzed here:
> >> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110279 .)
> >> >
> >> > Bootstrapped and tested on aarch64-unknown-linux-gnu. Is this OK for
> trunk?
> >> >
> >> > Thanks,
> >> > Di Zhao
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> > * config/aarch64/aarch64.cc: Change fma_reassoc_width for
> ampere1
> >>
> >> Thanks, pushed to trunk.
> >>
> >> Richard
> >>
> >> > ---
> >> > diff --git a/gcc/config/aarch64/aarch64.cc
> b/gcc/config/aarch64/aarch64.cc
> >> > index d16565b5581..301c9f6c0cd 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -1927,7 +1927,7 @@ static const struct tune_params ampere1_tunings =
> >> >"32:12",   /* loop_align.  */
> >> >2, /* int_reassoc_width.  */
> >> >4, /* fp_reassoc_width.  */
> >> > -  1, /* fma_reassoc_width.  */
> >> > +  4, /* fma_reassoc_width.  */
> >> >2, /* vec_reassoc_width.  */
> >> >2, /* min_div_recip_mul_sf.  */
> >> >2, /* min_div_recip_mul_df.  */


[PATCH v2] tree-optimization/110279- Check for nested FMA chains in reassoc

2023-07-10 Thread Di Zhao OS via Gcc-patches
Attached is an updated version of the patch.

Based on Philipp's review, some changes:

1. Defined new enum fma_state to describe the state of FMA candidates
   for a list of operands. (Since the tests seems simple after the
   change, I didn't add predicates on it.)
2. Changed return type of convert_mult_to_fma_1 and convert_mult_to_fma
   to tree, to remove the in/out parameter.
3. Added description of return value values of rank_ops_for_fma.

---
gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added new parameter
check_only_p. Changed return type to tree.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Added new parameter check_only_p. Changed
return type to tree.
* tree-ssa-math-opts.h (struct fma_transformation_info): Moved from .cc.
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Add function decl.
* tree-ssa-reassoc.cc (enum fma_state): Defined new enum to describe
the state of FMA candidates for a list of operands.
(rewrite_expr_tree_parallel): Changed boolean parameter to enum type.
(rank_ops_for_fma): Return enum fma_state.
(reassociate_bb): Avoid rewriting to parallel if nested FMAs are found.

Thanks,
Di Zhao 




0001-Check-for-nested-FMA-chains-in-reassoc.patch
Description: 0001-Check-for-nested-FMA-chains-in-reassoc.patch


[PING][PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc

2023-07-07 Thread Di Zhao OS via Gcc-patches
Update the patch so it can apply.

Tested on spec2017 fprate cases again. With option "-funroll-loops -Ofast 
-flto",
the improvements of 1-copy run are:

Ampere1:
508.namd_r  4.26% 
510.parest_r2.55%
Overall 0.54%
Intel Xeon:
503.bwaves_r1.3%
508.namd_r  1.58%
overall 0.42%


Thanks,
Di Zhao


> -Original Message-
> From: Di Zhao OS
> Sent: Friday, June 16, 2023 4:51 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] tree-optimization/110279- Check for nested FMA chains in
> reassoc
> 
> This patch is to fix the regressions found in SPEC2017 fprate cases
>  on aarch64.
> 
> 1. Reused code in pass widening_mul to check for nested FMA chains
>  (those connected by MULT_EXPRs), since re-writing to parallel
>  generates worse codes.
> 
> 2. Avoid re-arrange to produce less FMA chains that can be slow.
> 
> Tested on ampere1 and neoverse-n1, this fixed the regressions in
> 508.namd_r and 510.parest_r 1 copy run. While I'm still collecting data
> on x86 machines we have, I'd like to know what do you think of this.
> 
> (Previously I tried to improve things with FMA by adding a widening_mul
> pass before reassoc2 for it's easier to recognize different patterns
> of FMA chains and decide whether to split them. But I suppose handling
> them all in reassoc pass is more efficient.)
> 
> Thanks,
> Di Zhao
> 
> ---
> gcc/ChangeLog:
> 
> * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
> Support new mode that merely do the checking.
> (struct fma_transformation_info): Moved to header.
> (class fma_deferring_state): Moved to header.
> (convert_mult_to_fma): Add new parameter.
> * tree-ssa-math-opts.h (struct fma_transformation_info):
> (class fma_deferring_state): Moved from .cc.
> (convert_mult_to_fma): Add function decl.
> * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
> (rank_ops_for_fma): Return -1 if nested FMAs are found.
> (reassociate_bb): Avoid rewriting to parallel if nested FMAs are
> found.



0001-Check-for-nested-FMA-chains-in-reassoc.patch
Description: 0001-Check-for-nested-FMA-chains-in-reassoc.patch


[PATCH] Change fma_reassoc_width tuning for ampere1

2023-06-19 Thread Di Zhao OS via Gcc-patches
This patch enables reassociation of floating-point additions on ampere1.
This brings about 1% overall benefit on spec2017 fprate cases. (There
are minor regressions in 510.parest_r and 508.namd_r, analyzed here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110279 .)

Bootstrapped and tested on aarch64-unknown-linux-gnu. Is this OK for trunk?

Thanks,
Di Zhao

gcc/ChangeLog:

* config/aarch64/aarch64.cc: Change fma_reassoc_width for ampere1
---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d16565b5581..301c9f6c0cd 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1927,7 +1927,7 @@ static const struct tune_params ampere1_tunings =
   "32:12", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
-  1,   /* fma_reassoc_width.  */
+  4,   /* fma_reassoc_width.  */
   2,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
-- 
2.25.1




[PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc

2023-06-16 Thread Di Zhao OS via Gcc-patches
This patch is to fix the regressions found in SPEC2017 fprate cases
 on aarch64.

1. Reused code in pass widening_mul to check for nested FMA chains
 (those connected by MULT_EXPRs), since re-writing to parallel
 generates worse codes.

2. Avoid re-arrange to produce less FMA chains that can be slow.

Tested on ampere1 and neoverse-n1, this fixed the regressions in
508.namd_r and 510.parest_r 1 copy run. While I'm still collecting data
on x86 machines we have, I'd like to know what do you think of this.

(Previously I tried to improve things with FMA by adding a widening_mul
pass before reassoc2 for it's easier to recognize different patterns
of FMA chains and decide whether to split them. But I suppose handling
them all in reassoc pass is more efficient.)

Thanks,
Di Zhao

---
gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
Support new mode that merely do the checking. 
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Add new parameter.
* tree-ssa-math-opts.h (struct fma_transformation_info):
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Add function decl.
* tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
(rank_ops_for_fma): Return -1 if nested FMAs are found.
(reassociate_bb): Avoid rewriting to parallel if nested FMAs are found.



pr110279-Check-for-nested-FMA-chains-in-reassoc.diff
Description: pr110279-Check-for-nested-FMA-chains-in-reassoc.diff


RE: [PATCH] Handle FMA friendly in reassoc pass

2023-06-06 Thread Di Zhao OS via Gcc-patches
Hello Lili Cui,

Since I'm also trying to improve this lately, I've tested your patch on
several aarch64 machines we have, including neoverse-n1 and ampere1
architectures. However, I haven't reproduced the 6.00% improvement of 
503.bwaves_r single copy run you mentioned. Could you share more information
about the aarch64 CPU and compile options you tested? The option
I'm using is "-Ofast", with or without "--param avoid-fma-max-bits=512".

Additionally, we found some spec2017 cases with regressions, including
4% on 527.cam4_r (neoverse-n1).

> -Original Message-
> From: Gcc-patches  bounces+dizhao=os.amperecomputing@gcc.gnu.org> On Behalf Of Cui, Lili via
> Gcc-patches
> Sent: Thursday, May 25, 2023 7:30 AM
> To: gcc-patches@gcc.gnu.org
> Cc: richard.guent...@gmail.com; li...@linux.ibm.com; Lili Cui
> 
> Subject: [PATCH] Handle FMA friendly in reassoc pass
> 
> From: Lili Cui 
> 
> Make some changes in reassoc pass to make it more friendly to fma pass later.
> Using FMA instead of mult + add reduces register pressure and insruction
> retired.
> 
> There are mainly two changes
> 1. Put no-mult ops and mult ops alternately at the end of the queue, which is
> conducive to generating more fma and reducing the loss of FMA when breaking
> the chain.
> 2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
> chains according to the given correlation width, keeping the FMA chance as
> much as possible.
> 
> With the patch applied
> 
> On ICX:
> 507.cactuBSSN_r: Improved by 1.7% for multi-copy .
> 503.bwaves_r   : Improved by  0.60% for single copy .
> 507.cactuBSSN_r: Improved by  1.10% for single copy .
> 519.lbm_r  : Improved by  2.21% for single copy .
> no measurable changes for other benchmarks.
> 
> On aarch64
> 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> 503.bwaves_r   : Improved by 6.00% for single-copy.
> no measurable changes for other benchmarks.
> 
> TEST1:
> 
> float
> foo (float a, float b, float c, float d, float *e)
> {
>return  *e  + a * b + c * d ;
> }
> 
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
> vmulss  %xmm3, %xmm2, %xmm2
> vfmadd132ss %xmm1, %xmm2, %xmm0
> vaddss  (%rdi), %xmm0, %xmm0
> ret
> 
> With this patch GCC generates:
> vfmadd213ss   (%rdi), %xmm1, %xmm0
> vfmadd231ss   %xmm2, %xmm3, %xmm0
> ret
> 
> TEST2:
> 
> for (int i = 0; i < N; i++)
> {
>   a[i] += b[i]* c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i]
> + m[i]* o[i] + p[i];
> }
> 
> For "-Ofast -mfpmath=sse -mfma"  GCC generates:
>   vmovapd e(%rax), %ymm4
>   vmulpd  d(%rax), %ymm4, %ymm3
>   addq$32, %rax
>   vmovapd c-32(%rax), %ymm5
>   vmovapd j-32(%rax), %ymm6
>   vmulpd  h-32(%rax), %ymm6, %ymm2
>   vmovapd a-32(%rax), %ymm6
>   vaddpd  p-32(%rax), %ymm6, %ymm0
>   vmovapd g-32(%rax), %ymm7
>   vfmadd231pd b-32(%rax), %ymm5, %ymm3
>   vmovapd o-32(%rax), %ymm4
>   vmulpd  m-32(%rax), %ymm4, %ymm1
>   vmovapd l-32(%rax), %ymm5
>   vfmadd231pd f-32(%rax), %ymm7, %ymm2
>   vfmadd231pd k-32(%rax), %ymm5, %ymm1
>   vaddpd  %ymm3, %ymm0, %ymm0
>   vaddpd  %ymm2, %ymm0, %ymm0
>   vaddpd  %ymm1, %ymm0, %ymm0
>   vmovapd %ymm0, a-32(%rax)
>   cmpq$8192, %rax
>   jne .L4
>   vzeroupper
>   ret
> 
> with this patch applied GCC breaks the chain with width = 2 and generates 6
> fma:
> 
>   vmovapd a(%rax), %ymm2
>   vmovapd c(%rax), %ymm0
>   addq$32, %rax
>   vmovapd e-32(%rax), %ymm1
>   vmovapd p-32(%rax), %ymm5
>   vmovapd g-32(%rax), %ymm3
>   vmovapd j-32(%rax), %ymm6
>   vmovapd l-32(%rax), %ymm4
>   vmovapd o-32(%rax), %ymm7
>   vfmadd132pd b-32(%rax), %ymm2, %ymm0
>   vfmadd132pd d-32(%rax), %ymm5, %ymm1
>   vfmadd231pd f-32(%rax), %ymm3, %ymm0
>   vfmadd231pd h-32(%rax), %ymm6, %ymm1
>   vfmadd231pd k-32(%rax), %ymm4, %ymm0
>   vfmadd231pd m-32(%rax), %ymm7, %ymm1
>   vaddpd  %ymm1, %ymm0, %ymm0
>   vmovapd %ymm0, a-32(%rax)
>   cmpq$8192, %rax
>   jne .L2
>   vzeroupper
>   ret
> 
> gcc/ChangeLog:
> 
>   PR gcc/98350
>   * tree-ssa-reassoc.cc
>   (rewrite_expr_tree_parallel): Rewrite this function.
>   (rank_ops_for_fma): New.
>   (reassociate_bb): Handle new function.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR gcc/98350
>   * gcc.dg/pr98350-1.c: New test.
>   * gcc.dg/pr98350-2.c: Ditto.
> ---
>  gcc/testsuite/gcc.dg/pr98350-1.c |  31 
>  gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
>  gcc/tree-ssa-reassoc.cc  | 256 +--
>  3 files changed, 215 insertions(+), 83 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
> 
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-
> 1.c
> new file mode 100644
> 

RE: [RFC][PATCH] Improve generating FMA by adding a widening_mul pass

2023-05-30 Thread Di Zhao OS via Gcc-patches
Sorry I've missed the recent updates on trunk regarding handling FMA.
I'll measure again if something in this still helps.

Thanks,
Di Zhao

> -Original Message-
> From: Di Zhao OS
> Sent: Friday, May 26, 2023 3:15 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [RFC][PATCH] Improve generating FMA by adding a widening_mul pass
> 
> As GCC's reassociation pass does not have knowledge of FMA, when
> transforming expression lists to parallel, it reduces the
> opportunities to generate FMAs. Currently there's a workaround
> on AArch64 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114),
> that is, to disable the parallelization with floating-point additions.
> However, this approach may cause regressions. For example, in the
> code below there are only floating-point additions when calculating
> "result += array[j]", and rewriting to parallel is better:
> 
> // Compile with -Ofast on aarch64
> float foo (int n, float in)
> {
>   float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
>   float result = 0.0;
>   for (int i = 0; i < n; i++)
> {
>   if (i % 10)
> for (unsigned j = 0; j < 8; j++)
>   array[j] *= in;
> 
>   for (unsigned j = 0; j < 8; j++)
>result += array[j];
> }
>   return result;
> }
> 
> To improve this, one option is to count the number of MUL_EXPRs in an
> operator list before rewriting to parallel, and allow the rewriting
> when there's none (or 1 MUL_EXPR). This is simple and unlikely to
> introduce regressions. However it lacks flexibility and can not handle
> more general cases.
> 
> Here's an attempt to address the issue more generally.
> 
> 1. Added an additional widening_mul pass before the original reassoc2
> pass. The new pass is limited to only insert FMA, and leave other
> operations like convert_mult_to_widen to the old late widening_mul pass,
> in case other optimizations between the two passes could be hindered.
> 
> 2. On some platforms, for a very long FMA chain, rewriting to parallel
> can be faster. Extended the original "deferring" logic so that all
> conversions to FMA can be deferred. Introduced a new parameter
> op-count-prefer-reassoc to control this behavior.
> 
> 3. Additionally, the new widening_mul pass calls execute_reassoc first,
> to avoid losing opportunities such as folding constants and
> undistributing.
> 
> However, changing the sequence of generating FMA and reassociation may
> expose more FMA chains that are slow (see commit 4a0d0ed2).
> To reduce possible regressions, improved handling the slow FMA chain by:
> 
> 1. Modified result_of_phi to support checking an additional FADD/FMUL.
> 
> 2. On some CPUs, rather than removing the whole FMA chain, only skipping
> a few candidates may generate faster code. Added new parameter
> fskip-fma-heuristic to control this behavior.
> 
> This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.
> 
> Thanks,
> Di Zhao



[RFC][PATCH] Improve generating FMA by adding a widening_mul pass

2023-05-26 Thread Di Zhao OS via Gcc-patches
As GCC's reassociation pass does not have knowledge of FMA, when
transforming expression lists to parallel, it reduces the
opportunities to generate FMAs. Currently there's a workaround
on AArch64 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114),
that is, to disable the parallelization with floating-point additions.
However, this approach may cause regressions. For example, in the
code below there are only floating-point additions when calculating
"result += array[j]", and rewriting to parallel is better:

// Compile with -Ofast on aarch64
float foo (int n, float in)
{
  float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
  float result = 0.0;
  for (int i = 0; i < n; i++)
{
  if (i % 10)
for (unsigned j = 0; j < 8; j++)
  array[j] *= in;

  for (unsigned j = 0; j < 8; j++)
   result += array[j];
}
  return result;
}

To improve this, one option is to count the number of MUL_EXPRs in an
operator list before rewriting to parallel, and allow the rewriting
when there's none (or 1 MUL_EXPR). This is simple and unlikely to
introduce regressions. However it lacks flexibility and can not handle
more general cases.

Here's an attempt to address the issue more generally.

1. Added an additional widening_mul pass before the original reassoc2
pass. The new pass is limited to only insert FMA, and leave other
operations like convert_mult_to_widen to the old late widening_mul pass,
in case other optimizations between the two passes could be hindered.

2. On some platforms, for a very long FMA chain, rewriting to parallel
can be faster. Extended the original "deferring" logic so that all
conversions to FMA can be deferred. Introduced a new parameter
op-count-prefer-reassoc to control this behavior.

3. Additionally, the new widening_mul pass calls execute_reassoc first,
to avoid losing opportunities such as folding constants and
undistributing.

However, changing the sequence of generating FMA and reassociation may
expose more FMA chains that are slow (see commit 4a0d0ed2).
To reduce possible regressions, improved handling the slow FMA chain by:

1. Modified result_of_phi to support checking an additional FADD/FMUL.

2. On some CPUs, rather than removing the whole FMA chain, only skipping
a few candidates may generate faster code. Added new parameter
fskip-fma-heuristic to control this behavior.

This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.

Thanks,
Di Zhao



0001-Improve-generating-FMA-by-adding-a-widening_mul-pass.patch
Description: 0001-Improve-generating-FMA-by-adding-a-widening_mul-pass.patch


PING: [PATCH v6] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2022-11-15 Thread Di Zhao OS via Gcc-patches
Hi,

I saw that Stage 1 of GCC 13 development is just ended. So is this
considered? Or should I bring this up when general development is
reopened?


Thanks,
Di Zhao


> -Original Message-
> From: Di Zhao OS
> Sent: Tuesday, October 25, 2022 8:18 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Biener 
> Subject: [PATCH v6] tree-optimization/101186 - extend FRE with "equivalence
> map" for condition prediction
> 
> Sorry for the late update. I've been on a vacation and then I
> spent some time updating and verifying the patch.
> 
> Attached is a new version of the patch. There are some changes:
> 
> 1. Store equivalences in a vn_pval chain in vn_ssa_aux, rather than
>in the expression hash table. (Following Richard's suggestion.)
> 2. Extracted insert_single_predicated_value function.
> 3. Simplify record_equiv_from_prev_phi a bit.
> 4. Changed some of the functions' names and tried to improve the
>comments a little.
> 
> Current status of the new testcases in the patch:
> 
> ssa-fre-200.c Can also be optimized by evrp
> ssa-fre-201.c Not optimized in trunk.
> ssa-fre-202.c foo() can be removed by evrp; while x + b is not
>   folded.
> ssa-pre-34.c  Not optimized in trunk.
> 
> Initially, this patch is motivated to remove the unreachable codes
> in case like ssa-pre-34.c, in which we need to use equivalence
> relation produced from a preceding condition for another condition.
> VRP didn't optimize that because it needs jump threading to make
> the relation valid at the second condition.
> 
> After browsing the mechanisms of VRP and FRE, it seems to me there
> are two options: 1) Teach VRP to identify related but not threaded
> conditions. That might require introducing value-numbering into VRP
> to detect common expressions, and I think is too much for this.
> 2) Introduce temporary equivalence in sccvn, which I thought would
> change less on current code. (And along the reviews and updating
> patch I see how ad-hoc it was.)
> 
> I saw from the talk about VN there's plan to replace predicated
> values by ranger. So how does it goes? Is there something I can help
> with? (For the case ssa-pre-34.c, I think maybe it still needs the
> predicated-value support, to lookup related conditional expressions.)
> 
> Below are about questions in the last review:
> 
> > >  /* Valid hashtables storing information we have proven to be
> > > correct.  */
> > > @@ -490,9 +492,9 @@ VN_INFO (tree name)
> > >   nary->predicated_values = 0;
> > >   nary->u.result = boolean_true_node;
> > >   vn_nary_op_insert_into (nary, valid_info->nary);
> > > - gcc_assert (nary->unwind_to == NULL);
> >
> > why's that?  doesn't this mean unwinding will be broken?
> 
> Previously, predicate "argument_x == NULL" or "argument_x != NULL"
> is always new here (because argument_x's VN is just inserted.)
> But with the patch, there can be slot for "argument_x == NULL"
> or "argument_x != NULL" already. It won't break unwinding as the
> new value is not linked to the unwind-chain.
> 
> >
> > >   /* Also do not link it into the undo chain.  */
> > >   last_inserted_nary = nary->next;
> > > + /* There could be a predicate already.  */
> > >   nary->next = (vn_nary_op_t)(void *)-1;
> > >   nary = alloc_vn_nary_op_noinit (2, _tables_insert_obstack);
> > >   init_vn_nary_op_from_pieces (nary, 2, EQ_EXPR,
> 
> > >  /* Compute and return the hash value for nary operation VBO1.  */
> > >
> > >  hashval_t
> > > @@ -4226,6 +4342,9 @@ init_vn_nary_op_from_stmt (vn_nary_op_t vno, gassign
> *stmt)
> > >for (i = 0; i < vno->length; ++i)
> > >   vno->op[i] = gimple_op (stmt, i + 1);
> > >  }
> > > +  /* Insert and lookup N-ary results by the operands' equivalence heads.
> */
> > > +  if (gimple_bb (stmt))
> > > +lookup_equiv_heads (vno->length, vno->op, vno->op, gimple_bb (stmt));
> >
> > That seems like the wrong place, the function didn't even valueize before.
> 
> To utilize temp-equivalences and get more simplified result, n-ary
> expressions should be always inserted and lookup by the operands'
> equivalence heads. So practically all the places
> init_vn_nary_op_from_stmt is used, lookup_equiv_heads (changed to
> get_equiv_heads) should be called. As I haven't found better place
> to put that, I just left it here in the patch..
> 
> > >  visit_nary_op (tree lhs, gassign *stmt)
> > >  {
> > >vn_nary_op_t vnresult;
> > > -  tree result = vn_nary_op_lookup_stmt (stmt, );
> > > -  if (! result && vnresult)
> > > +  unsigned length = vn_nary_length_from_stmt (stmt);
> > > +  vn_nary_op_t vno
> > > += XALLOCAVAR (struct vn_nary_op_s, sizeof_vn_nary_op (length));
> > > +  init_vn_nary_op_from_stmt (vno, stmt);
> > > +  tree result = NULL_TREE;
> > > +  /* Try to get a simplified result.  */
> > > +  /* Do not simplify variable used in PHI at loop exit, or
> > > + simplify_peeled_chrec/constant_after_peeling may miss the loop.  */
> > > +  gimple *use_stmt;
> > > +  

[PATCH v6] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2022-10-24 Thread Di Zhao OS via Gcc-patches
Sorry for the late update. I've been on a vacation and then I
spent some time updating and verifying the patch.

Attached is a new version of the patch. There are some changes:

1. Store equivalences in a vn_pval chain in vn_ssa_aux, rather than
   in the expression hash table. (Following Richard's suggestion.)
2. Extracted insert_single_predicated_value function.
3. Simplify record_equiv_from_prev_phi a bit.
4. Changed some of the functions' names and tried to improve the
   comments a little.

Current status of the new testcases in the patch:

ssa-fre-200.c   Can also be optimized by evrp
ssa-fre-201.c   Not optimized in trunk.
ssa-fre-202.c   foo() can be removed by evrp; while x + b is not
folded.
ssa-pre-34.cNot optimized in trunk.

Initially, this patch is motivated to remove the unreachable codes
in case like ssa-pre-34.c, in which we need to use equivalence
relation produced from a preceding condition for another condition.
VRP didn't optimize that because it needs jump threading to make
the relation valid at the second condition.

After browsing the mechanisms of VRP and FRE, it seems to me there
are two options: 1) Teach VRP to identify related but not threaded
conditions. That might require introducing value-numbering into VRP 
to detect common expressions, and I think is too much for this. 
2) Introduce temporary equivalence in sccvn, which I thought would
change less on current code. (And along the reviews and updating
patch I see how ad-hoc it was.)

I saw from the talk about VN there's plan to replace predicated
values by ranger. So how does it goes? Is there something I can help
with? (For the case ssa-pre-34.c, I think maybe it still needs the
predicated-value support, to lookup related conditional expressions.)

Below are about questions in the last review:

> >  /* Valid hashtables storing information we have proven to be
> > correct.  */
> > @@ -490,9 +492,9 @@ VN_INFO (tree name)
> > nary->predicated_values = 0;
> > nary->u.result = boolean_true_node;
> > vn_nary_op_insert_into (nary, valid_info->nary);
> > -   gcc_assert (nary->unwind_to == NULL);
> 
> why's that?  doesn't this mean unwinding will be broken?

Previously, predicate "argument_x == NULL" or "argument_x != NULL"
is always new here (because argument_x's VN is just inserted.)
But with the patch, there can be slot for "argument_x == NULL"
or "argument_x != NULL" already. It won't break unwinding as the
new value is not linked to the unwind-chain.

> 
> > /* Also do not link it into the undo chain.  */
> > last_inserted_nary = nary->next;
> > +   /* There could be a predicate already.  */
> > nary->next = (vn_nary_op_t)(void *)-1;
> > nary = alloc_vn_nary_op_noinit (2, _tables_insert_obstack);
> > init_vn_nary_op_from_pieces (nary, 2, EQ_EXPR,

> >  /* Compute and return the hash value for nary operation VBO1.  */
> >  
> >  hashval_t
> > @@ -4226,6 +4342,9 @@ init_vn_nary_op_from_stmt (vn_nary_op_t vno, gassign 
> > *stmt)
> >for (i = 0; i < vno->length; ++i)
> > vno->op[i] = gimple_op (stmt, i + 1);
> >  }
> > +  /* Insert and lookup N-ary results by the operands' equivalence heads.  
> > */
> > +  if (gimple_bb (stmt))
> > +lookup_equiv_heads (vno->length, vno->op, vno->op, gimple_bb (stmt));
> 
> That seems like the wrong place, the function didn't even valueize before.

To utilize temp-equivalences and get more simplified result, n-ary
expressions should be always inserted and lookup by the operands'
equivalence heads. So practically all the places
init_vn_nary_op_from_stmt is used, lookup_equiv_heads (changed to
get_equiv_heads) should be called. As I haven't found better place
to put that, I just left it here in the patch..

> >  visit_nary_op (tree lhs, gassign *stmt)
> >  {
> >vn_nary_op_t vnresult;
> > -  tree result = vn_nary_op_lookup_stmt (stmt, );
> > -  if (! result && vnresult)
> > +  unsigned length = vn_nary_length_from_stmt (stmt);
> > +  vn_nary_op_t vno
> > += XALLOCAVAR (struct vn_nary_op_s, sizeof_vn_nary_op (length));
> > +  init_vn_nary_op_from_stmt (vno, stmt);
> > +  tree result = NULL_TREE;
> > +  /* Try to get a simplified result.  */
> > +  /* Do not simplify variable used in PHI at loop exit, or
> > + simplify_peeled_chrec/constant_after_peeling may miss the loop.  */
> > +  gimple *use_stmt;
> > +  use_operand_p use_p;
> > +  if (!(single_imm_use (lhs, _p, _stmt)
> > +   && gimple_code (use_stmt) == GIMPLE_PHI
> > +   && single_succ_p (gimple_bb (use_stmt))
> > +   && (single_succ_edge (gimple_bb (use_stmt))->flags & EDGE_DFS_BACK)))
> > +result = fold_const_from_equiv_heads (vno->length, vno->opcode, 
> > vno->op,
> > + vno->type);
> 
> copy propagating conditional equivalences has proved problematic, why
> do this at all when there's no actual simplification?  It's a bit odd that
> we need a special fold_const_from_equiv_heads 

PING^2: [PATCH v5] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2022-09-07 Thread Di Zhao OS via Gcc-patches
Gentle ping again.

Thanks,
Di Zhao

> -Original Message-
> From: Di Zhao OS
> Sent: Tuesday, July 12, 2022 2:08 AM
> To: 'gcc-patches@gcc.gnu.org' 
> Cc: 'Richard Biener' 
> Subject: PING: [PATCH v5] tree-optimization/101186 - extend FRE with
> "equivalence map" for condition prediction
> 
> Updated the patch in the attachment, so it can apply.
> 
> Thanks,
> Di Zhao
> 
> > -Original Message-
> > From: Di Zhao OS
> > Sent: Sunday, May 29, 2022 11:59 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Richard Biener 
> > Subject: [PATCH v5] tree-optimization/101186 - extend FRE with "equivalence
> > map" for condition prediction
> >
> > Hi, attached is a new version of the patch. The changes are:
> > - Skip using temporary equivalences for floating-point values, because
> > folding expressions can generate incorrect values. For example,
> > operations on 0.0 and -0.0 may have different results.
> > - Avoid inserting duplicated back-refs from value-number to predicates.
> > - Disable fre in testsuite/g++.dg/pr83541.C .
> >
> > Summary of the previous versions:
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-December/587346.html
> >
> > Is the patch still considered?
> >
> > Thanks,
> > Di Zhao
> >
> > ---
> >
> > Extend FRE with temporary equivalences.
> >
> > 2022-05-29  Di Zhao  
> >
> > gcc/ChangeLog:
> > PR tree-optimization/101186
> > * tree-ssa-sccvn.c (VN_INFO): remove assertions (there could be a
> > predicate already).
> > (dominated_by_p_w_unex): Moved upward.
> > (vn_nary_op_get_predicated_value): Moved upward.
> > (is_vn_valid_at_bb): Check if vn_pval is valid at BB.
> > (lookup_equiv_head): Lookup the "equivalence head" of given node.
> > (lookup_equiv_heads): Lookup the "equivalence head"s of given nodes.
> > (vn_tracking_edge): Extracted utility function.
> > (init_vn_nary_op_from_stmt): Insert and lookup by "equivalence
> head"s.
> > (vn_nary_op_insert_into): Insert new value at the front.
> > (vn_nary_op_insert_pieces_predicated_1): Insert as predicated values
> > from pieces.
> > (fold_const_from_equiv_heads): Fold N-ary expression of equiv-heads.
> > (push_new_nary_ref): Insert a back-reference to vn_nary_op_t.
> > (val_equiv_insert): Record temporary equivalence.
> > (vn_nary_op_insert_pieces_predicated): Record equivalences instead
> of
> > some predicates; insert back-refs.
> > (record_equiv_from_prev_phi_1): Record temporary equivalences
> > generated
> > by PHI nodes.
> > (record_equiv_from_prev_phi): Given an outgoing edge of a
> conditional
> > expression taken, record equivalences generated by PHI nodes.
> > (visit_nary_op): Add lookup previous results of N-ary operations by
> > equivalences.
> > (insert_related_predicates_on_edge): Some predicates can be computed
> > from equivalences, no need to insert them.
> > (process_bb): Add lookup predicated values by equivalences.
> > (struct unwind_state): Unwind state of back-refs to vn_nary_op_t.
> > (do_unwind): Unwind the back-refs to vn_nary_op_t.
> > (do_rpo_vn): Update back-reference unwind state.
> > * tree-ssa-sccvn.h (struct nary_ref): hold a lists of references to
> > the
> > nary map entries.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * g++.dg/pr83541.C: Disable fre.
> > * gcc.dg/tree-ssa/pr68619-2.c: Disable fre.
> > * gcc.dg/tree-ssa/pr71947-1.c: Disable fre.
> > * gcc.dg/tree-ssa/pr71947-2.c: Disable fre.
> > * gcc.dg/tree-ssa/pr71947-3.c: Disable fre.
> > * gcc.dg/tree-ssa/pr71947-5.c: Disable fre.
> > * gcc.dg/tree-ssa/pr71947-7.c: Disable fre.
> > * gcc.dg/tree-ssa/pr71947-8.c: Disable fre.
> > * gcc.dg/tree-ssa/pr71947-9.c: Disable fre.
> > * gcc.dg/tree-ssa/vrp03.c: Disable fre.
> > * gcc.dg/tree-ssa/ssa-fre-100.c: New test.
> > * gcc.dg/tree-ssa/ssa-fre-101.c: New test.
> > * gcc.dg/tree-ssa/ssa-fre-102.c: New test.
> > * gcc.dg/tree-ssa/ssa-pre-34.c: New test.


PING: [PATCH v5] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2022-07-11 Thread Di Zhao OS via Gcc-patches
Updated the patch in the attachment, so it can apply.

Thanks,
Di Zhao

> -Original Message-
> From: Di Zhao OS
> Sent: Sunday, May 29, 2022 11:59 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Biener 
> Subject: [PATCH v5] tree-optimization/101186 - extend FRE with "equivalence
> map" for condition prediction
> 
> Hi, attached is a new version of the patch. The changes are:
> - Skip using temporary equivalences for floating-point values, because
> folding expressions can generate incorrect values. For example,
> operations on 0.0 and -0.0 may have different results.
> - Avoid inserting duplicated back-refs from value-number to predicates.
> - Disable fre in testsuite/g++.dg/pr83541.C .
> 
> Summary of the previous versions:
> https://gcc.gnu.org/pipermail/gcc-patches/2021-December/587346.html
> 
> Is the patch still considered?
> 
> Thanks,
> Di Zhao
> 
> ---
> 
> Extend FRE with temporary equivalences.
> 
> 2022-05-29  Di Zhao  
> 
> gcc/ChangeLog:
> PR tree-optimization/101186
> * tree-ssa-sccvn.c (VN_INFO): remove assertions (there could be a
> predicate already).
> (dominated_by_p_w_unex): Moved upward.
> (vn_nary_op_get_predicated_value): Moved upward.
> (is_vn_valid_at_bb): Check if vn_pval is valid at BB.
> (lookup_equiv_head): Lookup the "equivalence head" of given node.
> (lookup_equiv_heads): Lookup the "equivalence head"s of given nodes.
> (vn_tracking_edge): Extracted utility function.
> (init_vn_nary_op_from_stmt): Insert and lookup by "equivalence head"s.
> (vn_nary_op_insert_into): Insert new value at the front.
> (vn_nary_op_insert_pieces_predicated_1): Insert as predicated values
> from pieces.
> (fold_const_from_equiv_heads): Fold N-ary expression of equiv-heads.
> (push_new_nary_ref): Insert a back-reference to vn_nary_op_t.
> (val_equiv_insert): Record temporary equivalence.
> (vn_nary_op_insert_pieces_predicated): Record equivalences instead of
> some predicates; insert back-refs.
> (record_equiv_from_prev_phi_1): Record temporary equivalences
> generated
> by PHI nodes.
> (record_equiv_from_prev_phi): Given an outgoing edge of a conditional
> expression taken, record equivalences generated by PHI nodes.
> (visit_nary_op): Add lookup previous results of N-ary operations by
> equivalences.
> (insert_related_predicates_on_edge): Some predicates can be computed
> from equivalences, no need to insert them.
> (process_bb): Add lookup predicated values by equivalences.
> (struct unwind_state): Unwind state of back-refs to vn_nary_op_t.
> (do_unwind): Unwind the back-refs to vn_nary_op_t.
> (do_rpo_vn): Update back-reference unwind state.
> * tree-ssa-sccvn.h (struct nary_ref): hold a lists of references to
> the
> nary map entries.
> 
> gcc/testsuite/ChangeLog:
> 
> * g++.dg/pr83541.C: Disable fre.
> * gcc.dg/tree-ssa/pr68619-2.c: Disable fre.
> * gcc.dg/tree-ssa/pr71947-1.c: Disable fre.
> * gcc.dg/tree-ssa/pr71947-2.c: Disable fre.
> * gcc.dg/tree-ssa/pr71947-3.c: Disable fre.
> * gcc.dg/tree-ssa/pr71947-5.c: Disable fre.
> * gcc.dg/tree-ssa/pr71947-7.c: Disable fre.
> * gcc.dg/tree-ssa/pr71947-8.c: Disable fre.
> * gcc.dg/tree-ssa/pr71947-9.c: Disable fre.
> * gcc.dg/tree-ssa/vrp03.c: Disable fre.
> * gcc.dg/tree-ssa/ssa-fre-100.c: New test.
> * gcc.dg/tree-ssa/ssa-fre-101.c: New test.
> * gcc.dg/tree-ssa/ssa-fre-102.c: New test.
> * gcc.dg/tree-ssa/ssa-pre-34.c: New test.


v5-tree-optimization-101186.patch
Description: v5-tree-optimization-101186.patch


[PATCH v5] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2022-05-29 Thread Di Zhao OS via Gcc-patches
Hi, attached is a new version of the patch. The changes are:
- Skip using temporary equivalences for floating-point values, because
folding expressions can generate incorrect values. For example, 
operations on 0.0 and -0.0 may have different results.
- Avoid inserting duplicated back-refs from value-number to predicates.
- Disable fre in testsuite/g++.dg/pr83541.C .

Summary of the previous versions:
https://gcc.gnu.org/pipermail/gcc-patches/2021-December/587346.html

Is the patch still considered?

Thanks,
Di Zhao

---

Extend FRE with temporary equivalences.

2022-05-29  Di Zhao  

gcc/ChangeLog:
PR tree-optimization/101186
* tree-ssa-sccvn.c (VN_INFO): remove assertions (there could be a
predicate already).
(dominated_by_p_w_unex): Moved upward.
(vn_nary_op_get_predicated_value): Moved upward.
(is_vn_valid_at_bb): Check if vn_pval is valid at BB.
(lookup_equiv_head): Lookup the "equivalence head" of given node.
(lookup_equiv_heads): Lookup the "equivalence head"s of given nodes.
(vn_tracking_edge): Extracted utility function.
(init_vn_nary_op_from_stmt): Insert and lookup by "equivalence head"s.
(vn_nary_op_insert_into): Insert new value at the front.
(vn_nary_op_insert_pieces_predicated_1): Insert as predicated values
from pieces.
(fold_const_from_equiv_heads): Fold N-ary expression of equiv-heads.
(push_new_nary_ref): Insert a back-reference to vn_nary_op_t.
(val_equiv_insert): Record temporary equivalence.
(vn_nary_op_insert_pieces_predicated): Record equivalences instead of
some predicates; insert back-refs.
(record_equiv_from_prev_phi_1): Record temporary equivalences generated
by PHI nodes.
(record_equiv_from_prev_phi): Given an outgoing edge of a conditional
expression taken, record equivalences generated by PHI nodes.
(visit_nary_op): Add lookup previous results of N-ary operations by
equivalences.
(insert_related_predicates_on_edge): Some predicates can be computed
from equivalences, no need to insert them.
(process_bb): Add lookup predicated values by equivalences.
(struct unwind_state): Unwind state of back-refs to vn_nary_op_t.
(do_unwind): Unwind the back-refs to vn_nary_op_t.
(do_rpo_vn): Update back-reference unwind state.
* tree-ssa-sccvn.h (struct nary_ref): hold a lists of references to the
nary map entries.

gcc/testsuite/ChangeLog:

* g++.dg/pr83541.C: Disable fre.
* gcc.dg/tree-ssa/pr68619-2.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-1.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-2.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-3.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-5.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-7.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-8.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-9.c: Disable fre.
* gcc.dg/tree-ssa/vrp03.c: Disable fre.
* gcc.dg/tree-ssa/ssa-fre-100.c: New test.
* gcc.dg/tree-ssa/ssa-fre-101.c: New test.
* gcc.dg/tree-ssa/ssa-fre-102.c: New test.
* gcc.dg/tree-ssa/ssa-pre-34.c: New test.


v5-tree-optimization-101186.patch
Description: v5-tree-optimization-101186.patch


[PATCH v4] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2021-12-24 Thread Di Zhao OS via Gcc-patches
Here's a brief summary on the patch:

v4 (this version):
- In process_bb's condition-prediction code: update equivalence-heads if
  value-numbers have changed, otherwise some chances can be lost.

v3 (a few minor updates): 
- Simplify function record_equiv_from_prev_phi_1 by removing an argument.
- Fixed two small bugs that can lead to losing optimize opportunities.
- Link: https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586687.html

v2:
- Use equivalence-heads (disjoint set) to represent temporary equivalences.
- Improved performance by reducing predicates-recording.
- The last version and some discussion:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586116.html
- Earlier discussion:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-October/580688.html
- The patch:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579607.html

v1: 
- The initial patch (use a hashmap to store temporary equivalences):
  https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573603.html
- Updates (fixed some code-style issue) and some explanation:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575531.html
- Reply from Richard:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576185.html

If the patch is considered, a following step I'm thinking is to
teach PRE pass with the temporary equivalences. Currently regression
test pr68619-4.c fails with the patch, because a variable's value
number turns to constant, and its equivalent's head is lost in the
AVAIL set. I think with this change, more partial redundancies can
be found.

Thanks,
Di Zhao

---

Extend FRE with temporary equivalences.

2021-12-24  Di Zhao  

gcc/ChangeLog:
PR tree-optimization/101186
* tree-ssa-sccvn.c (VN_INFO): remove assertions (there could be a
predicate already).
(dominated_by_p_w_unex): Moved upward.
(vn_nary_op_get_predicated_value): Moved upward.
(is_vn_valid_at_bb): Check if vn_pval is valid at BB.
(lookup_equiv_head): Lookup the "equivalence head" of given node.
(lookup_equiv_heads): Lookup the "equivalence head"s of given nodes.
(vn_tracking_edge): Extracted utility function.
(init_vn_nary_op_from_stmt): Insert and lookup by "equivalence head"s.
(vn_nary_op_insert_into): Insert new value at the front.
(vn_nary_op_insert_pieces_predicated_1): Insert as predicated values
from pieces.
(fold_const_from_equiv_heads): Fold N-ary expression of equiv-heads.
(push_new_nary_ref): Insert a back-reference to vn_nary_op_t.
(val_equiv_insert): Record temporary equivalence.
(vn_nary_op_insert_pieces_predicated): Record equivalences instead of
some predicates; insert back-refs.
(record_equiv_from_prev_phi_1): Record temporary equivalences generated
by PHI nodes.
(record_equiv_from_prev_phi): Given an outgoing edge of a conditional
expression taken, record equivalences generated by PHI nodes.
(visit_nary_op): Add lookup previous results of N-ary operations by
equivalences.
(insert_related_predicates_on_edge): Some predicates can be computed
from equivalences, no need to insert them.
(process_bb): Add lookup predicated values by equivalences.
(struct unwind_state): Unwind state of back-refs to vn_nary_op_t.
(do_unwind): Unwind the back-refs to vn_nary_op_t.
(do_rpo_vn): Update back-reference unwind state.
* tree-ssa-sccvn.h (struct nary_ref): hold a lists of references to the
nary map entries.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/pr68619-2.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-1.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-2.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-3.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-5.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-7.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-8.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-9.c: Disable fre.
* gcc.dg/tree-ssa/vrp03.c: Disable fre.
* gcc.dg/tree-ssa/ssa-fre-100.c: New test.
* gcc.dg/tree-ssa/ssa-fre-101.c: New test.
* gcc.dg/tree-ssa/ssa-fre-102.c: New test.
* gcc.dg/tree-ssa/ssa-pre-34.c: New test.


v4-tree-optimization-101186.patch
Description: v4-tree-optimization-101186.patch


[PATCH v3] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2021-12-13 Thread Di Zhao OS via Gcc-patches
A few minor updates on the patch:
- Simplify function record_equiv_from_prev_phi_1 by removing an argument.
- Fixed two small bugs that can lead to losing optimize opportunities.

Thanks,
Di Zhao

---

Extend FRE with temporary equivalences.

2021-12-13  Di Zhao  

gcc/ChangeLog:
PR tree-optimization/101186
* tree-ssa-sccvn.c (VN_INFO): remove assertions (there could be a
predicate already).
(dominated_by_p_w_unex): Moved upward.
(vn_nary_op_get_predicated_value): Moved upward.
(is_vn_valid_at_bb): Check if vn_pval is valid at BB.
(lookup_equiv_head): Lookup the "equivalence head" of given node.
(lookup_equiv_heads): Lookup the "equivalence head"s of given nodes.
(vn_tracking_edge): Extracted utility function.
(init_vn_nary_op_from_stmt): Insert and lookup by "equivalence head"s.
(vn_nary_op_insert_into): Insert new value at the front.
(vn_nary_op_insert_pieces_predicated_1): Insert as predicated values
from pieces.
(fold_const_from_equiv_heads): Fold N-ary expression of equiv-heads.
(push_new_nary_ref): Insert a back-reference to vn_nary_op_t.
(val_equiv_insert): Record temporary equivalence.
(vn_nary_op_insert_pieces_predicated): Record equivalences instead of
some predicates; insert back-refs.
(record_equiv_from_prev_phi_1): Record temporary equivalences generated
by PHI nodes.
(record_equiv_from_prev_phi): Given an outgoing edge of a conditional
expression taken, record equivalences generated by PHI nodes.
(visit_nary_op): Add lookup previous results of N-ary operations by
equivalences.
(insert_related_predicates_on_edge): Some predicates can be computed
from equivalences, no need to insert them.
(process_bb): Add lookup predicated values by 
equivalenc~/tmp/time-frees.
(struct unwind_state): Unwind state of back-refs to vn_nary_op_t.
(do_unwind): Unwind the back-refs to vn_nary_op_t.
(do_rpo_vn): Update back-reference unwind state.
* tree-ssa-sccvn.h (struct nary_ref): hold a lists of references to the
nary map entries.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/pr68619-2.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-1.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-2.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-3.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-5.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-7.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-8.c: Disable fre.
* gcc.dg/tree-ssa/pr71947-9.c: Disable fre.
* gcc.dg/tree-ssa/vrp03.c: Disable fre.
* gcc.dg/tree-ssa/ssa-fre-100.c: New test.
* gcc.dg/tree-ssa/ssa-fre-101.c: New test.
* gcc.dg/tree-ssa/ssa-fre-102.c: New test.
* gcc.dg/tree-ssa/ssa-pre-34.c: New test.


v3-tree-optimization-101186.patch
Description: v3-tree-optimization-101186.patch


RE: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2021-12-02 Thread Di Zhao OS via Gcc-patches
I'm very sorry there seems to be encoding issue in the attachment
in my last email. Attached is the new patch.  

Thanks,
Di Zhao

> -Original Message-
> From: Di Zhao OS
> Sent: Tuesday, November 16, 2021 1:24 AM
> To: 'Richard Biener' 
> Cc: gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH v2] tree-optimization/101186 - extend FRE with
> "equivalence map" for condition prediction
> 
> Attached is the updated patch. Fixed some errors in testcases.
> 
> > -Original Message-
> > From: Richard Biener 
> > Sent: Wednesday, November 10, 2021 5:44 PM
> > To: Di Zhao OS 
> > Cc: gcc-patches@gcc.gnu.org; Andrew MacLeod 
> > Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with
> > "equivalence map" for condition prediction
> >
> > On Sun, Oct 24, 2021 at 9:03 PM Di Zhao OS
> >  wrote:
> > >
> > > Hi,
> > >
> > > Attached is a new version of the patch, mainly for improving performance
> > > and simplifying the code.
> >
> > The patch doesn't apply anymore, can you update it please?
> >
> > I see the new ssa-fre-101.c test already passing without the patch.
> 
> It was a mistake in test ssa-fre-101.c::g to define the variables with
> the unsigned integers, in this way "a >= 0" is always true. After modified
> the case, now fre1 in the patch can remove unreachable call "foo ()", and
> evrp on trunk does not.
> 
> > Likewise ssa-fre-100.c and ssa-fre-102.c would PASS if you scan
> > the pass dump after fre1 which is evrp so it seems that evrp already
> > handles the equivalences (likely with the relation oracle) now?
> > I'm sure there are second order effects when eliminating conditions
> > in FRE but did you re-evaluate what made you improve VN to see
> > if the cases are handled as expected now without this change?
> 
> In case ssa-fre-102.c, the unreachable call "foo ()" can be removed by
> evrp, but fre in the patch can additionally replace "g = x + b" with
> "g = 99". (Again I'm sorry the regex to check this was wrong..)
> 
> Test cases to simulate the original problem I found are moved into
> gcc.dg/tree-ssa/ssa-pre-34.c. The unreachable calls to "foo ()" are
> still removed by pre with the patch. (If compiled with -O3, the
> "foo ()"s in test file can now be removed by thread2/threadfull2 and
> dom3 on trunk. This relies on jump threading across the loops, so
> even with -O3, similar cases may not get optimized say if there're
> too many statements to copy.)
> 
> Thanks,
> Di Zhao
> 
> >
> > I will still look at and consider the change btw, but given the EVRP
> > improvements I'm also considering to remove the predication
> > support from VN alltogether.  At least in the non-iterating mode
> > it should be trivially easy to use rangers relation oracle to simplify
> > predicates.  For the iterating mode it might not be 100% effective
> > since I'm not sure we can make it use the current SSA values and
> > how it would behave with those eventually changing to worse.
> >
> > Andrew, how would one ask the relation oracle to simplify a
> > condition?  Do I have to do any bookkeeping to register
> > predicates on edges for it?
> >
> > Thanks,
> > Richard.
> >
> > > First, regarding the comments:
> > >
> > > > -Original Message-
> > > > From: Richard Biener 
> > > > Sent: Friday, October 1, 2021 9:00 PM
> > > > To: Di Zhao OS 
> > > > Cc: gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with
> > > > "equivalence map" for condition prediction
> > > >
> > > > On Thu, Sep 16, 2021 at 8:13 PM Di Zhao OS
> > > >  wrote:
> > > > >
> > > > > Sorry about updating on this after so long. It took me much time to 
> > > > > work out a
> > > > > new plan and pass the tests.
> > > > >
> > > > > The new idea is to use one variable to represent a set of equal 
> > > > > variables at
> > > > > some basic-block. This variable is called a "equivalence head" or 
> > > > > "equiv-head"
> > > > > in the code. (There's no-longer a "equivalence map".)
> > > > >
> > > > > - Initially an SSA_NAME's "equivalence head" is its value number. 
> > > > > Temporary
> > > > >   equivalence heads are recorded as unary NOP_EXPR results in the 
> > > > > vn_nary_op_t
> > > > >   map. Besides, when inserting into vn_nary_op_t map, make the new 
> > > > > result at
> > > > >   front of the vn_pval list, so that when searching for a variable's
> > > > >   equivalence head, the first result represents the largest 
> > > > > equivalence set at
> > > > >   current location.
> > > > > - In vn_ssa_aux_t, maintain a list of references to valid_info->nary 
> > > > > entry.
> > > > >   For recorded equivalences, the reference is result->entry; for 
> > > > > normal N-ary
> > > > >   operations, the reference is operand->entry.
> > > > > - When recording equivalences, if one side A is constant or has more 
> > > > > refs, make
> > > > >   it the new equivalence head of the other side B. Traverse B's ref- 
> > > > > list,if a
> > > > >   variable C's previous equiv-head is B, update to 

RE: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2021-11-15 Thread Di Zhao OS via Gcc-patches
Attached is the updated patch. Fixed some errors in testcases.

> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, November 10, 2021 5:44 PM
> To: Di Zhao OS 
> Cc: gcc-patches@gcc.gnu.org; Andrew MacLeod 
> Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with
> "equivalence map" for condition prediction
> 
> On Sun, Oct 24, 2021 at 9:03 PM Di Zhao OS
>  wrote:
> >
> > Hi,
> >
> > Attached is a new version of the patch, mainly for improving performance
> > and simplifying the code.
> 
> The patch doesn't apply anymore, can you update it please?
> 
> I see the new ssa-fre-101.c test already passing without the patch.

It was a mistake in test ssa-fre-101.c::g to define the variables with
the unsigned integers, in this way "a >= 0" is always true. After modified
the case, now fre1 in the patch can remove unreachable call "foo ()", and
evrp on trunk does not.

> Likewise ssa-fre-100.c and ssa-fre-102.c would PASS if you scan
> the pass dump after fre1 which is evrp so it seems that evrp already
> handles the equivalences (likely with the relation oracle) now?
> I'm sure there are second order effects when eliminating conditions
> in FRE but did you re-evaluate what made you improve VN to see
> if the cases are handled as expected now without this change?

In case ssa-fre-102.c, the unreachable call "foo ()" can be removed by
evrp, but fre in the patch can additionally replace "g = x + b" with
"g = 99". (Again I'm sorry the regex to check this was wrong..)

Test cases to simulate the original problem I found are moved into
gcc.dg/tree-ssa/ssa-pre-34.c. The unreachable calls to "foo ()" are
still removed by pre with the patch. (If compiled with -O3, the
"foo ()"s in test file can now be removed by thread2/threadfull2 and
dom3 on trunk. This relies on jump threading across the loops, so
even with -O3, similar cases may not get optimized say if there're
too many statements to copy.)

Thanks,
Di Zhao

> 
> I will still look at and consider the change btw, but given the EVRP
> improvements I'm also considering to remove the predication
> support from VN alltogether.  At least in the non-iterating mode
> it should be trivially easy to use rangers relation oracle to simplify
> predicates.  For the iterating mode it might not be 100% effective
> since I'm not sure we can make it use the current SSA values and
> how it would behave with those eventually changing to worse.
> 
> Andrew, how would one ask the relation oracle to simplify a
> condition?  Do I have to do any bookkeeping to register
> predicates on edges for it?
> 
> Thanks,
> Richard.
> 
> > First, regarding the comments:
> >
> > > -Original Message-
> > > From: Richard Biener 
> > > Sent: Friday, October 1, 2021 9:00 PM
> > > To: Di Zhao OS 
> > > Cc: gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with
> > > "equivalence map" for condition prediction
> > >
> > > On Thu, Sep 16, 2021 at 8:13 PM Di Zhao OS
> > >  wrote:
> > > >
> > > > Sorry about updating on this after so long. It took me much time to 
> > > > work out a
> > > > new plan and pass the tests.
> > > >
> > > > The new idea is to use one variable to represent a set of equal 
> > > > variables at
> > > > some basic-block. This variable is called a "equivalence head" or 
> > > > "equiv-head"
> > > > in the code. (There's no-longer a "equivalence map".)
> > > >
> > > > - Initially an SSA_NAME's "equivalence head" is its value number. 
> > > > Temporary
> > > >   equivalence heads are recorded as unary NOP_EXPR results in the 
> > > > vn_nary_op_t
> > > >   map. Besides, when inserting into vn_nary_op_t map, make the new 
> > > > result at
> > > >   front of the vn_pval list, so that when searching for a variable's
> > > >   equivalence head, the first result represents the largest equivalence 
> > > > set at
> > > >   current location.
> > > > - In vn_ssa_aux_t, maintain a list of references to valid_info->nary 
> > > > entry.
> > > >   For recorded equivalences, the reference is result->entry; for normal 
> > > > N-ary
> > > >   operations, the reference is operand->entry.
> > > > - When recording equivalences, if one side A is constant or has more 
> > > > refs, make
> > > >   it the new equivalence head of the other side B. Traverse B's 
> > > > ref-list,if a
> > > >   variable C's previous equiv-head is B, update to A. And re-insert B's 
> > > > n-ary
> > > >   operations by replacing B with A.
> > > > - When inserting and looking for the results of n-ary operations, 
> > > > insert and
> > > >   lookup by the operands' equiv-heads.
> > > > ...
> > > >
> > > > Thanks,
> > > > Di Zhao
> > > >
> > > > 
> > > > Extend FRE with temporary equivalences.
> > >
> > > Comments on the patch:
> > >
> > > +  /* nary_ref count.  */
> > > +  unsigned num_nary_ref;
> > > +
> > >
> > > I think a unsigned short should be enough and that would nicely
> > > pack after value_id together with the bitfield (maybe change 

[PING] [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2021-11-07 Thread Di Zhao OS via Gcc-patches
Hi,

Gentle ping on this.

Di Zhao

-Original Message-
From: Di Zhao OS 
Sent: Monday, October 25, 2021 3:03 AM
To: Richard Biener 
Cc: gcc-patches@gcc.gnu.org
Subject: RE: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence 
map" for condition prediction

Hi,

Attached is a new version of the patch, mainly for improving performance
and simplifying the code.

First, regarding the comments:

> -Original Message-
> From: Richard Biener 
> Sent: Friday, October 1, 2021 9:00 PM
> To: Di Zhao OS 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with
> "equivalence map" for condition prediction
> 
> On Thu, Sep 16, 2021 at 8:13 PM Di Zhao OS
>  wrote:
> >
> > Sorry about updating on this after so long. It took me much time to work 
> > out a
> > new plan and pass the tests.
> >
> > The new idea is to use one variable to represent a set of equal variables at
> > some basic-block. This variable is called a "equivalence head" or 
> > "equiv-head"
> > in the code. (There's no-longer a "equivalence map".)
> >
> > - Initially an SSA_NAME's "equivalence head" is its value number. Temporary
> >   equivalence heads are recorded as unary NOP_EXPR results in the 
> > vn_nary_op_t
> >   map. Besides, when inserting into vn_nary_op_t map, make the new result at
> >   front of the vn_pval list, so that when searching for a variable's
> >   equivalence head, the first result represents the largest equivalence set 
> > at
> >   current location.
> > - In vn_ssa_aux_t, maintain a list of references to valid_info->nary entry.
> >   For recorded equivalences, the reference is result->entry; for normal 
> > N-ary
> >   operations, the reference is operand->entry.
> > - When recording equivalences, if one side A is constant or has more refs, 
> > make
> >   it the new equivalence head of the other side B. Traverse B's ref-list, 
> > if a
> >   variable C's previous equiv-head is B, update to A. And re-insert B's 
> > n-ary
> >   operations by replacing B with A.
> > - When inserting and looking for the results of n-ary operations, insert and
> >   lookup by the operands' equiv-heads.
> > ...
> >
> > Thanks,
> > Di Zhao
> >
> > 
> > Extend FRE with temporary equivalences.
> 
> Comments on the patch:
> 
> +  /* nary_ref count.  */
> +  unsigned num_nary_ref;
> +
> 
> I think a unsigned short should be enough and that would nicely
> pack after value_id together with the bitfield (maybe change that
> to unsigned short :1 then).

Changed num_nary_ref to unsigned short and moved after value_id.

> @@ -7307,17 +7839,23 @@ process_bb (rpo_elim , basic_block bb,
> tree val = gimple_simplify (gimple_cond_code (last),
> boolean_type_node, lhs, rhs,
> NULL, vn_valueize);
> +   vn_nary_op_t vnresult = NULL;
> /* If the condition didn't simplfy see if we have recorded
>an expression from sofar taken edges.  */
> if (! val || TREE_CODE (val) != INTEGER_CST)
>   {
> -   vn_nary_op_t vnresult;
> 
> looks like you don't need vnresult outside of the if()?

vnresult is reused later to record equivalences generated by PHI nodes.

> +/* Find predicated value of vn_nary_op by the operands' equivalences.  Return
> + * NULL_TREE if no known result is found.  */
> +
> +static tree
> +find_predicated_value_by_equivs (vn_nary_op_t vno, basic_block bb,
> +vn_nary_op_t *vnresult)
> +{
> +  lookup_equiv_heads (vno->length, vno->op, vno->op, bb);
> +  tree result
> += simplify_nary_op (vno->length, vno->opcode, vno->op, vno->type);
> 
> why is it necessary to simplify here?  It looks like the caller
> already does this.

In the new patch, changed the code a little to remove redundant calculation.

> I wonder whether it's valid to always perform find_predicated_value_by_equivs
> from inside vn_nary_op_get_predicated_value instead of bolting it to only
> a single place?

Removed function find_predicated_value_by_equivs and inlined the code.

Because lookup_equiv_head uses vn_nary_op_get_predicated_value, so I left
vn_nary_op_get_predicated_value unchanged. Instead, operands are set to
equiv-heads in init_vn_nary_op_from_stmt. So altogether, predicates are always
inserted and searched by equiv-heads.

> +
> +static vn_nary_op_t
> +val_equiv_insert (tree op1, tree op2, edge e)
> +{
> 
> +  if (is_gimple_min_invariant (lhs))
> +std::swap (lhs, rhs);
> +  if (is_gimple_min_invariant (lhs) || TREE_CODE (lhs) != SSA_NAME)
> +/* Possible if invoked from record_equiv_from_previous_cond.  */
> +return NULL;
> 
> Better formulate all of the above in terms of only SSA_NAME checks since...
> 
> +  /* Make the hand-side with more recorded n-ary expressions new
> +   * equivalence-head, to make fewer re-insertions.  */
> +  if (TREE_CODE (rhs) == SSA_NAME
> +  && VN_INFO 

RE: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2021-10-24 Thread Di Zhao OS via Gcc-patches
Hi,

Attached is a new version of the patch, mainly for improving performance
and simplifying the code.

First, regarding the comments:

> -Original Message-
> From: Richard Biener 
> Sent: Friday, October 1, 2021 9:00 PM
> To: Di Zhao OS 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v2] tree-optimization/101186 - extend FRE with
> "equivalence map" for condition prediction
> 
> On Thu, Sep 16, 2021 at 8:13 PM Di Zhao OS
>  wrote:
> >
> > Sorry about updating on this after so long. It took me much time to work 
> > out a
> > new plan and pass the tests.
> >
> > The new idea is to use one variable to represent a set of equal variables at
> > some basic-block. This variable is called a "equivalence head" or 
> > "equiv-head"
> > in the code. (There's no-longer a "equivalence map".)
> >
> > - Initially an SSA_NAME's "equivalence head" is its value number. Temporary
> >   equivalence heads are recorded as unary NOP_EXPR results in the 
> > vn_nary_op_t
> >   map. Besides, when inserting into vn_nary_op_t map, make the new result at
> >   front of the vn_pval list, so that when searching for a variable's
> >   equivalence head, the first result represents the largest equivalence set 
> > at
> >   current location.
> > - In vn_ssa_aux_t, maintain a list of references to valid_info->nary entry.
> >   For recorded equivalences, the reference is result->entry; for normal 
> > N-ary
> >   operations, the reference is operand->entry.
> > - When recording equivalences, if one side A is constant or has more refs, 
> > make
> >   it the new equivalence head of the other side B. Traverse B's ref-list, 
> > if a
> >   variable C's previous equiv-head is B, update to A. And re-insert B's 
> > n-ary
> >   operations by replacing B with A.
> > - When inserting and looking for the results of n-ary operations, insert and
> >   lookup by the operands' equiv-heads.
> > ...
> >
> > Thanks,
> > Di Zhao
> >
> > 
> > Extend FRE with temporary equivalences.
> 
> Comments on the patch:
> 
> +  /* nary_ref count.  */
> +  unsigned num_nary_ref;
> +
> 
> I think a unsigned short should be enough and that would nicely
> pack after value_id together with the bitfield (maybe change that
> to unsigned short :1 then).

Changed num_nary_ref to unsigned short and moved after value_id.

> @@ -7307,17 +7839,23 @@ process_bb (rpo_elim , basic_block bb,
> tree val = gimple_simplify (gimple_cond_code (last),
> boolean_type_node, lhs, rhs,
> NULL, vn_valueize);
> +   vn_nary_op_t vnresult = NULL;
> /* If the condition didn't simplfy see if we have recorded
>an expression from sofar taken edges.  */
> if (! val || TREE_CODE (val) != INTEGER_CST)
>   {
> -   vn_nary_op_t vnresult;
> 
> looks like you don't need vnresult outside of the if()?

vnresult is reused later to record equivalences generated by PHI nodes.

> +/* Find predicated value of vn_nary_op by the operands' equivalences.  Return
> + * NULL_TREE if no known result is found.  */
> +
> +static tree
> +find_predicated_value_by_equivs (vn_nary_op_t vno, basic_block bb,
> +vn_nary_op_t *vnresult)
> +{
> +  lookup_equiv_heads (vno->length, vno->op, vno->op, bb);
> +  tree result
> += simplify_nary_op (vno->length, vno->opcode, vno->op, vno->type);
> 
> why is it necessary to simplify here?  It looks like the caller
> already does this.

In the new patch, changed the code a little to remove redundant calculation.

> I wonder whether it's valid to always perform find_predicated_value_by_equivs
> from inside vn_nary_op_get_predicated_value instead of bolting it to only
> a single place?

Removed function find_predicated_value_by_equivs and inlined the code.

Because lookup_equiv_head uses vn_nary_op_get_predicated_value, so I left
vn_nary_op_get_predicated_value unchanged. Instead, operands are set to
equiv-heads in init_vn_nary_op_from_stmt. So altogether, predicates are always
inserted and searched by equiv-heads.

> +
> +static vn_nary_op_t
> +val_equiv_insert (tree op1, tree op2, edge e)
> +{
> 
> +  if (is_gimple_min_invariant (lhs))
> +std::swap (lhs, rhs);
> +  if (is_gimple_min_invariant (lhs) || TREE_CODE (lhs) != SSA_NAME)
> +/* Possible if invoked from record_equiv_from_previous_cond.  */
> +return NULL;
> 
> Better formulate all of the above in terms of only SSA_NAME checks since...
> 
> +  /* Make the hand-side with more recorded n-ary expressions new
> +   * equivalence-head, to make fewer re-insertions.  */
> +  if (TREE_CODE (rhs) == SSA_NAME
> +  && VN_INFO (rhs)->num_nary_ref < VN_INFO (lhs)->num_nary_ref)
> +std::swap (lhs, rhs);
> 
> here LHS needs to be an SSA_NAME.

Tried to fix this in the new patch.

> +  /* Record equivalence as unary NOP_EXPR.  */
> +  vn_nary_op_t val
> += vn_nary_op_insert_pieces_predicated_1 (1, NOP_EXPR, 

PING: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2021-09-30 Thread Di Zhao OS via Gcc-patches
Thanks,

Di

-Original Message-
From: Gcc-patches 
 On Behalf Of Di 
Zhao OS via Gcc-patches
Sent: Friday, September 17, 2021 2:13 AM
To: gcc-patches@gcc.gnu.org
Subject: [PATCH v2] tree-optimization/101186 - extend FRE with "equivalence 
map" for condition prediction

Sorry about updating on this after so long. It took me much time to work out a
new plan and pass the tests.

The new idea is to use one variable to represent a set of equal variables at
some basic-block. This variable is called a "equivalence head" or "equiv-head"
in the code. (There's no-longer a "equivalence map".)

- Initially an SSA_NAME's "equivalence head" is its value number. Temporary
  equivalence heads are recorded as unary NOP_EXPR results in the vn_nary_op_t
  map. Besides, when inserting into vn_nary_op_t map, make the new result at
  front of the vn_pval list, so that when searching for a variable's
  equivalence head, the first result represents the largest equivalence set at
  current location.
- In vn_ssa_aux_t, maintain a list of references to valid_info->nary entry.
  For recorded equivalences, the reference is result->entry; for normal N-ary
  operations, the reference is operand->entry.
- When recording equivalences, if one side A is constant or has more refs, make
  it the new equivalence head of the other side B. Traverse B's ref-list, if a
  variable C's previous equiv-head is B, update to A. And re-insert B's n-ary
  operations by replacing B with A.
- When inserting and looking for the results of n-ary operations, insert and
  lookup by the operands' equiv-heads.

So except for the refs in vn_ssa_aux_t, this scheme uses the original
infrastructure to its best. Quadric search time is avoided at the cost of some
re-insertions. Test results on SPEC2017 intrate (counts and percentages):

|more bb |more bb |more stmt|more stmt|more   |more
|removed |removed |removed  |removed  |nv_nary_ops|nv_nary_ops
|at fre1 |at fre1 |at fre1  |at fre1  |inserted   |inserted
--
 500.perlbench_r| 64 | 1.98%  | 103 | 0.19%   | 11260 | 12.16%
 502.gcc_r  | 671| 4.80%  | 317 | 0.23%   | 13964 | 6.09%
 505.mcf_r  | 5  | 35.71% | 9   | 1.40%   | 32| 2.52%
 520.omnetpp| 132| 5.45%  | 39  | 0.11%   | 1895  | 3.91%
 523.xalancbmk_r| 238| 3.26%  | 313 | 0.36%   | 1417  | 1.27%
 525.x264_r | 4  | 1.36%  | 27  | 0.11%   | 1752  | 6.78%
 531.deepsjeng_r| 1  | 3.45%  | 2   | 0.14%   | 228   | 8.67%
 541.leela_r| 2  | 0.63%  | 0   | 0.00%   | 92| 1.14%
 548.exchange2_r| 0  | 0.00%  | 3   | 0.04%   | 49| 1.03%
 557.xz_r   | 0  | 0.00%  | 3   | 0.07%   | 272   | 7.55%

There're more basic_blocks and statements removed compared with last
implementation, the reasons are:
1) "CONST op CONST" simplification is included. It is missed in previous patch.
2) By inserting RHS of statements on equiv-heads, more N-ary operations can be
   simplified. One example is in 'ssa-fre-97.c' in the patch file.

While jump-threading & vrp also utilize temporary equivalences (so some of the
newly removed blocks and statements can also be covered by them), I think this
patch is a supplement, in cases when jump threading cannot take place (the
original example), or value number info needs to be involved (the
'ssa-fre-97.c' example).

Fixed the former issue with non-iterate mode.

About recording the temporary equivalences generated by PHIs (i.e. the
'record_equiv_from_previous_cond' stuff), I have to admit it looks strange and
the code size is large, but I haven't find a better way yet. Consider a piece
of CFG like the one below, if we want to record x==x2 on the true edge when
processing bb1, the location (following current practice) will be bb2. But that
is not useful at bb5 or bb6, because bb2 doesn't dominate them. And I can't
find a place to record x==x1 when processing bb1.
If we can record things on edges rather than blocks, say x==x1 on 1->3 and
x==x2 on 1->2, then perhaps with an extra check for "a!=0", x2 can be a valid
equiv-head for x since bb5. But I think it lacks efficiency and is not
persuasive. It is more efficient to find a valid previous predicate when
processing bb4, because the vn_nary_op_t will be fetched anyway.
--
| if (a != 0) | bb1
--
f |  \ t
  |---
  || bb2 | 
  |---
  |  /
-
| x = PHI | bb3
-
   |
  
   |
   --
   | if (a != 0) | bb4
   --
   |f \t
-  ---
bb7 | where |  | bb5 |  ==&g

[PATCH v2] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2021-09-16 Thread Di Zhao OS via Gcc-patches
Sorry about updating on this after so long. It took me much time to work out a
new plan and pass the tests.

The new idea is to use one variable to represent a set of equal variables at
some basic-block. This variable is called a "equivalence head" or "equiv-head"
in the code. (There's no-longer a "equivalence map".)

- Initially an SSA_NAME's "equivalence head" is its value number. Temporary
  equivalence heads are recorded as unary NOP_EXPR results in the vn_nary_op_t
  map. Besides, when inserting into vn_nary_op_t map, make the new result at
  front of the vn_pval list, so that when searching for a variable's
  equivalence head, the first result represents the largest equivalence set at
  current location.
- In vn_ssa_aux_t, maintain a list of references to valid_info->nary entry.
  For recorded equivalences, the reference is result->entry; for normal N-ary
  operations, the reference is operand->entry.
- When recording equivalences, if one side A is constant or has more refs, make
  it the new equivalence head of the other side B. Traverse B's ref-list, if a
  variable C's previous equiv-head is B, update to A. And re-insert B's n-ary
  operations by replacing B with A.
- When inserting and looking for the results of n-ary operations, insert and
  lookup by the operands' equiv-heads.

So except for the refs in vn_ssa_aux_t, this scheme uses the original
infrastructure to its best. Quadric search time is avoided at the cost of some
re-insertions. Test results on SPEC2017 intrate (counts and percentages):

|more bb |more bb |more stmt|more stmt|more   |more
|removed |removed |removed  |removed  |nv_nary_ops|nv_nary_ops
|at fre1 |at fre1 |at fre1  |at fre1  |inserted   |inserted
--
 500.perlbench_r| 64 | 1.98%  | 103 | 0.19%   | 11260 | 12.16%
 502.gcc_r  | 671| 4.80%  | 317 | 0.23%   | 13964 | 6.09%
 505.mcf_r  | 5  | 35.71% | 9   | 1.40%   | 32| 2.52%
 520.omnetpp| 132| 5.45%  | 39  | 0.11%   | 1895  | 3.91%
 523.xalancbmk_r| 238| 3.26%  | 313 | 0.36%   | 1417  | 1.27%
 525.x264_r | 4  | 1.36%  | 27  | 0.11%   | 1752  | 6.78%
 531.deepsjeng_r| 1  | 3.45%  | 2   | 0.14%   | 228   | 8.67%
 541.leela_r| 2  | 0.63%  | 0   | 0.00%   | 92| 1.14%
 548.exchange2_r| 0  | 0.00%  | 3   | 0.04%   | 49| 1.03%
 557.xz_r   | 0  | 0.00%  | 3   | 0.07%   | 272   | 7.55%

There're more basic_blocks and statements removed compared with last
implementation, the reasons are:
1) "CONST op CONST" simplification is included. It is missed in previous patch.
2) By inserting RHS of statements on equiv-heads, more N-ary operations can be
   simplified. One example is in 'ssa-fre-97.c' in the patch file.

While jump-threading & vrp also utilize temporary equivalences (so some of the
newly removed blocks and statements can also be covered by them), I think this
patch is a supplement, in cases when jump threading cannot take place (the
original example), or value number info needs to be involved (the
'ssa-fre-97.c' example).

Fixed the former issue with non-iterate mode.

About recording the temporary equivalences generated by PHIs (i.e. the
'record_equiv_from_previous_cond' stuff), I have to admit it looks strange and
the code size is large, but I haven't find a better way yet. Consider a piece
of CFG like the one below, if we want to record x==x2 on the true edge when
processing bb1, the location (following current practice) will be bb2. But that
is not useful at bb5 or bb6, because bb2 doesn't dominate them. And I can't
find a place to record x==x1 when processing bb1.
If we can record things on edges rather than blocks, say x==x1 on 1->3 and
x==x2 on 1->2, then perhaps with an extra check for "a!=0", x2 can be a valid
equiv-head for x since bb5. But I think it lacks efficiency and is not
persuasive. It is more efficient to find a valid previous predicate when
processing bb4, because the vn_nary_op_t will be fetched anyway.
--
| if (a != 0) | bb1
--
f |  \ t
  |---
  || bb2 | 
  |---
  |  /
-
| x = PHI | bb3
-
   |
  
   |
   --
   | if (a != 0) | bb4
   --
   |f \t
-  ---
bb7 | where |  | bb5 |  ==> where "x==x2" is recorded now
| "x==x1" is|  ---
| recorded  |\
| now   |...
- |
   ---
   | bb6 |  ==> where "x==x2" needs to be used
   ---
Although I think I can remove the 'dominator_to_phi_map' and generalize this a
little, but the major logic will be similar. So I 

[PATCH] tree-optimization/102183 - sccvn: fix result compare in vn_nary_op_insert_into

2021-09-05 Thread Di Zhao OS via Gcc-patches
If the first predicate value is different and copied, the comparison will then 
be between val->result and the copied one, which seems to be a bug. That can 
cause inserting extra vn_pvals.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Regards,
Di Zhao

gcc/ChangeLog:

* tree-ssa-sccvn.c (vn_nary_op_insert_into): fix result compare
---
 gcc/tree-ssa-sccvn.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 2357bbdbf90..bfa516b6cf9 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -4105,7 +4105,7 @@ vn_nary_op_insert_into (vn_nary_op_t vno, 
vn_nary_op_table_type *table,
  bool found = false;
  for (vn_pval *val = (*slot)->u.values; val; val = val->next)
{
- if (expressions_equal_p (val->result, vno->u.values->result))
+ if (expressions_equal_p (val->result, nval->result))
{
  found = true;
  for (unsigned i = 0; i < val->n; ++i)
-- 
2.25.1





RE: [PATCH] tree-optimization/101186 - extend FRE with "equivalence map" for condition prediction

2021-07-18 Thread Di Zhao OS via Gcc-patches

I tried to improve the patch following your advices and to catch more
opportunities. Hope it'll be helpful.

On 6/24/21 8:29 AM, Richard Biener wrote: 
> On Thu, Jun 24, 2021 at 11:55 AM Di Zhao via Gcc-patches  patc...@gcc.gnu.org> wrote:
> 
> I have some reservations about extending the ad-hoc "predicated value" code.
> 
> Some comments on the patch:
> 
> +/* hashtable & helpers to record equivalences at given bb.  */
> +
> +typedef struct val_equiv_s
> +{
> +  val_equiv_s *next;
> +  val_equiv_s *unwind_to;
> +  hashval_t hashcode;
> +  /* SSA name this val_equiv_s is associated with.  */
> +  tree name;
> +  /* RESULT in a vn_pval entry is SSA name of a equivalence.  */
> +  vn_pval *values;
> +} * val_equiv_t;
> 
> all of this (and using a hashtable for recording) is IMHO a bit overkill.
> Since you only ever record equivalences for values the more natural place to
> hook those in is the vn_ssa_aux structure where we also record the 
> availability
> chain.
 
I tried to store the equivalences in the vn_ssa_aux structure, but I didn't  
optimize the second case successfully: I need to record the equivalence
of a PHI expression's result and arguments, but their value number results will
become VARYING first, so they won't be changed. Maybe I'm missing something, or
can I force change a VARYING result?

Besides, the way currently used, equivalences only need to be "predictable"
rather than available, maybe availability chains do not represent them very
well?

> There's little commentary in the new code, in particular function-level
> comments are missing everywhere.

Added more comments.

> There's complexity issues, like I see val_equiv_insert has a "recurse"
> feature but also find_predicated_value_by_equiv is quadratic in the number of
> equivalences of the lhs/rhs.  Without knowing what the recursion on the
> former is for - nothing tells me - I suspect either of both should be 
> redundant.

The intention was, given {A==B, B==C, X==Y, Y==Z} and a previous result of
"C opcode Z", to find the result of "A opcode Y". I removed the "recurse"
feature and modified the searching logic so solve the issue. Now a temporary
hash_set is used to record the equivalences that are visited when searching.

> You seem to record equivalences at possible use points which looks odd at best
> - I'd expected equivalences being recorded at the same point we record
> predicated values and for the current condition, not the one determining some
> other predication.
> What was the motivation to do it the way you do it?

The purpose is to "bring down" what can be known from a previous basic-block
that effectively dominates current block, but not actually does so (in the
example it is because jump threading is hindered by a loop). For example in
this case:

  if (a != 0)
  // Nothing useful can be recorded here, because this BB doesn't dominate
  // the BB that we want to simplify.
  c = b;
  for (unsigned i = 0; i < c; i++)
{
  if (a != 0)  // The recording is triggered here.
   {
 // c == b will be recorded here, so it can be used for simplification.
 // In gimple it is the equivalence of a PHI's result and argument.
 if (i >= b) 
   foo ();

These requires finding a previous condition that is identical with current
one, so it is convenient to do this in FRE. Besides, as FRE records derived
predicate, so for relative conditions there also might be opportunities for
optimization. In the new patch code this is included.

Besides, to find more opportunities, added a hashmap to store mappings from
immediate dominators to basic-blocks with PHIs of interest.

> Why is the code conditional on 'iterate'?

I haven't worked it out to fit the non-iterate mode, so it now breaks the
if_conversion pass. I think this is because some of the equivalence-recordings
are too optimistic for non-iterate mode.

> You've probably realized that there's no "nice" way to handle temporary
> equivalences in a VN scheme using hashing for expressions (unless you degrade
> hashes a lot).

I modified the code to use TREE_HASH on ssa names. Would that be better?  

> You quote opportunities that are catched with this like
> 
> +  if (a != 0)
> +{
> +  c = b;
> +}
> +  for (unsigned i = 0; i < c; i++)
> +{
> +  if (a != 0)
> +   {
> + if (i >= b)
> +   /* Should be eliminated.
> +*/
> +   foo ();
> 
> but say other "obvious" cases like
> 
> +  if (a != 0)
> +{
> +  c = b;
> +}
> +  for (unsigned i = 0; i < c; i++)
> +{
> +  if (a != 0)
> +   {
> +   /* Should be zero.  */
>  return b - c;
> 
> are not handled.  That's in line with the current "value predication"
> which mainly aims at catching simple jump threading opportunities; you only
> simplify conditions with the recorded equivalences.  But then the complexity 
> of
> handling equivalences does probably not outweight the opportunities catched -