Re: [PATCH V5 1/2] Add overflow API for plus minus mult on range

2023-08-30 Thread guojiufu via Gcc-patches

On 2023-08-03 21:18, Andrew MacLeod wrote:

This is OK.



Thanks a lot!  Committed via r14-3582.


BR,
Jeff (Jiufu Guo)



On 8/2/23 22:18, Jiufu Guo wrote:

Hi,

I would like to have a ping on this patch.

BR,
Jeff (Jiufu Guo)


Jiufu Guo  writes:


Hi,

As discussed in previous reviews, adding overflow APIs to range-op
would be useful.  These APIs help check whether overflow happens when
operating on two ranges, e.g. for plus, minus, and mult.
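
A minimal usage sketch (illustration only, not part of the patch; it
assumes the operand ranges were already computed, e.g. via range_of_expr):

  int_range<2> vr1, vr2;
  /* ... fill vr1 and vr2 via range_of_expr ... */
  if (range_op_handler (PLUS_EXPR).overflow_free_p (vr1, vr2))
    ;  /* vr1 + vr2 cannot overflow for any values in the two ranges.  */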

Previous discussions are here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624067.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624701.html

Bootstrap & regtest pass on ppc64{,le} and x86_64.
Is this patch ok for trunk?

BR,
Jeff (Jiufu Guo)

gcc/ChangeLog:

	* range-op-mixed.h (operator_plus::overflow_free_p): New declaration.
	(operator_minus::overflow_free_p): New declaration.
	(operator_mult::overflow_free_p): New declaration.
	* range-op.cc (range_op_handler::overflow_free_p): New function.
	(range_operator::overflow_free_p): New default function.
	(operator_plus::overflow_free_p): New function.
	(operator_minus::overflow_free_p): New function.
	(operator_mult::overflow_free_p): New function.
	* range-op.h (range_op_handler::overflow_free_p): New declaration.
	(range_operator::overflow_free_p): New declaration.
	* value-range.cc (irange::nonnegative_p): New function.
	(irange::nonpositive_p): New function.
	* value-range.h (irange::nonnegative_p): New declaration.
	(irange::nonpositive_p): New declaration.

---
  gcc/range-op-mixed.h |  11 ++++
  gcc/range-op.cc      | 124 ++++++++++++++++++++++++++++++++++++++++++++++
  gcc/range-op.h       |   5 ++
  gcc/value-range.cc   |  12 +++++
  gcc/value-range.h    |   2 +
  5 files changed, 154 insertions(+)

diff --git a/gcc/range-op-mixed.h b/gcc/range-op-mixed.h
index 6944742ecbc..42157ed9061 100644
--- a/gcc/range-op-mixed.h
+++ b/gcc/range-op-mixed.h
@@ -383,6 +383,10 @@ public:
  relation_kind rel) const final override;
  void update_bitmask (irange &r, const irange &lh,
		       const irange &rh) const final override;
+
+  virtual bool overflow_free_p (const irange &lh, const irange &rh,
+				relation_trio = TRIO_VARYING) const;
+
  private:
  void wi_fold (irange &r, tree type, const wide_int &lh_lb,
		const wide_int &lh_ub, const wide_int &rh_lb,
@@ -446,6 +450,10 @@ public:
relation_kind rel) const final override;
  void update_bitmask (irange &r, const irange &lh,
		       const irange &rh) const final override;
+
+  virtual bool overflow_free_p (const irange &lh, const irange &rh,
+				relation_trio = TRIO_VARYING) const;
+
  private:
  void wi_fold (irange &r, tree type, const wide_int &lh_lb,
		const wide_int &lh_ub, const wide_int &rh_lb,
@@ -525,6 +533,9 @@ public:
		const REAL_VALUE_TYPE &lh_lb, const REAL_VALUE_TYPE &lh_ub,
		const REAL_VALUE_TYPE &rh_lb, const REAL_VALUE_TYPE &rh_ub,
relation_kind kind) const final override;
+  virtual bool overflow_free_p (const irange &lh, const irange &rh,
+				relation_trio = TRIO_VARYING) const;
+
  };
class operator_addr_expr : public range_operator
diff --git a/gcc/range-op.cc b/gcc/range-op.cc
index cb584314f4c..632b044331b 100644
--- a/gcc/range-op.cc
+++ b/gcc/range-op.cc
@@ -366,6 +366,22 @@ range_op_handler::op1_op2_relation (const vrange &lhs) const
     }
 }
+bool
+range_op_handler::overflow_free_p (const vrange &lh,
+				   const vrange &rh,
+				   relation_trio rel) const
+{
+  gcc_checking_assert (m_operator);
+  switch (dispatch_kind (lh, lh, rh))
+    {
+      case RO_III:
+	return m_operator->overflow_free_p (as_a <irange> (lh),
+					    as_a <irange> (rh),
+					    rel);
+      default:
+	return false;
+    }
+}
// Convert irange bitmasks into a VALUE MASK pair suitable for calling CCP.
@@ -688,6 +704,13 @@ range_operator::op1_op2_relation_effect (irange &lhs_range ATTRIBUTE_UNUSED,
   return false;
 }
+bool
+range_operator::overflow_free_p (const irange &, const irange &,
+				 relation_trio) const
+{
+  return false;
+}
+
  // Apply any known bitmask updates based on this operator.
void
@@ -4311,6 +4334,107 @@ range_op_table::initialize_integral_ops ()
}
+bool
+operator_plus::overflow_free_p (const irange &lh, const irange &rh,
+				relation_trio) const
+{
+  if (lh.undefined_p () || rh.undefined_p ())
+    return false;
+
+  tree type = lh.type ();
+  if (TYPE_OVERFLOW_UNDEFINED (type))
+    return true;
+
+  wi::overflow_type ovf;
+  signop sgn = TYPE_SIGN (type);
+  wide_int wmax0 = lh.upper_bound ();
+  wide_int wmax1 = rh.upper_bound ();
+  wi::add (wmax0, wmax1, sgn, &ovf);
+  if 

Ping^^ [PATCH V5 2/2] Optimize '(X - N * M) / N' to 'X / N - M' if valid

2023-08-22 Thread guojiufu via Gcc-patches

Hi,

I would like to have a gentle ping...

BR,
Jeff (Jiufu Guo)

On 2023-08-07 10:45, guojiufu via Gcc-patches wrote:

Hi,

Gentle ping...

On 2023-07-18 22:05, Jiufu Guo wrote:

Hi,

Integer expression "(X - N * M) / N" can be optimized to "X / N - M"
if there is no wrap/overflow/underflow and "X - N * M" has the same
sign as "X".
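
A small illustration (hypothetical source, assuming the ranges prove
the conditions): with unsigned x known to be >= 100,

  unsigned f (unsigned x) { return (x - 4 * 25) / 4; }

can be simplified to "x / 4 - 25", since "x - 100" cannot wrap and both
"x" and "x - 100" stay nonnegative.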

Compared with the previous version:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624067.html
- APIs: overflow, nonnegative_p and nonpositive_p are moved close
  to value range.
- Use above APIs in match.pd.

Bootstrap & regtest pass on ppc64{,le} and x86_64.
Is this patch ok for trunk?

BR,
Jeff (Jiufu Guo)

PR tree-optimization/108757

gcc/ChangeLog:

* match.pd ((X - N * M) / N): New pattern.
((X + N * M) / N): New pattern.
((X + C) div_rshift N): New pattern.

gcc/testsuite/ChangeLog:

* gcc.dg/pr108757-1.c: New test.
* gcc.dg/pr108757-2.c: New test.
* gcc.dg/pr108757.h: New test.

---
 gcc/match.pd  |  85 +++
 gcc/testsuite/gcc.dg/pr108757-1.c |  18 +++
 gcc/testsuite/gcc.dg/pr108757-2.c |  19 +++
 gcc/testsuite/gcc.dg/pr108757.h   | 233 ++++++++++++++++++++++++++++++
 4 files changed, 355 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr108757-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr108757-2.c
 create mode 100644 gcc/testsuite/gcc.dg/pr108757.h

diff --git a/gcc/match.pd b/gcc/match.pd
index 8543f777a28..39dbb0567dc 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -942,6 +942,91 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 #endif


+#if GIMPLE
+(for div (trunc_div exact_div)
+ /* Simplify (t + M*N) / N -> t / N + M.  */
+ (simplify
+  (div (plus:c@4 @0 (mult:c@3 @1 @2)) @2)
+  (with {value_range vr0, vr1, vr2, vr3, vr4;}
+  (if (INTEGRAL_TYPE_P (type)
+   && get_range_query (cfun)->range_of_expr (vr1, @1)
+   && get_range_query (cfun)->range_of_expr (vr2, @2)
+   && range_op_handler (MULT_EXPR).overflow_free_p (vr1, vr2)
+   && get_range_query (cfun)->range_of_expr (vr0, @0)
+   && get_range_query (cfun)->range_of_expr (vr3, @3)
+   && range_op_handler (PLUS_EXPR).overflow_free_p (vr0, vr3)
+   && get_range_query (cfun)->range_of_expr (vr4, @4)
+       && (TYPE_UNSIGNED (type)
+	   || (vr0.nonnegative_p () && vr4.nonnegative_p ())
+	   || (vr0.nonpositive_p () && vr4.nonpositive_p ())))
+  (plus (div @0 @2) @1))))
+
+ /* Simplify (t - M*N) / N -> t / N - M.  */
+ (simplify
+  (div (minus@4 @0 (mult:c@3 @1 @2)) @2)
+  (with {value_range vr0, vr1, vr2, vr3, vr4;}
+  (if (INTEGRAL_TYPE_P (type)
+   && get_range_query (cfun)->range_of_expr (vr1, @1)
+   && get_range_query (cfun)->range_of_expr (vr2, @2)
+   && range_op_handler (MULT_EXPR).overflow_free_p (vr1, vr2)
+   && get_range_query (cfun)->range_of_expr (vr0, @0)
+   && get_range_query (cfun)->range_of_expr (vr3, @3)
+   && range_op_handler (MINUS_EXPR).overflow_free_p (vr0, vr3)
+   && get_range_query (cfun)->range_of_expr (vr4, @4)
+       && (TYPE_UNSIGNED (type)
+	   || (vr0.nonnegative_p () && vr4.nonnegative_p ())
+	   || (vr0.nonpositive_p () && vr4.nonpositive_p ())))
+  (minus (div @0 @2) @1)))))
+
+/* Simplify
+   (t + C) / N -> t / N + C / N where C is multiple of N.
+   (t + C) >> N -> t >> N + C>>N if low N bits of C is 0.  */
+(for op (trunc_div exact_div rshift)
+ (simplify
+  (op (plus@3 @0 INTEGER_CST@1) INTEGER_CST@2)
+   (with
+{
+  wide_int c = wi::to_wide (@1);
+  wide_int n = wi::to_wide (@2);
+  bool is_rshift = op == RSHIFT_EXPR;
+  bool neg_c = false;
+  bool ok = false;
+  value_range vr0;
+  if (INTEGRAL_TYPE_P (type)
+ && get_range_query (cfun)->range_of_expr (vr0, @0))
+{
+ ok = is_rshift ? wi::ctz (c) >= n.to_shwi ()
+: wi::multiple_of_p (c, n, TYPE_SIGN (type));
+ value_range vr1, vr3;
+ ok = ok && get_range_query (cfun)->range_of_expr (vr1, @1)
+  && range_op_handler (PLUS_EXPR).overflow_free_p (vr0, vr1)
+  && get_range_query (cfun)->range_of_expr (vr3, @3)
+  && (TYPE_UNSIGNED (type)
+  || (vr0.nonnegative_p () && vr3.nonnegative_p ())
+  || (vr0.nonpositive_p () && vr3.nonpositive_p ()));
+
+ /* Try check 'X + C' as 'X - -C' for unsigned.  */
+ if (!ok && TYPE_UNSIGNED (type) && c.sign_mask () < 0)
+   {
+ neg_c = true;
+ c = -c;
+ ok = is_rshift ? wi::ctz (c) >= n.to_shwi ()
+: wi::multiple_of_p (c, n,

Re: [PATCH V4 1/4] rs6000: build constant via li;rotldi

2023-08-18 Thread guojiufu via Gcc-patches



Hi Segher,

As discussed on "~" vs. "-",  "~" is correct for this patch.

I updated the patch according to Kewen's comments.

If ok,  I would commit to trunk.

BR,
Jeff (Jiufu Guo)


On 2023-07-04 11:28, Kewen.Lin wrote:

Hi Jeff,

on 2023/7/4 10:18, Jiufu Guo via Gcc-patches wrote:

Hi,

If a constant can be rotated to/from a positive or negative value
loadable by "li", then "li;rotldi" can be used to build the constant.
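
A tiny hand-written illustration (not taken from the patch): the
constant 0x8000000000000000 is a rotation of 1, so it can be built as

  li 9,1          # r9 = 1
  rotldi 9,9,63   # rotate left by 63 bits -> 0x8000000000000000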

Compared with the previous version:
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621961.html
This patch just did minor changes to the style and comments.

Bootstrap and regtest pass on ppc64{,le}.

Since the previous version is approved with conditions, this version
explained the concern too.  If no objection, I would like to apply
this patch to trunk.


BR,
Jeff (Jiufu)

gcc/ChangeLog:

	* config/rs6000/rs6000.cc (can_be_built_by_li_and_rotldi): New function.
	(rs6000_emit_set_long_const): Call can_be_built_by_li_and_rotldi.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/const-build.c: New test.
---
 gcc/config/rs6000/rs6000.cc                   | 47 ++++++++++++++++++---
 .../gcc.target/powerpc/const-build.c          | 57 +++++++++++++++++++++++
 2 files changed, 98 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/const-build.c

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 42f49e4a56b..acc332acc05 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -10258,6 +10258,31 @@ rs6000_emit_set_const (rtx dest, rtx source)
   return true;
 }

+/* Check if value C can be built by 2 instructions: one is 'li', another is
+   rotldi.

Nit: different style, li is with "'" but rotldi isn't.


+
+   If so, *SHIFT is set to the shift operand of rotldi(rldicl), and *MASK
+   is set to the mask operand of rotldi(rldicl), and return true.
+   Return false otherwise.  */
+
+static bool
+can_be_built_by_li_and_rotldi (HOST_WIDE_INT c, int *shift,
+  HOST_WIDE_INT *mask)
+{
+  /* If C or ~C contains at least 49 successive zeros, then C can be rotated
+     to/from a positive or negative value that 'li' is able to load.  */

+  int n;
+  if (can_be_rotated_to_lowbits (c, 15, &n)
+      || can_be_rotated_to_lowbits (~c, 15, &n))
+{
+  *mask = HOST_WIDE_INT_M1;
+  *shift = HOST_BITS_PER_WIDE_INT - n;
+  return true;
+}
+
+  return false;
+}
+
 /* Subroutine of rs6000_emit_set_const, handling PowerPC64 DImode.
Output insns to set DEST equal to the constant C as a series of
lis, ori and shl instructions.  */
@@ -10266,15 +10291,14 @@ static void
 rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
 {
   rtx temp;
+  int shift;
+  HOST_WIDE_INT mask;
   HOST_WIDE_INT ud1, ud2, ud3, ud4;

   ud1 = c & 0xffff;
-  c = c >> 16;
-  ud2 = c & 0xffff;
-  c = c >> 16;
-  ud3 = c & 0xffff;
-  c = c >> 16;
-  ud4 = c & 0xffff;
+  ud2 = (c >> 16) & 0xffff;
+  ud3 = (c >> 32) & 0xffff;
+  ud4 = (c >> 48) & 0xffff;

   if ((ud4 == 0xffff && ud3 == 0xffff && ud2 == 0xffff && (ud1 & 0x8000))
       || (ud4 == 0 && ud3 == 0 && ud2 == 0 && ! (ud1 & 0x8000)))
@@ -10305,6 +10329,17 @@ rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
       emit_move_insn (dest, gen_rtx_XOR (DImode, temp,
					 GEN_INT ((ud2 ^ 0xffff) << 16)));
 }
+  else if (can_be_built_by_li_and_rotldi (c, &shift, &mask))
+{
+  temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
+  unsigned HOST_WIDE_INT imm = (c | ~mask);
+      imm = (imm >> shift) | (imm << (HOST_BITS_PER_WIDE_INT - shift));

+
+  emit_move_insn (temp, GEN_INT (imm));
+  if (shift != 0)
+   temp = gen_rtx_ROTATE (DImode, temp, GEN_INT (shift));
+  emit_move_insn (dest, temp);
+}
   else if (ud3 == 0 && ud4 == 0)
 {
   temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
diff --git a/gcc/testsuite/gcc.target/powerpc/const-build.c 
b/gcc/testsuite/gcc.target/powerpc/const-build.c

new file mode 100644
index 000..69b37e2bb53
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/const-build.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-require-effective-target has_arch_ppc64 } */
+
+/* Verify that two instructions are sucessfully used to build constants.

s/sucessfully/successfully/

+   One insn is li or lis, another is rotate: rldicl, rldicr or rldic.  */


Nit: This patch is for insn li + insn rldicl only, you probably want to
keep consistent in the comments.

The others look good to me, thanks!

Segher had one question on "~c" before; I saw you had explained it, and it
makes sense to me, but in case he has more questions I'd defer the final
approval to him.

BR,
Kewen


Re: [PATCH V5 2/2] Optimize '(X - N * M) / N' to 'X / N - M' if valid

2023-08-06 Thread guojiufu via Gcc-patches



Hi,

Gentle ping...

On 2023-07-18 22:05, Jiufu Guo wrote:

Hi,

Integer expression "(X - N * M) / N" can be optimized to "X / N - M"
if there is no wrap/overflow/underflow and "X - N * M" has the same
sign as "X".

Compared with the previous version:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624067.html
- APIs: overflow, nonnegative_p and nonpositive_p are moved close
  to value range.
- Use above APIs in match.pd.

Bootstrap & regtest pass on ppc64{,le} and x86_64.
Is this patch ok for trunk?

BR,
Jeff (Jiufu Guo)

PR tree-optimization/108757

gcc/ChangeLog:

* match.pd ((X - N * M) / N): New pattern.
((X + N * M) / N): New pattern.
((X + C) div_rshift N): New pattern.

gcc/testsuite/ChangeLog:

* gcc.dg/pr108757-1.c: New test.
* gcc.dg/pr108757-2.c: New test.
* gcc.dg/pr108757.h: New test.

---
 gcc/match.pd  |  85 +++
 gcc/testsuite/gcc.dg/pr108757-1.c |  18 +++
 gcc/testsuite/gcc.dg/pr108757-2.c |  19 +++
 gcc/testsuite/gcc.dg/pr108757.h   | 233 ++
 4 files changed, 355 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr108757-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr108757-2.c
 create mode 100644 gcc/testsuite/gcc.dg/pr108757.h

diff --git a/gcc/match.pd b/gcc/match.pd
index 8543f777a28..39dbb0567dc 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -942,6 +942,91 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 #endif


+#if GIMPLE
+(for div (trunc_div exact_div)
+ /* Simplify (t + M*N) / N -> t / N + M.  */
+ (simplify
+  (div (plus:c@4 @0 (mult:c@3 @1 @2)) @2)
+  (with {value_range vr0, vr1, vr2, vr3, vr4;}
+  (if (INTEGRAL_TYPE_P (type)
+   && get_range_query (cfun)->range_of_expr (vr1, @1)
+   && get_range_query (cfun)->range_of_expr (vr2, @2)
+   && range_op_handler (MULT_EXPR).overflow_free_p (vr1, vr2)
+   && get_range_query (cfun)->range_of_expr (vr0, @0)
+   && get_range_query (cfun)->range_of_expr (vr3, @3)
+   && range_op_handler (PLUS_EXPR).overflow_free_p (vr0, vr3)
+   && get_range_query (cfun)->range_of_expr (vr4, @4)
+       && (TYPE_UNSIGNED (type)
+	   || (vr0.nonnegative_p () && vr4.nonnegative_p ())
+	   || (vr0.nonpositive_p () && vr4.nonpositive_p ())))
+  (plus (div @0 @2) @1))))
+
+ /* Simplify (t - M*N) / N -> t / N - M.  */
+ (simplify
+  (div (minus@4 @0 (mult:c@3 @1 @2)) @2)
+  (with {value_range vr0, vr1, vr2, vr3, vr4;}
+  (if (INTEGRAL_TYPE_P (type)
+   && get_range_query (cfun)->range_of_expr (vr1, @1)
+   && get_range_query (cfun)->range_of_expr (vr2, @2)
+   && range_op_handler (MULT_EXPR).overflow_free_p (vr1, vr2)
+   && get_range_query (cfun)->range_of_expr (vr0, @0)
+   && get_range_query (cfun)->range_of_expr (vr3, @3)
+   && range_op_handler (MINUS_EXPR).overflow_free_p (vr0, vr3)
+   && get_range_query (cfun)->range_of_expr (vr4, @4)
+       && (TYPE_UNSIGNED (type)
+	   || (vr0.nonnegative_p () && vr4.nonnegative_p ())
+	   || (vr0.nonpositive_p () && vr4.nonpositive_p ())))
+  (minus (div @0 @2) @1)))))
+
+/* Simplify
+   (t + C) / N -> t / N + C / N where C is multiple of N.
+   (t + C) >> N -> t >> N + C>>N if low N bits of C is 0.  */
+(for op (trunc_div exact_div rshift)
+ (simplify
+  (op (plus@3 @0 INTEGER_CST@1) INTEGER_CST@2)
+   (with
+{
+  wide_int c = wi::to_wide (@1);
+  wide_int n = wi::to_wide (@2);
+  bool is_rshift = op == RSHIFT_EXPR;
+  bool neg_c = false;
+  bool ok = false;
+  value_range vr0;
+  if (INTEGRAL_TYPE_P (type)
+ && get_range_query (cfun)->range_of_expr (vr0, @0))
+{
+ ok = is_rshift ? wi::ctz (c) >= n.to_shwi ()
+: wi::multiple_of_p (c, n, TYPE_SIGN (type));
+ value_range vr1, vr3;
+ ok = ok && get_range_query (cfun)->range_of_expr (vr1, @1)
+  && range_op_handler (PLUS_EXPR).overflow_free_p (vr0, vr1)
+  && get_range_query (cfun)->range_of_expr (vr3, @3)
+  && (TYPE_UNSIGNED (type)
+  || (vr0.nonnegative_p () && vr3.nonnegative_p ())
+  || (vr0.nonpositive_p () && vr3.nonpositive_p ()));
+
+ /* Try check 'X + C' as 'X - -C' for unsigned.  */
+ if (!ok && TYPE_UNSIGNED (type) && c.sign_mask () < 0)
+   {
+ neg_c = true;
+ c = -c;
+ ok = is_rshift ? wi::ctz (c) >= n.to_shwi ()
+: wi::multiple_of_p (c, n, UNSIGNED);
+ ok = ok && wi::geu_p (vr0.lower_bound (), c);
+   }
+   }
+}
+   (if (ok)
+   (with
+{
+  wide_int m;
+  m = is_rshift ? wi::rshift (c, n, TYPE_SIGN (type))
+   : wi::div_trunc (c, n, TYPE_SIGN (type));
+  m = neg_c ? -m : m;
+}
+   (plus (op @0 @2) { wide_int_to_tree (type, m); }))))))
+#endif
+
 (for op (negate abs)
  /* Simplify cos(-x) and cos(|x|) -> cos(x).  Similarly for cosh.  */
  (for 

Re: [RFC] light expander sra for parameters and returns

2023-08-02 Thread guojiufu via Gcc-patches

On 2023-08-02 20:41, Richard Biener wrote:

On Tue, 1 Aug 2023, Jiufu Guo wrote:



Hi,

Richard Biener  writes:

> On Mon, 24 Jul 2023, Jiufu Guo wrote:
>
>>
>> Hi Martin,
>>
>> Not sure about your current option about re-using the ipa-sra code
>> in the light-expander-sra. And if anything I could input please
>> let me know.
>>
>> And I'm thinking about the difference between the expander-sra, ipa-sra
>> and tree-sra. 1. For stmts walking, expander-sra has special behavior
>> for return-stmt, and also a little special on assign-stmt. And phi
>> stmts are not checked by ipa-sra/tree-sra. 2. For the access structure,
>> I'm also thinking if we need a tree structure; it would be useful when
>> checking overlaps, it was not used now in the expander-sra.
>>
>> For ipa-sra and tree-sra, I notice that there is some similar code,
>> but of cause there are differences. While it seems the difference
>> is 'intended', for example: 1. when creating and accessing,
>> 'size != max_size' is acceptable in tree-sra but not for ipa-sra.
>> 2. 'AGGREGATE_TYPE_P' for ipa-sra is accepted for some cases, but
>> not ok for tree-ipa.
>> I'm wondering if those slight difference blocks re-use the code
>> between ipa-sra and tree-sra.
>>
>> The expander-sra may be more light, for example, maybe we can use
>> FOR_EACH_IMM_USE_STMT to check the usage of each parameter, and not
>> need to walk all the stmts.
>
> What I was hoping for is shared stmt-level analysis and a shared
> data structure for the "access"(es) a stmt performs.  Because that
> can come up handy in multiple places.  The existing SRA data
> structures could easily embed that subset for example if sharing
> the whole data structure of [IPA] SRA seems too unwieldly.

Understand.
The stmt-level analysis and "access" data structure are similar
between ipa-sra/tree-sra and the expander-sra.
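
For reference, a rough sketch of the shared shape (field names are
hypothetical, modeled on the existing tree-sra "struct access"):

  struct access
  {
    HOST_WIDE_INT offset;  /* bit offset of the piece within the base */
    HOST_WIDE_INT size;    /* bit size of the piece */
    tree base;             /* the parameter or return decl */
    tree type;             /* type of the accessed piece */
    bool write;            /* store vs. load */
  };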

I just updated the patch; this version does not change the behavior of
the previous version.  It only cleans up and merges some functions.

The patch is attached.

This version (and tree-sra/ipa-sra) still uses similar "stmt analysis"
and "access struct" code.  This could be extracted as shared code.
I'm planning to update the code to use the same "base_access" and
"walk function".

>
> With a stmt-leve API using FOR_EACH_IMM_USE_STMT would still be
> possible (though RTL expansion pre-walks all stmts anyway).

Yeap, I also notice that "FOR_EACH_IMM_USE_STMT" is not enough.
For struct parameters, walking stmt is needed.


I think I mentioned this before, RTL expansion already
pre-walks the whole function looking for variables it has to
expand to the stack in discover_nonconstant_array_refs (which is
now badly named), I'd appreciate if the "SRA" walk would piggy-back
on that existing walk.


Yes.  I also had a look at discover_nonconstant_array_refs; it seems
this function takes care only of 'call_internal' and 'vdef' stmts for
array accesses, but SRA cares more about 'assign/call'.
The only thing in common between these two stmt walks is the loop header.

  FOR_EACH_BB_FN (bb, cfun)
    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
  {
gimple *stmt = gsi_stmt (gsi);

So, the existing walk is not used.
Another reason to have a new walk is that: the sra walk code may be
shared for tree-sra/ipa-sra.



For RTL expansion I think a critical part is to create accesses
based on the incoming/outgoing RTL which is specified by the ABI.
As I understand we are optimizing the argument setup code which
assigns the incoming arguments to either pseudo(s) or the stack
and thus we get to choose an optimized "mode" for that virtual
location of the incoming arguments (but we can't alter their
hardregs/stack assignment obviously).


Yes, this is what I'm trying to do.
It is "set_scalar_rtx_for_aggregate_access", which is called after the
incoming arguments are set up, and which then assigns the incoming hard
registers to pseudo(s).  Those pseudo(s) are the scalarized rtx for the
argument.


 So when we have an
incoming register pair we should create an artificial access
for the pieces those two registers represent.

You seem to do quite some adjustment to the parameter setup
where I was hoping we get away with simply choosing a different
mode for the virtual argument representation?


I inserted the code in the parameter setup, where the incoming registers
are computed, to assign the incoming regs to scalar pseudo(s).
(The copies of the incoming registers to the stack would be optimized out
by RTL passes; yes, it would be better to avoid generating them.)



But I'm not too familiar with the innards of parameter/return
value initial RTL expansion.  I hope somebody else can chime
in here as well.


Thanks so much for your very helpful comments!

BR,
Jeff (Jiufu Guo)



Richard.




BR,
Jeff (Jiufu Guo)

-
diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index edf292cfbe9..8c36ad5df79 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -97,6 +97,502 @@ 

Re: [PATCH 4/4] rs6000: build constant via li/lis;rldic

2023-06-15 Thread guojiufu via Gcc-patches

On 2023-06-13 17:18, Jiufu Guo via Gcc-patches wrote:

Hi David,

Thanks for your valuable comments!

David Edelsohn  writes:



...
Do you have any measurement of how expensive it is to test all of
these additional methods to generate a constant?  How much does this
affect the compile time?


Yeap, Thanks for this very good question!
This patch mostly uses bitwise operations and if-conditions,
so it is expected not to be expensive.

Testcases were checked.  For example:
A case with ~1000 constants: most of them hit this feature.
With this feature, the compiling time is slightly faster.

0m1.985s(without patch) vs. 0m1.874s(with patch)
(Note: rs6000_emit_set_long_const does not occur in hot perf
functions, so the small time saving would not be directly caused
by this feature.)

A case with ~1000 constants:(most are not hit by this feature)
0m2.493s(without patch) vs. 0m2.558s(with patch).


Typo, this should be:
0m2.493s(with patch) vs. 0m2.558s(without patch).

It is also faster with the patch :)

BR,
Jeff (Jiufu Guo)



For runtime, actually, with the patch there seems to be no visible
improvement in SPEC2017.  Still, I feel this patch is doing the
right thing: using fewer instructions to build the constant.

BR,
Jeff (Jiufu Guo)



Thanks, David





Re: [PATCH] Make sure SCALAR_INT_MODE_P before invoke try_const_anchors

2023-06-09 Thread guojiufu via Gcc-patches

Hi,

On 2023-06-09 16:00, Richard Biener wrote:

On Fri, 9 Jun 2023, Jiufu Guo wrote:


Hi,

While checking the code, I noticed there is a
"gcc_assert (SCALAR_INT_MODE_P (mode))" in "try_const_anchors".
This assert seems correct because the function try_const_anchors only
cares about integer values currently, and modes other than
SCALAR_INT_MODE_P do not need to be supported.

This patch makes sure the mode satisfies SCALAR_INT_MODE_P when calling
try_const_anchors.
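
(For background, an illustration of what const anchors do -- the values
here are hypothetical: if (reg:DI 100) is known to hold 0x12348000 and
the constant 0x12348010 is needed, try_const_anchors lets CSE try
(plus:DI (reg:DI 100) (const_int 16)) instead of materializing the
constant again.  That arithmetic only makes sense for scalar integer
modes.)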


This issue was found when drafting the patch below:
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603530.html
With that patch, "{[%1:DI]=0;} stack_tie" with BLKmode runs into
try_const_anchors and hits the assert/ICE.

Bootstrap and regtest pass on ppc64{,le} and x86_64.
Is this ok for trunk?


Iff the correct fix at all (how can a CONST_INT have BLKmode?) then
I suggest to instead fix try_const_anchors to change

  /* CONST_INT is used for CC modes, but we should leave those alone.  */

  if (GET_MODE_CLASS (mode) == MODE_CC)
return NULL_RTX;

  gcc_assert (SCALAR_INT_MODE_P (mode));

to

  /* CONST_INT is used for CC modes, leave any non-scalar-int mode alone.  */

  if (!SCALAR_INT_MODE_P (mode))
return NULL_RTX;



This is also able to fix the issue.  There was a "Punt on CC modes" patch
to return NULL_RTX in try_const_anchors.


but as said I wonder how we arrive at a BLKmode CONST_INT and whether
we should have fended this off earlier.  Can you share more complete
RTL of that stack_tie?



(insn 15 14 16 3 (parallel [
(set (mem/c:BLK (reg/f:DI 1 1) [1  A8])
(const_int 0 [0]))
]) "/home/guojiufu/temp/gdb.c":13:3 922 {stack_tie}
 (nil))

It is "(set (mem/c:BLK (reg/f:DI 1 1)) (const_int 0 [0]))".

This is generated by:

rs6000.md
(define_expand "restore_stack_block"
  [(set (match_dup 2) (match_dup 3))
   (set (match_dup 4) (match_dup 2))
   (match_dup 5)
   (set (match_operand 0 "register_operand")
(match_operand 1 "register_operand"))]
  ""
{
  rtvec p;

  operands[1] = force_reg (Pmode, operands[1]);
  operands[2] = gen_reg_rtx (Pmode);
  operands[3] = gen_frame_mem (Pmode, operands[0]);
  operands[4] = gen_frame_mem (Pmode, operands[1]);
  p = rtvec_alloc (1);
  RTVEC_ELT (p, 0) = gen_rtx_SET (gen_frame_mem (BLKmode, operands[0]),
  const0_rtx);
  operands[5] = gen_rtx_PARALLEL (VOIDmode, p);
})

This kind of case (like BLKmode with const0) is rare, but this is intended
RTL, and seems valid.

Thanks so much for your quick and very helpful comments!!

BR,
Jeff (Jiufu Guo)






BR,
Jeff (Jiufu Guo)

gcc/ChangeLog:

* cse.cc (cse_insn): Add SCALAR_INT_MODE_P condition.

---
 gcc/cse.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/cse.cc b/gcc/cse.cc
index 2bb63ac4105..f213fa0faf7 100644
*** a/gcc/cse.cc
--- b/gcc/cse.cc
***************
*** 5003,5009 ****
    if (targetm.const_anchor
        && !src_related
        && src_const
!       && GET_CODE (src_const) == CONST_INT)
      {
        src_related = try_const_anchors (src_const, mode);
        src_related_is_const_anchor = src_related != NULL_RTX;
--- 5003,5010 ----
    if (targetm.const_anchor
        && !src_related
        && src_const
!       && GET_CODE (src_const) == CONST_INT
!       && SCALAR_INT_MODE_P (mode))
      {
        src_related = try_const_anchors (src_const, mode);
        src_related_is_const_anchor = src_related != NULL_RTX;
2.39.3




Re: [PATCH V5] Use reg mode to move sub blocks for parameters and returns

2023-06-06 Thread guojiufu via Gcc-patches

Hi,

On 2023-06-05 00:59, Jeff Law wrote:

On 5/9/23 07:43, Jiufu Guo wrote:


Thanks for pointing this out!  Yes, a BLKmode rtx may not always be a MEM.
MEM_SIZE is only OK for a MEM after its size has been computed.
Here MEM_SIZE is fine just because it is a stack rtx corresponding
to the type of the parameter/return, which has been computed.

I updated the patch to resolve the conflicts with trunk, re-ran bootstrap,
and prepared a new version.

This version passes bootstrap and regtest on ppc64{,le} and x86_64.

The major change is that 'move_sub_blocks' only handles the case where
the block can be moved by the same submode, or say (size % sub_size)
is 0.  If there is no objection, I would commit the new version.
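
For illustration (hypothetical types, not from the testsuite):

  typedef struct { double a[3]; } A;     /* 24 bytes: 24 % 8 == 0, moved
					    as three 8-byte sub-blocks */
  typedef struct { double d; int i; } B; /* 12 bytes: 12 % 8 != 0, left
					    to the existing path */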

BR,
Jeff (Jiufu)

gcc/ChangeLog:

	* cfgexpand.cc (expand_used_vars): Update to mark DECL_USEDBY_RETURN_P
	for returns.
* expr.cc (move_sub_blocks): New function.
(expand_assignment): Update assignment code about returns/parameters.
* function.cc (assign_parm_setup_block): Update to mark
DECL_REGS_TO_STACK_P for parameter.
* tree-core.h (struct tree_decl_common): Add comment.
* tree.h (DECL_USEDBY_RETURN_P): New define.
(DECL_REGS_TO_STACK_P): New define.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr65421-1.c: New test.
* gcc.target/powerpc/pr65421-2.c: New test.

I don't think this was ever explicitly ACK'd.  OK for the trunk.


Thanks so much! And sorry for the late reply.
I'm trying to investigate another patch that may fix other PRs and could
also handle this issue.  So, I may suspend this one in favor of the new
patch.


BR,
Jeff (Jiufu Guo)



jeff


Re: [PATCH V4 2/2] rs6000: use li;x?oris to build constant

2023-05-16 Thread guojiufu via Gcc-patches

Hi,

On 2023-05-15 14:53, Kewen.Lin wrote:

Hi Jeff,

on 2022/12/12 09:38, Jiufu Guo wrote:

Hi,

For constant C:
If '(c & 0xFFFFFFFF8000FFFFULL) == 0xFFFFFFFF00000000ULL', or say:
32(1) || 1(0) || 15(x) || 16(0), we could use "lis; xoris" to build.

Here N(M) means N continuous bits of M; x for M means either 1 or 0 is
OK; '||' means concatenation.
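
A worked example (the value is hypothetical but matches the pattern):

  c = 0xffffffff12340000
  lis 9,0x9234      # r9 = 0xffffffff92340000 ((0x1234|0x8000) << 16,
		    #   sign-extended)
  xoris 9,9,0x8000  # flip bit 31 -> 0xffffffff12340000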

This patch updates rs6000_emit_set_long_const to support those constants.


Compared with the previous version:
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607618.htm
this patch fixes conflicts with trunk.

Bootstrap and regtest pass on ppc64{,le}.

Is this ok for trunk?


OK for trunk, thanks for improving this.

btw, the test case needs to be updated a bit as the function names in the
context changed upstream, please ensure it's tested well before committing,
thanks!


Yeap! Retested and verified.
Thanks so much for your always insightful reviews and helpful comments!

Committed via r14-923-g5eb7d560626e42.

BR,
Jeff (Jiufu)





BR,
Jeff (Jiufu)


PR target/106708

gcc/ChangeLog:

* config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add to build
constants through "lis; xoris".


Maybe s/Add to build/Support building/

Yes :)



BR,
Kewen



gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr106708.c: Add test function.

---
 gcc/config/rs6000/rs6000.cc |  7 +++
 gcc/testsuite/gcc.target/powerpc/pr106708.c | 10 +-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 8c1192a10c8..1138d5e8cd4 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -10251,6 +10251,13 @@ rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
   if (ud1 != 0)
emit_move_insn (dest, gen_rtx_IOR (DImode, temp, GEN_INT (ud1)));
 }
+  else if (ud4 == 0xffff && ud3 == 0xffff && !(ud2 & 0x8000) && ud1 == 0)
+    {
+      /* lis; xoris */
+      temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
+      emit_move_insn (temp, GEN_INT (sext_hwi ((ud2 | 0x8000) << 16, 32)));
+      emit_move_insn (dest, gen_rtx_XOR (DImode, temp,
+					 GEN_INT (0x80000000)));
+    }
   else if (ud4 == 0xffff && ud3 == 0xffff && (ud1 & 0x8000))
 {
   /* li; xoris */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr106708.c 
b/gcc/testsuite/gcc.target/powerpc/pr106708.c

index dc9ceda8367..a015c71e630 100644
--- a/gcc/testsuite/gcc.target/powerpc/pr106708.c
+++ b/gcc/testsuite/gcc.target/powerpc/pr106708.c
@@ -4,7 +4,7 @@
 /* { dg-require-effective-target has_arch_ppc64 } */

 long long arr[]
-  = {0xffffffff7cdeab55LL, 0x98765432LL, 0xabcd0000LL};
+  = {0xffffffff7cdeab55LL, 0x98765432LL, 0xabcd0000LL, 0xffffffff65430000LL};


 void __attribute__ ((__noipa__)) lixoris (long long *arg)
 {
@@ -27,6 +27,13 @@ void __attribute__ ((__noipa__)) lisrldicl (long long *arg)
 /* { dg-final { scan-assembler-times {\mlis .*,0xabcd\M} 1 } } */
 /* { dg-final { scan-assembler-times {\mrldicl .*,0,32\M} 1 } } */

+void __attribute__ ((__noipa__)) lisxoris (long long *arg)
+{
+  *arg = 0xffffffff65430000LL;
+}
+/* { dg-final { scan-assembler-times {\mlis .*,0xe543\M} 1 } } */
+/* { dg-final { scan-assembler-times {\mxoris .*0x8000\M} 1 } } */
+
 int
 main ()
 {
@@ -35,6 +42,7 @@ main ()
   lixoris (a);
   lioris (a + 1);
   lisrldicl (a + 2);
+  lisxoris (a + 3);
   if (__builtin_memcmp (a, arr, sizeof (arr)) != 0)
 __builtin_abort ();
   return 0;


Re: [PATCH V3] rs6000: Load high and low part of 64bit constant independently

2023-05-07 Thread guojiufu via Gcc-patches

Hi,

On 2023-04-26 17:35, Kewen.Lin wrote:

Hi Jeff,

on 2023/1/4 14:51, Jiufu Guo wrote:

Hi,

Compared with the previous version, this patch updates the comments only.
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608293.html

For a complicated 64-bit constant, below is one instruction sequence to
build it:
lis 9,0x800a
ori 9,9,0xabcd
sldi 9,9,32
oris 9,9,0xc167
ori 9,9,0xfa16

while we can also use below sequence to build:
lis 9,0xc167
lis 10,0x800a
ori 9,9,0xfa16
ori 10,10,0xabcd
rldimi 9,10,32,0
This sequence uses two registers to build the high and low parts first,
and then merges them.

In terms of parallelism, this sequence would be faster.  (Of course, it
uses one more register, with potential register pressure.)

The two-register instruction sequence for the parallel version can be
generated only if can_create_pseudo_p.  Otherwise, the one-register
sequence is generated.

Bootstrap and regtest pass on ppc64{,le}.
Is this ok for trunk?


OK for trunk, thanks for the improvement!


Thanks! Committed via r14-555-gb05b529125fa51.

BR,
Jeff (Jiufu)



BR,
Kewen




BR,
Jeff(Jiufu)


gcc/ChangeLog:

* config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Generate
more parallel code if can_create_pseudo_p.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/parall_5insn_const.c: New test.



Re: [PATCH V5] Use reg mode to move sub blocks for parameters and returns

2023-05-03 Thread guojiufu via Gcc-patches

Hi,

On 2023-05-01 23:52, Segher Boessenkool wrote:

Hi!

On Fri, Mar 17, 2023 at 11:39:52AM +0800, Jiufu Guo wrote:

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr65421-1.c: New test.
* gcc.target/powerpc/pr65421.c: New test.


Please name the tests something else?  -1.c and -2.c maybe.  Or
something more inspired.  Just not something that makes the less
important of the (so far) two testcases look more important than it is.


Right!  Thanks for pointing out this!

BR,
Jeff (Jiufu)



The testcases are fine otherwise, thanks!


Segher


Re: [PATCH V5] Use reg mode to move sub blocks for parameters and returns

2023-05-03 Thread guojiufu via Gcc-patches

Hi,

On 2023-05-01 03:00, Jeff Law wrote:

On 3/16/23 21:39, Jiufu Guo wrote:

Hi,

When assigning a parameter to a variable, or assigning a variable to
return value with struct type, and the parameter/return is passed
through registers.
For this kind of case, it would be better to use the natural mode of
the registers to move the content for the assignment.

As the example code (like code in PR65421):

typedef struct SA {double a[3];} A;
A ret_arg_pt (A *a) {return *a;} // on ppc64le, expect only 3 lfd(s)
A ret_arg (A a) {return a;} // just empty fun body
void st_arg (A a, A *p) {*p = a;} //only 3 stfd(s)

Compared with the previous version:
https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609394.html
this version refines the code to eliminate redundant code in the
subroutine "move_sub_blocks".

Bootstrap and regtest pass on ppc64{,le}.
Is this ok for trunk?


...


diff --git a/gcc/expr.cc b/gcc/expr.cc
index 15be1c8db99..97a7be9542e 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -5559,6 +5559,41 @@ mem_ref_refers_to_non_mem_p (tree ref)
return non_mem_decl_p (base);
  }
+/* Subroutine of expand_assignment, invoked when assigning from a
+   parameter or assigning to a return val on struct type which may
+   be passed through registers.  The mode of the register is used to
+   move the content for the assignment.
+
+   This routine generates code for expression FROM which is BLKmode,
+   and moves the generated content to TO_RTX by sub-blocks in SUB_MODE.  */

+
+static void
+move_sub_blocks (rtx to_rtx, tree from, machine_mode sub_mode)
+{
+  gcc_assert (MEM_P (to_rtx));
+
+  HOST_WIDE_INT size = MEM_SIZE (to_rtx).to_constant ();

Consider the case of a BLKmode return value.  Isn't TO_RTX in this
case a BLKmode object?


Thanks for this question!

Yes, the mode of TO_RTX is BLKmode.
As we know, when the function returns via registers, the mode of
the `return-rtx` could also be BLKmode.  This patch is going to
improve these kinds of cases.

For example:
```
typedef struct FLOATS
{
  double a[3];
} FLOATS;
FLOATS ret_arg_pt (FLOATS *a){return *a;}
```

D.3952 = *a_2(D); // this patch enhances this assignment
return D.3952;

The rtx for `D.3952` is BLKmode, and so is the rtx for
"DECL_RESULT (current_function_decl)".  The DECL_RESULT represents
the return registers.

BR,
Jeff (Jiufu)


It looks pretty good at this point.

jeff


Re: [PATCH] PR testsuite/106879 FAIL: gcc.dg/vect/bb-slp-layout-19.c on powerpc64

2023-04-19 Thread guojiufu via Gcc-patches

Hi Kewen,

On 2023-04-19 10:53, Kewen.Lin wrote:

Hi Jeff,

on 2023/4/19 10:03, Jiufu Guo wrote:

Hi,

On P7, option -mno-allow-movmisalign is added during testing, which
prevents slp happen on the case.

Like Like PR65484 and PR87306, this patch use vect_hw_misalig to guard
  Dup like...  ~~ missing the last character n.


Thanks as always for your helpful catches and comments!
Committed via r14-105-g57e7229a29ca0e.

BR,
Jeff (Jiufu)




the case on powerpc targets.

Tested on ppc64{le,} and x86_64.
Is this ok for trunk?

BR,
Jeff (Jiufu)

gcc/testsuite/ChangeLog:

PR testsuite/106879
* gcc.dg/vect/bb-slp-layout-19.c: Modify to guard the check with
vect_hw_misalig on POWERs.

...   ~ Same here.

OK for trunk with these nits fixed, thanks!

BR,
Kewen



---
 gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c

index f075a83a25b..faf98e8d3c0 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c
@@ -31,4 +31,9 @@ void f()
   e[3] = b3;
 }

-/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 3 "slp1" { target { vect_int_mult && vect_perm } } } } */
+/* On older powerpc hardware (POWER7 and earlier), the default flag
+   -mno-allow-movmisalign prevents vectorization.  On POWER8 and later,
+   when vect_hw_misalign is true, vectorization occurs.  For other
+   targets, ! vect_no_align is a sufficient test.  */
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 3 "slp1" { target { { vect_int_mult && vect_perm } && { { ! powerpc*-*-* } || { vect_hw_misalign } } } } } } */


Re: [PATCH] testsuite: update builtins-5-p9-runnable.c for BE

2023-04-16 Thread guojiufu via Gcc-patches

On 2023-04-14 17:09, Kewen.Lin wrote:

Hi Jeff,

on 2023/4/14 16:01, guojiufu wrote:

On 2023-04-14 15:30, Jiufu Guo wrote:

Hi,

As PR108809 mentioned, vec_xl_len_r and vec_xst_len_r are tested
in gcc.target/powerpc/builtins-5-p9-runnable.c.
The vector operands of these two bifs differ between BE and LE when
viewed as v16_int8, even though they are the same when viewed as
128 bits (uint128/V1TI).
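
For example (the size == 2 case from the test): viewed as v16_int8, the
LE result is {2, 1, 0, ..., 0} while the BE result is {0, ..., 0, 1, 2},
although both are the same 128-bit value.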

The test case gcc.target/powerpc/builtins-5-p9-runnable.c was
written for LE environment, this patch updates it for BE.

Tested on ppc64 BE and LE.
Is this ok for trunk?

BR,
Jeff (Jiufu)

gcc/testsuite/ChangeLog:


    PR target/108809


s/target/testsuite/


* gcc.target/powerpc/builtins-5-p9-runnable.c: Updated.




s/Updated/Update for BE/

OK with these two nits fixed, thanks!


Thanks for the very helpful comments!
Committed via r13-7202-ga1f25e04b8d10b.


BR,
Jeff (Jiufu)



BR,
Kewen


Re: [PATCH] testsuite: update builtins-5-p9-runnable.c for BE

2023-04-14 Thread guojiufu via Gcc-patches

On 2023-04-14 15:30, Jiufu Guo wrote:

Hi,

As PR108809 mentioned, vec_xl_len_r and vec_xst_len_r are tested
in gcc.target/powerpc/builtins-5-p9-runnable.c.
The vector operands of these two bifs differ between BE and LE when
viewed as v16_int8, even though they are the same when viewed as
128 bits (uint128/V1TI).

The test case gcc.target/powerpc/builtins-5-p9-runnable.c was
written for LE environment, this patch updates it for BE.

Tested on ppc64 BE and LE.
Is this ok for trunk?

BR,
Jeff (Jiufu)

gcc/testsuite/ChangeLog:


PR target/108809

* gcc.target/powerpc/builtins-5-p9-runnable.c: Updated.


Add missing PR number.

BR,
Jeff (Jiufu)



---
 .../powerpc/builtins-5-p9-runnable.c  | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/gcc/testsuite/gcc.target/powerpc/builtins-5-p9-runnable.c b/gcc/testsuite/gcc.target/powerpc/builtins-5-p9-runnable.c
index 14e935513fe..1a5f1d6383a 100644
--- a/gcc/testsuite/gcc.target/powerpc/builtins-5-p9-runnable.c
+++ b/gcc/testsuite/gcc.target/powerpc/builtins-5-p9-runnable.c
@@ -78,8 +78,13 @@ int main() {
size = 8;
vec_uc_result1 = vec_xl_len_r(data_uc, size);

+#ifdef __LITTLE_ENDIAN__
vec_uc_expected1 = (vector unsigned char){8, 7, 6, 5, 4, 3, 2, 1,
 0, 0, 0, 0, 0, 0, 0, 0,};
+#else
+   vec_uc_expected1 = (vector unsigned char){0, 0, 0, 0, 0, 0, 0, 0,
+1, 2, 3, 4, 5, 6, 7, 8,};
+#endif

if (result_wrong (vec_uc_expected1, vec_uc_result1))
  {
@@ -107,8 +112,13 @@ int main() {
size = 4;
vec_uc_result1 = vec_xl_len_r(data_uc, size);

+#ifdef __LITTLE_ENDIAN__
    vec_uc_expected1 = (vector unsigned char){ 4, 3, 2, 1, 0, 0, 0, 0,
					       0, 0, 0, 0, 0, 0, 0, 0 };
+#else
+   vec_uc_expected1 = (vector unsigned char){ 0, 0, 0, 0, 0, 0, 0, 0,
+					      0, 0, 0, 0, 1, 2, 3, 4 };

+#endif

if (result_wrong (vec_uc_expected1, vec_uc_result1))
  {
@@ -135,8 +145,13 @@ int main() {
size = 2;
vec_uc_result1 = vec_xl_len_r(data_uc, size);

+#ifdef __LITTLE_ENDIAN__
    vec_uc_expected1 = (vector unsigned char){ 2, 1, 0, 0, 0, 0, 0, 0,
					       0, 0, 0, 0, 0, 0, 0, 0 };
+#else
+   vec_uc_expected1 = (vector unsigned char){ 0, 0, 0, 0, 0, 0, 0, 0,
+					      0, 0, 0, 0, 0, 0, 1, 2 };

+#endif

if (result_wrong (vec_uc_expected1, vec_uc_result1))
  {
@@ -231,8 +246,13 @@ int main() {
  }

/* VEC_XST_LEN_R */
+#ifdef __LITTLE_ENDIAN__
    vec_uc_expected1 = (vector unsigned char){ 16, 15, 14, 13, 12, 11, 10, 9,
					       8, 7, 6, 5, 4, 3, 2, 1 };
+#else
+   vec_uc_expected1 = (vector unsigned char){ 1, 2, 3, 4, 5, 6, 7, 8,
+ 9, 10, 11, 12, 13, 14, 15, 16 };
+#endif
store_data_uc = (vector unsigned char){ 1, 2, 3, 4, 5, 6, 7, 8,
   9, 10, 11, 12, 13, 14, 15, 16 };
vec_uc_result1 = (vector unsigned char){ 0, 0, 0, 0, 0, 0, 0, 0,
@@ -265,8 +285,13 @@ int main() {
 #endif
  }

+#ifdef __LITTLE_ENDIAN__
    vec_uc_expected1 = (vector unsigned char){ 2, 1, 0, 0, 0, 0, 0, 0,
					       0, 0, 0, 0, 0, 0, 0, 0 };
+#else
+   vec_uc_expected1 = (vector unsigned char){ 15, 16, 0, 0, 0, 0, 0, 0,
+					      0, 0, 0, 0, 0, 0, 0, 0 };

+#endif
store_data_uc = (vector unsigned char){ 1, 2, 3, 4, 5, 6, 7, 8,
   9, 10, 11, 12, 13, 14, 15, 16 };
vec_uc_result1 = (vector unsigned char){ 0, 0, 0, 0, 0, 0, 0, 0,
@@ -299,8 +324,13 @@ int main() {
 #endif
  }

+#ifdef __LITTLE_ENDIAN__
    vec_uc_expected1 = (vector unsigned char){ 16, 15, 14, 13, 12, 11, 10, 9,
					       8, 7, 6, 5, 4, 3, 2, 1 };

+#else
+   vec_uc_expected1 = (vector unsigned char){ 1, 2, 3, 4, 5, 6, 7, 8,
+ 9, 10, 11, 12, 13, 14, 15, 16 };
+#endif
store_data_uc = (vector unsigned char){ 1, 2, 3, 4, 5, 6, 7, 8,
   9, 10, 11, 12, 13, 14, 15, 16 };
vec_uc_result1 = (vector unsigned char){ 0, 0, 0, 0, 0, 0, 0, 0,
@@ -333,8 +363,13 @@ int main() {
 #endif
  }

+#ifdef __LITTLE_ENDIAN__
    vec_uc_expected1 = (vector unsigned char){ 14, 13, 12, 11, 10, 9, 8, 7,
					       6, 5, 4, 3, 2, 1, 0, 0 };
+#else
+   vec_uc_expected1 = (vector unsigned char){ 3, 4, 5, 6, 7, 8, 9, 10,
+					      11, 12, 13, 14, 15, 16, 0, 0 };

+#endif
store_data_uc = (vector unsigned char){ 1, 2, 3, 4, 5, 6, 7, 8,
   9, 10, 11, 12, 13, 14, 15, 16 };
vec_uc_result1 = (vector unsigned char){ 0, 0, 0, 0, 0, 0, 0, 0,


Re: [PATCH] testsuite: filter out warning noise for CWE-1341 test

2023-04-13 Thread guojiufu via Gcc-patches

Hi,

On 2023-04-13 20:08, Segher Boessenkool wrote:

On Thu, Apr 13, 2023 at 07:39:01AM +, Richard Biener wrote:

On Thu, 13 Apr 2023, Jiufu Guo wrote:
I think this should be fixed in the analyzer, "stripping" malloc
tracking from fopen/fclose since it does this manually.  I've adjusted
the bug accordingly.


Yeah.


> > +/* This case checks double-fclose only, suppress other warning.  */
> > +/* { dg-additional-options -Wno-analyzer-double-free } */


So please add "(PR108722)" or such to the comment here?  That is enough
for future people to see if this is still necessary, to maybe remove it
from the testcase here, but certainly not cargo-cult it to other
testcases!


Good suggestions, thanks!
Committed via r13-7176-gedc6659c97c4a7.

BR,
Jeff (Jiufu)



Thanks,


Segher


Re: [PATCH] testsuite: update requires for powerpc/float128-cmp2-runnable.c

2023-04-13 Thread guojiufu via Gcc-patches

Hi,

On 2023-04-12 20:47, Kewen.Lin wrote:

Hi Segher & Jeff,

on 2023/4/11 23:13, Segher Boessenkool wrote:

On Tue, Apr 11, 2023 at 05:40:09PM +0800, Kewen.Lin wrote:

on 2023/4/11 17:14, guojiufu wrote:

Thanks for raising this concern.
The behavior of checking the bif against FLOAT128_HW and emitting an error
message about the quad-precision requirement was added in gcc12. This is
why gcc12 fails to compile the case on -m32.

Before gcc12, altivec_resolve_overloaded_builtin returned the overloaded
result directly, and did not check the resulting function further.


Thanks for checking, I wonder which commit caused this behavior change and
what's the underlying justification?  I know there is one new bif handling
framework

I answered this question myself with some digging: the test case
float128-cmp2-runnable.c started to fail from r12-5752-gd08236359eb229,
which is exactly where the new bif framework took effect, and the reason
the behavior changed is the condition change from **TARGET_P9_VECTOR** to
**TARGET_FLOAT128_HW**.
**TARGET_FLOAT128_HW**.

With r12-5751-gc9dd01314d8467 (still old bif framework):

$ grep -r scalar_cmp_exp_qp gcc/config/rs6000/rs6000-builtin.def
BU_P9V_VSX_2 (VSCEQPGT, "scalar_cmp_exp_qp_gt", CONST,  
xscmpexpqp_gt_kf)
BU_P9V_VSX_2 (VSCEQPLT, "scalar_cmp_exp_qp_lt", CONST,  
xscmpexpqp_lt_kf)
BU_P9V_VSX_2 (VSCEQPEQ, "scalar_cmp_exp_qp_eq", CONST,  
xscmpexpqp_eq_kf)

BU_P9V_VSX_2 (VSCEQPUO, "scalar_cmp_exp_qp_unordered",  CONST,
xscmpexpqp_unordered_kf)
BU_P9V_OVERLOAD_2 (VSCEQPGT,"scalar_cmp_exp_qp_gt")
BU_P9V_OVERLOAD_2 (VSCEQPLT,"scalar_cmp_exp_qp_lt")
BU_P9V_OVERLOAD_2 (VSCEQPEQ,"scalar_cmp_exp_qp_eq")
BU_P9V_OVERLOAD_2 (VSCEQPUO,"scalar_cmp_exp_qp_unordered")

There were only 13 bifs requiring TARGET_FLOAT128_HW in old bif 
framework.


$ grep ^BU_FLOAT128_HW gcc/config/rs6000/rs6000-builtin.def
BU_FLOAT128_HW_VSX_1 (VSEEQP,   "scalar_extract_expq",  CONST,  
xsxexpqp_kf)
BU_FLOAT128_HW_VSX_1 (VSESQP,   "scalar_extract_sigq",  CONST,  
xsxsigqp_kf)
BU_FLOAT128_HW_VSX_1 (VSTDCNQP, "scalar_test_neg_qp",   CONST,  
xststdcnegqp_kf)
BU_FLOAT128_HW_VSX_2 (VSIEQP,   "scalar_insert_exp_q",  CONST,  
xsiexpqp_kf)
BU_FLOAT128_HW_VSX_2 (VSIEQPF,  "scalar_insert_exp_qp", CONST,  
xsiexpqpf_kf)

BU_FLOAT128_HW_VSX_2 (VSTDCQP, "scalar_test_data_class_qp", CONST,
 xststdcqp_kf)
BU_FLOAT128_HW_1 (SQRTF128_ODD,  "sqrtf128_round_to_odd",  FP, 
sqrtkf2_odd)
BU_FLOAT128_HW_1 (TRUNCF128_ODD, "truncf128_round_to_odd", FP, 
trunckfdf2_odd)
BU_FLOAT128_HW_2 (ADDF128_ODD,   "addf128_round_to_odd",   FP, 
addkf3_odd)
BU_FLOAT128_HW_2 (SUBF128_ODD,   "subf128_round_to_odd",   FP, 
subkf3_odd)
BU_FLOAT128_HW_2 (MULF128_ODD,   "mulf128_round_to_odd",   FP, 
mulkf3_odd)
BU_FLOAT128_HW_2 (DIVF128_ODD,   "divf128_round_to_odd",   FP, 
divkf3_odd)
BU_FLOAT128_HW_3 (FMAF128_ODD,   "fmaf128_round_to_odd",   FP, 
fmakf4_odd)


Starting from r12-5752-gd08236359eb229, these
scalar_cmp_exp_qp_{gt,lt,eq,unordered} bifs were put under stanza
ieee128-hw, which makes ieee128-hw have 17 bifs; compared to before, the
extra four are exactly these scalar_cmp_exp_qp_{gt,lt,eq,unordered}.

introduced in gcc12, not sure the checking condition was changed together
or by a standalone commit.  Anyway, apparently the conditions for the
support of these bifs are different on gcc-11 and gcc-12, I wonder why it
changed.  As mentioned above, PR108758's c#1 said this case (bifs) works
well on gcc-11, I suspected the condition change was an overkill, that's
why I asked.


It almost certainly was an oversight.  The new builtin framework changed
so many things, there was bound to be some breakage to go with all the
good things it brought.


Yeah, as per the above findings, I also found that r12-3126-g2ed356a4c9af06
introduced the power9 related stanzas and r12-3167-g2f9489a1009d98
introduced the ieee128-hw stanza including these four bifs; neither of them
has any notes on why we would change the condition for these
scalar_cmp_exp_qp_{gt,lt,eq,unordered} from power9-vector to ieee128-hw, so
I think it's just an oversight (ieee128-hw is an overkill compared to
power9-vector :)).



So what is the actual thing going wrong?  QP insns work fine and are
valid on all systems and environments, BE or LE, 32-bit or 64-bit.  Of
course you cannot use the "long double" type for those everywhere, but
that is a very different thing.


The actual thing going wrong is that the test case float128-cmp2-runnable.c
runs well on BE -m32 and -m64 with gcc-11, but fails to compile on BE -m32
with the latest gcc-12 and trunk, with error messages like:


gcc/testsuite/gcc.target/powerpc/float128-cmp2-runnable.c: In function 'main':
gcc/testsuite/gcc.target/powerpc/float128-cmp2-runnable.c:155:3: error:
  '__builtin_vsx_scalar_cmp_exp_qp_eq' requires ISA 3.0 IEEE 128-bit
  floating point

as scalar_cmp_exp_qp_{gt,lt,eq,unordered} require the condition
TARGET_FLOAT128_HW now (since the new bif framework took effect).

(To be 

Re: [PATCH] testsuite: update requires for powerpc/float128-cmp2-runnable.c

2023-04-12 Thread guojiufu via Gcc-patches

Hi Mike,

On 2023-04-12 22:46, Michael Meissner wrote:

On Wed, Apr 12, 2023 at 01:31:46PM +0800, Jiufu Guo wrote:

I understand that QP insns (e.g. xscmpexpqp) are valid if the system
meets ISA 3.0, no matter BE/LE, 32-bit/64-bit.
I think option -mfloat128-hardware is designed for QP insns.

While there is one issue, on BE machine, when compiling with options
"-mfloat128-hardware -m32", an error message is generated:
"error: '%<-mfloat128-hardware%>' requires '-m64'"

(I'm wondering if we need to relax this limitation.)


In the past, the machine independent portion of the compiler demanded
that for scalar mode, there be an integer mode of the same size, since
sometimes moves are converted to using an int RTL mode.  Since we don't
have TImode support in 32-bit, you would get various errors because
something tried to do a TImode move for KFmode types, and the TImode
wasn't available.

If somebody wants to verify that this now works on 32-bit and/or implements
TImode on 32-bit, then we can relax the restriction.


Thanks a lot for pointing out this!

BR,
Jeff (Jiufu)


Re: [PATCH] testsuite: update requires for powerpc/float128-cmp2-runnable.c

2023-04-11 Thread guojiufu via Gcc-patches

Hi Kewen,

Thanks a lot for your very helpful comments!

On 2023-04-10 17:26, Kewen.Lin wrote:

Hi Jeff,

on 2023/4/10 10:09, Jiufu Guo via Gcc-patches wrote:

Hi,

In this test case (float128-cmp2-runnable.c), the instruction
xscmpexpqp is used to support a few builtins e.g.
__builtin_vsx_scalar_cmp_exp_qp_eq on _Float128.
This instruction handles the whole 128bits of the vector, and
it is guarded by [ieee128-hw].


The instruction xscmpexpqp is guarded with TARGET_P9_VECTOR,

(define_insn "*xscmpexpqp"
  [(set (match_operand:CCFP 0 "cc_reg_operand" "=y")
(compare:CCFP
	 (unspec:IEEE128 [(match_operand:IEEE128 1 "altivec_register_operand" "v")
			  (match_operand:IEEE128 2 "altivec_register_operand" "v")]
  UNSPEC_VSX_SCMPEXPQP)
 (match_operand:SI 3 "zero_constant" "j")))]
  "TARGET_P9_VECTOR"
  "xscmpexpqp %0,%1,%2"
  [(set_attr "type" "fpcompare")])

[ieee128-hw] is used for guarding those bifs, so the above
statement doesn't quite match the fact.



Agree, I'm wondering if P9_VECTOR is perfect here, even if it indicates
the ISA which contains xscmpexpqp. Let me do more checking.


PR108758 said this case doesn't fail with gcc-10 and gcc-11,
I wonder why it changes from gcc-12?  The above define_insn
shows the underlying insns for these bifs just requires the
condition power9-vector.  Could you have a further check?
Thanks.


Thanks for raising this concern.
The behavior of checking the bif against FLOAT128_HW and emitting an error
message about the quad-precision requirement was added in gcc12. This is
why gcc12 fails to compile the case on -m32.

Before gcc12, altivec_resolve_overloaded_builtin returned the overloaded
result directly, and did not check the resulting function further.



btw, please add a PR marker for PR108758.


Sure,  thanks for catching this!


BR,
Jeff (Jiufu)



BR,
Kewen


So, we may update the testcase to require ppc_float128_hw.

Tested on ppc64 both BE and LE.
Is this ok for trunk?

BR,
Jeff (Jiufu)

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/float128-cmp2-runnable.c: Update requires.

---
 gcc/testsuite/gcc.target/powerpc/float128-cmp2-runnable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/powerpc/float128-cmp2-runnable.c 
b/gcc/testsuite/gcc.target/powerpc/float128-cmp2-runnable.c

index d376a3ca68e..91287c0fb7a 100644
--- a/gcc/testsuite/gcc.target/powerpc/float128-cmp2-runnable.c
+++ b/gcc/testsuite/gcc.target/powerpc/float128-cmp2-runnable.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-require-effective-target ppc_float128_sw } */
+/* { dg-require-effective-target ppc_float128_hw } */
 /* { dg-require-effective-target p9vector_hw } */
 /* { dg-options "-O2 -mdejagnu-cpu=power9 " } */



Re: [PATCH] loading float member of parameter stored via int registers

2022-12-21 Thread guojiufu via Gcc-patches

Hi,

On 2022-12-21 15:30, Richard Biener wrote:

On Wed, 21 Dec 2022, Jiufu Guo wrote:


Hi,

This patch fixes an issue with parameter accesses when the parameter
has struct type and is passed through integer registers, and a
floating-point member is accessed.  Like the code below:

typedef struct DF {double a[4]; long l; } DF;
double foo_df (DF arg){return arg.a[3];}

On ppc64le, with trunk gcc, "std 6,-24(1) ; lfd 1,-24(1)" is
generated, while the single instruction "mtvsrd 1,6" would be enough
for this case.


So why do we end up spilling for PPC?


Good question! According to the GCC source code (in function.cc/expr.cc),
it is common behavior: "word_mode" is used to store the parameter to the
stack, and the field's mode (e.g. a float mode) is used to load from the
stack.  But after some tries, I failed to construct such cases on many
other platforms.  So, I converted the fix to a target hook and implemented
the rs6000 part first.



struct X { int i; float f; };

float foo (struct X x)
{
  return x.f;
}

does pass the structure in $RDI on x86_64 and we manage (with
optimization, with -O0 we spill) to generate

	shrq	$32, %rdi
	movd	%edi, %xmm0

and RTL expansion generates

(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 4 3 2 (set (reg/v:DI 83 [ x ])
(reg:DI 5 di [ x ])) "t.c":4:1 -1
 (nil))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 7 2 (parallel [
(set (reg:DI 85)
(ashiftrt:DI (reg/v:DI 83 [ x ])
(const_int 32 [0x20])))
(clobber (reg:CC 17 flags))
]) "t.c":5:11 -1
 (nil))
(insn 7 6 8 2 (set (reg:SI 86)
(subreg:SI (reg:DI 85) 0)) "t.c":5:11 -1
 (nil))

I would imagine that for the ppc case we only see the subreg here
which should be even easier to optimize.

So how's this not fixable by providing proper patterns / subreg
capabilities?  Looking a bit at the RTL we have the issue might
be that nothing seems to handle CSE of



This case is also related to 'parameters of struct type'; PR89310 is
exactly this case, and it is fixed on trunk.
One difference: for "{int i; float f;}" the parameter is in DImode and
passed via an integer register, but for "{double a[4]; long l;}" the
parameter is in BLKmode and stored to the stack during argument setup.

(note 8 0 5 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 5 8 7 2 (set (mem/c:DI (plus:DI (reg/f:DI 110 sfp)
(const_int 56 [0x38])) [2 arg+24 S8 A64])
(reg:DI 6 6)) "t.c":2:23 679 {*movdi_internal64}
 (expr_list:REG_DEAD (reg:DI 6 6)
(nil)))
(note 7 5 10 2 NOTE_INSN_FUNCTION_BEG)
(note 10 7 15 2 NOTE_INSN_DELETED)
(insn 15 10 16 2 (set (reg/i:DF 33 1)
(mem/c:DF (plus:DI (reg/f:DI 110 sfp)
(const_int 56 [0x38])) [1 arg.a[3]+0 S8 A64])) "t.c":2:40
 576 {*movdf_hardfloat64}
 (nil))

Possibly because the store and load happen in a different mode?  Can
you see why CSE doesn't handle this (producing a subreg)?  On


Yes, exactly! For "{double a[4]; long l;}", the store and the load use
different modes, so CSE does not optimize them.  This patch makes the
store and the load use the same mode (DImode) and then leverages CSE to
handle it.
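
To illustrate with the dumps above (a hand-written sketch, not real
compiler output):

  before: (set (mem:DI stack+56) (reg:DI 6))
          (set (reg:DF 33) (mem:DF stack+56))  ; modes differ, CSE misses it

  after:  (set (mem:DI stack+56) (reg:DI 6))
          (set (reg:DI tmp) (mem:DI stack+56)) ; same mode, CSE -> (reg:DI 6)
          ... followed by a DImode-to-DFmode register move, i.e. "mtvsrd 1,6"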


the GIMPLE side we'd happily do that (but we don't see the argument
setup).


Thanks for your comments!


BR,
Jeff (Jiufu)



Thanks,
Richard.


This patch updates the behavior when loading a floating-point member of a
parameter: if that member was stored via an integer register, it is first
loaded in the integer mode and then converted to the floating mode.

I also thought of another method: convert the register to float mode
before storing it to the stack.  But there are cases that may still
prefer to keep the integer register store.

Bootstrap and regtest passes on ppc64{,le}.
I would like to ask for review comments, and whether this patch is
acceptable for trunk.


BR,
Jeff (Jiufu)

PR target/108073

gcc/ChangeLog:

* config/rs6000/rs6000.cc (TARGET_LOADING_INT_CONVERT_TO_FLOAT): New
macro definition.
(rs6000_loading_int_convert_to_float): New hook implement.
* doc/tm.texi: Regenerated.
* doc/tm.texi.in (loading_int_convert_to_float): New hook.
* expr.cc (expand_expr_real_1): Updated to use the new hook.
* target.def (loading_int_convert_to_float): New hook.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/pr102024.C: Update.
* gcc.target/powerpc/pr108073.c: New test.

---
 gcc/config/rs6000/rs6000.cc | 70 
+

 gcc/doc/tm.texi |  6 ++
 gcc/doc/tm.texi.in  |  2 +
 gcc/expr.cc | 15 +
 gcc/target.def  | 11 
 gcc/testsuite/g++.target/powerpc/pr102024.C |  2 +-
 gcc/testsuite/gcc.target/powerpc/pr108073.c | 24 +++
 7 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c

diff --git 

Re: [PATCH 2/3]rs6000: NFC use sext_hwi to replace ((v&0xf..f)^0x80..0) - 0x80..0

2022-12-01 Thread guojiufu via Gcc-patches

On 2022-12-01 15:10, Jiufu Guo via Gcc-patches wrote:

Hi Kewen,

On 12/1/22 2:11 PM, Kewen.Lin wrote:

on 2022/12/1 13:35, Jiufu Guo wrote:

Hi Kewen,

Thanks for your quick and insight review!

On 12/1/22 1:17 PM, Kewen.Lin wrote:

Hi Jeff,

on 2022/12/1 09:36, Jiufu Guo wrote:

Hi,

This patch just uses sext_hwi to replace expressions like:
((value & 0xf..f) ^ 0x80..0) - 0x80..0 in rs6000.cc and rs6000.md.

Bootstrap & regtest pass on ppc64{,le}.
Is this ok for trunk?


You didn't say it clearly but I guessed you have grepped in the whole
config/rs6000 directory, right?  I noticed there are still two places
using this kind of expression in function constant_generates_xxspltiw,
but I assumed it's intentional as their types are not HOST_WIDE_INT.

gcc/config/rs6000/rs6000.cc:  short sign_h_word = ((h_word & 0xffff) ^ 0x8000) - 0x8000;
gcc/config/rs6000/rs6000.cc:  int sign_word = ((word & 0xffffffff) ^ 0x80000000) - 0x80000000;


If so, could you state it clearly in commit log like "with type
signed/unsigned HOST_WIDE_INT" or similar?


Good question!

And as you said, sext_hwi is more for "signed/unsigned HOST_WIDE_INT".
For these two places, it seems sext_hwi is not actually needed!
And I did not see why these expressions are used; maybe just an
assignment is ok.


ah, I see.  I agree using the assignment is quite enough.  Could you
please also simplify them together?  Since they are with the form
"((value & 0xf..f) ^ 0x80..0) - 0x80..0" too, and can be refactored
in a better way.  Thanks!


Sure, I believe just "short sign_h_word = vsx_const->half_words[0];"
should be correct :-), and it is included in the updated patch.

The updated patch is attached; bootstrap is ongoing.


Bootstrap and regtest pass on ppc64{,le}.

BR,
Jeff (Jiufu)



BR,
Jeff (Jiufu)



BR,
Kewen



Re: [PATCH] Check if loading const from mem is faster

2022-02-23 Thread guojiufu via Gcc-patches




On 2/22/22 PM3:26, Richard Biener wrote:

On Tue, 22 Feb 2022, Jiufu Guo wrote:


Hi,

For constants, there is code to check whether a constant can be put into
an instruction as an immediate operand, or whether it is profitable to
load it from memory.  There are still some places that could be improved
for various platforms.

This patch handles PR63281/57836.  It does not change much of the code
around force_const_mem and legitimate_constant_p.  We may integrate these
APIs for passes like expand/cse/combine as a whole solution in the future
(maybe better for stage1?).

Bootstrap and regtest pass on ppc64le and x86_64. Is this ok for trunk?
Thanks for comments!


I'm not sure whether we need a new hook here, but iff, then I think
whether loading a constant (from memory?) is faster or not depends
on the context.  So what's the exact situation and the two variants
you are costing against each other?  I assume (since you are
touching CSE) you are costing


Hi Richard,

Thanks for your review!

In some contexts it may be faster to load a constant value from memory,
while for other constant values it is faster to materialize them as
immediates of a few (1 or 2) instructions.

For example, building 0x1234567812345678 on ppc64 takes about 3
instructions, so it would be better to put it in .rodata and load it
from memory.
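
To make the trade-off concrete, here is hand-written asm for
illustration (not GCC output; the 3-instruction build works here because
the two 32-bit halves happen to be equal):

	lis 9,0x1234        # r9 = 0x12340000
	ori 9,9,0x5678      # r9 = 0x12345678
	rldimi 9,9,32,0     # r9 = 0x1234567812345678

versus loading it from the constant pool:

	addis 9,2,.LC0@toc@ha
	ld 9,.LC0@toc@l(9)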

Currently, we already have hooks TARGET_CANNOT_FORCE_CONST_MEM and
TARGET_LEGITIMATE_CONSTANT_P.

TARGET_CANNOT_FORCE_CONST_MEM is used to check whether an 'rtx' can be
stored into the constant pool.
On some targets (e.g. alpha), TARGET_LEGITIMATE_CONSTANT_P already
behaves like what we expect.

I once thought of using TARGET_LEGITIMATE_CONSTANT_P too.
But in general this hook seems designed to check whether an 'rtx' can
be used as an immediate in an instruction.  It is used in the RTL
passes ira/reload, and also in recog.cc and expr.cc.

In other words, I feel that when deciding whether to put a constant in
the constant pool, we could check:
- If TARGET_CANNOT_FORCE_CONST_MEM returns true, we should not put
the 'constant' in the constant pool.
- If TARGET_LEGITIMATE_CONSTANT_P returns true, the 'constant' can be
the immediate of **one** instruction and is not put in the constant
pool.
- If the new hook TARGET_FASTER_LOADING_CONSTANT returns true, the
'constant' is stored in the constant pool; otherwise, it is better to
build the 'constant' with an instruction sequence.
This is why I introduce a new hook.
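
Putting the three checks together, the intended decision logic is
roughly (a sketch, not the actual patch; the new hook's name follows
this mail, its exact signature is my assumption):

  /* Decide how to materialize constant X of MODE.  */
  if (targetm.cannot_force_const_mem (mode, x))
    ; /* must not go to the constant pool: synthesize it.  */
  else if (targetm.legitimate_constant_p (mode, x))
    ; /* a single-instruction immediate: keep it as is.  */
  else if (targetm.faster_loading_constant (mode, x)) /* new hook */
    mem = force_const_mem (mode, x); /* load it from the pool.  */
  else
    ; /* build it with a short instruction sequence.  */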

We may also use the new hook in other places that call force_const_mem,
e.g. expand/combine...

Any suggestions?



   (set (...) (mem))  (before CSE)

against

   (set (...) (immediate))  (what CSE does now)

vs.

   (set (...) (mem))  (original, no CSE)

?  With the new hook you are skipping _all_ of the following loops
logic which does look like a quite bad design and hack (not that
I am very familiar with the candidate / costing logic in there).


In cse_insn, the following code (at the end of its main loop) also
tests the constant and tries to put it into memory:

  else if (crtl->uses_const_pool
   && CONSTANT_P (trial)
   && !CONST_INT_P (trial)
   && (src_folded == 0 || !MEM_P (src_folded))
   && GET_MODE_CLASS (mode) != MODE_CC
   && mode != VOIDmode)
{
  src_folded = force_const_mem (mode, trial);
  if (src_folded)
{
  src_folded_cost = COST (src_folded, mode);
  src_folded_regcost = approx_reg_cost (src_folded);
}
}

This code is at the end of the loop, so the result only takes effect in
the next iteration.  It may be better to test "does the constant need
to be put into memory" for all iterations.

The current patch adds an additional test before the loop.  I will
update the patch to integrate these two places!



We already have TARGET_INSN_COST which you could ask for a cost.
Like if we'd have a single_set then just temporarily substitute
the RHS with the candidate and cost the insns and compare against
the original insn cost.  So why exactly do you need a new hook
for this particular situation?


Thanks for pointing this out! Segher also mentioned this before.
Currently CSE uses rtx_cost; replacing rtx_cost with insn_cost would be
a good idea in all the necessary places, including CSE.

For this particular case (checking the cost of constants), I did not
use insn_cost, because using it would require creating a recognizable
insn temporarily, and for some kinds of constants we would need to
create a sequence of instructions on some platforms, e.g.
"li xx; ori; sldi ..." on ppc64, and sum the costs of those
instructions.  If we only create one fake instruction, insn_cost may
not return an accurate cost either.
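
Roughly, what using insn_cost would require (a sketch, under the
assumption that emit_move_insn really splits the constant into a real
instruction sequence on the target):

  /* Expand the constant into a temporary sequence and sum insn_cost.  */
  start_sequence ();
  emit_move_insn (gen_reg_rtx (DImode),
		  gen_int_mode (0x1234567812345678LL, DImode));
  rtx_insn *seq = get_insns ();
  end_sequence ();
  int cost = 0;
  for (rtx_insn *insn = seq; insn; insn = NEXT_INSN (insn))
    cost += insn_cost (insn, true /* speed */);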

BR,
Jiufu



Thanks,
Richard.




BR,
Jiufu

gcc/ChangeLog:

PR target/94393
PR rtl-optimization/63281
* config/rs6000/rs6000.cc 

Re: [PATCH] Check if loading const from mem is faster

2022-02-22 Thread guojiufu via Gcc-patches

On 2022-02-23 01:30, Segher Boessenkool wrote:

Hi Jiu Fu,

On Tue, Feb 22, 2022 at 02:53:13PM +0800, Jiufu Guo wrote:

 static bool
 rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, 
rtx x)

 {
-  if (GET_CODE (x) == HIGH
-  && GET_CODE (XEXP (x, 0)) == UNSPEC)
+  if (GET_CODE (x) == HIGH)
 return true;



Hi Segher,


This isn't explained anywhere.  "Update" is not enough ;-)
Thanks! I will add an explanation for it.  This excludes all 'HIGH'
codes for 'x'; the function "rs6000_emit_move" also checks whether the
code is 'HIGH'.

And on P10, I also encountered this kind of case:
 (high:DI (symbol_ref:DI ("var_1") [flags 0xc0] <var_decl ... var_1>))

which fails to be stored into .rodata.




CSE is the pass that is most ancient and still causing problems left 
and

right.  It should be rewritten sooner rather than later.

The problem with that is that the pass does so much more than just CSE,
and we don't want to lose all those other things.  So it will be a slow
arduous affair of peeling off bits into separate passes, I think :-(


Yes, it does a lot of work.  One of those additional jobs is the
'fold constants and put them in memory' handling.

BR,
Jiufu



Doing actual CSE without all the restrictive restrictions our pass has
historically had isn't the hard part!


Segher


Re: [PATCH] disable aggressive_loop_optimizations until niter ready

2022-01-13 Thread guojiufu via Gcc-patches

On 2022-01-03 22:30, Richard Biener wrote:

On Wed, 22 Dec 2021, Jiufu Guo wrote:


Hi,

Normally, estimate_numbers_of_iterations gets/calculates niter first,
and then invokes infer_loop_bounds_from_undefined.  But in some cases,
after a few call stacks, estimate_numbers_of_iterations is invoked
before niter is ready (e.g. before number_of_latch_executions returns).

e.g. number_of_latch_executions -> ... -> follow_ssa_edge_expr
  -> estimate_numbers_of_iterations -> infer_loop_bounds_from_undefined.

Since niter is still not computed, the call to
infer_loop_bounds_from_undefined may not get the final result.
To avoid calling infer_loop_bounds_from_undefined with interim state,
and to avoid it generating interim data while niter is being computed,
we could disable flag_aggressive_loop_optimizations during that window.


Bootstrap and regtest pass on ppc64* and x86_64.  Is this ok for 
trunk?


So this is an optimality fix, not a correctness one?  I suppose the
estimates are computed/used from scev_probably_wraps_p via
loop_exits_before_overflow and ultimatively chrec_convert.

We have a call cycle here,

estimate_numbers_of_iterations -> number_of_latch_executions ->
... -> estimate_numbers_of_iterations

where the first estimate_numbers_of_iterations will make sure
the later call will immediately return.


Hi Richard,
Thanks for your comments! And sorry for the late reply.

In estimate_numbers_of_iterations, there is a guard to make sure a
second, nested call to estimate_numbers_of_iterations returns
immediately.

Exactly as you said, it relates to scev_probably_wraps_p calling
loop_exits_before_overflow.

The issue is: the first call to estimate_numbers_of_iterations may
happen inside number_of_latch_executions.



I'm not sure what your patch tries to do - it seems to tackle
the case where we enter the cycle via number_of_latch_executions?
Why do we get "non-final" values?  idx_infer_loop_bounds resorts


Right, when the call cycle starts from number_of_latch_executions,
the issue may occur:

number_of_latch_executions(*1st call)->..->
analyze_scalar_evolution(IVs 1st) ->..follow_ssa_edge_expr..->
loop_exits_before_overflow->
estimate_numbers_of_iterations (*1st call)->
number_of_latch_executions(*2nd call)->..->
analyze_scalar_evolution(IVs 2nd)->..loop_exits_before_overflow-> 
estimate_numbers_of_iterations(*2nd call)


The second call to estimate_numbers_of_iterations returns quickly.
And then, in the first call to estimate_numbers_of_iterations,
infer_loop_bounds_from_undefined is invoked.

The function "infer_loop_bounds_from_undefined" instantiates/analyzes
the SCEV of each SSA name in the loop.
*Here the issue occurs*: those SCEVs are based on the interim IV SCEVs
coming from "analyze_scalar_evolution (IVs 2nd)", and those IV SCEVs
will later be overridden by the outer
"analyze_scalar_evolution (IVs 1st)".

To handle this issue, disabling flag_aggressive_loop_optimizations
inside number_of_latch_executions is one method.
To also avoid the issue when the call cycle starts from
number_of_iterations_exit or number_of_iterations_exit_assumptions,
this patch disables flag_aggressive_loop_optimizations inside
number_of_iterations_exit_assumptions.

Thanks again.

BR,
Jiufu


to SCEV and thus may recurse again - to me it would be more
logical to try avoid recursing in number_of_latch_executions by
setting ->nb_iterations to something early, maybe chrec_dont_know,
to signal we're using something we're just trying to compute.

Richard.


BR,
Jiufu

gcc/ChangeLog:

* tree-ssa-loop-niter.c (number_of_iterations_exit_assumptions):
Disable/restore flag_aggressive_loop_optimizations.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/scev-16.c: New test.

---
 gcc/tree-ssa-loop-niter.c   | 23 +++
 gcc/testsuite/gcc.dg/tree-ssa/scev-16.c | 20 
 2 files changed, 39 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/scev-16.c

diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 06954e437f5..51bb501019e 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -2534,18 +2534,31 @@ number_of_iterations_exit_assumptions (class 
loop *loop, edge exit,

   && !POINTER_TYPE_P (type))
 return false;

+  /* Before niter is calculated, avoid to analyze interim state. */
+  int old_aggressive_loop_optimizations = flag_aggressive_loop_optimizations;
+  flag_aggressive_loop_optimizations = 0;
+
   tree iv0_niters = NULL_TREE;
   if (!simple_iv_with_niters (loop, loop_containing_stmt (stmt),
  op0, &iv0, safe ? &iv0_niters : NULL, false))
-return number_of_iterations_popcount (loop, exit, code, niter);
+{
+  bool res = number_of_iterations_popcount (loop, exit, code, niter);
+  flag_aggressive_loop_optimizations = old_aggressive_loop_optimizations;
+  return res;
+}
   tree iv1_niters = NULL_TREE;
   if (!simple_iv_with_niters (loop, 

Re: [RFC] Overflow check in simplifying exit cond comparing two IVs.

2021-10-27 Thread guojiufu via Gcc-patches



I just had a test on ppc64le; this patch passes bootstrap and regtest.
Is this patch OK for trunk?

Thanks for any comments.

BR,
Jiufu

On 2021-10-18 21:37, Jiufu Guo wrote:

With reference to the discussions in:
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574334.html
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572006.html
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578672.html

Based on the patches in the above discussion, we drafted a patch to fix
the issue.

In this patch, to make sure it is ok to change '{b0,s0} op {b1,s1}' to
'{b0,s0-s1} op {b1,0}', we also compute a condition under which both
IVs can be assumed not to overflow/wrap: the niter of
'{b0,s0-s1} op {b1,0}' must be less than the niter until iv0 or iv1
wraps.
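
A small worked example of the transform (my own numbers, for
illustration): with iv0 = {0, +, 3} and iv1 = {10, +, 1} under
iv0 < iv1, the rewritten form is {0, +, 2} < {10, +, 0}.  Both run for
5 iterations (0<10, 3<11, 6<12, 9<13, 12<14, then 15<15 fails, just as
0,2,4,6,8 pass and 10<10 fails), provided neither IV wraps within those
iterations, which is what the computed assumption guarantees.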

Does this patch make sense?

BR,
Jiufu Guo

gcc/ChangeLog:

PR tree-optimization/100740
* tree-ssa-loop-niter.c (number_of_iterations_cond): Add
assume condition for combining of two IVs

gcc/testsuite/ChangeLog:

* gcc.c-torture/execute/pr100740.c: New test.
---
 gcc/tree-ssa-loop-niter.c | 103 +++---
 .../gcc.c-torture/execute/pr100740.c  |  11 ++
 2 files changed, 99 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr100740.c

diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 75109407124..f2987a4448d 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1863,29 +1863,102 @@ number_of_iterations_cond (class loop *loop,

  provided that either below condition is satisfied:

-   a) the test is NE_EXPR;
-   b) iv0.step - iv1.step is integer and iv0/iv1 don't overflow.
+   a) iv0.step - iv1.step is integer and iv0/iv1 don't overflow.
+   b) assumptions in below table also need to be satisfied.
+
+   | iv0 | iv1 | assumption |
+   ...
+   The first three rows: iv0->step > iv1->step;
+   the second three rows: iv0->step < iv1->step.

  This rarely occurs in practice, but it is simple enough to 
manage.  */

   if (!integer_zerop (iv0->step) && !integer_zerop (iv1->step))
 {
+  if (TREE_CODE (iv0->step) != INTEGER_CST
+ || TREE_CODE (iv1->step) != INTEGER_CST)
+   return false;
+  if (!iv0->no_overflow || !iv1->no_overflow)
+   return false;
+
   tree step_type = POINTER_TYPE_P (type) ? sizetype : type;
-  tree step = fold_binary_to_constant (MINUS_EXPR, step_type,
-  iv0->step, iv1->step);
-
-  /* No need to check sign of the new step since below code takes care
-of this well.  */
-  if (code != NE_EXPR
- && (TREE_CODE (step) != INTEGER_CST
- || !iv0->no_overflow || !iv1->no_overflow))
+  tree step
+	= fold_binary_to_constant (MINUS_EXPR, step_type, iv0->step, iv1->step);
+
+  if (code != NE_EXPR && tree_int_cst_sign_bit (step))
return false;

-  iv0->step = step;
-  if (!POINTER_TYPE_P (type))
-   iv0->no_overflow = false;
+  bool positive0 = !tree_int_cst_sign_bit (iv0->step);
+  bool positive1 = !tree_int_cst_sign_bit (iv1->step);

-  iv1->step = build_int_cst (step_type, 0);
-  iv1->no_overflow = true;
+  /* Cases in rows 2 and 4 of above table.  */
+  if ((positive0 && !positive1) || (!positive0 && positive1))
+   {
+ iv0->step = step;
+ iv1->step = build_int_cst (step_type, 0);
+ return number_of_iterations_cond (loop, type, iv0, code, iv1,
+   niter, only_exit, every_iteration);
+   }
+
+  affine_iv i_0, i_1;
+  class tree_niter_desc num;
+  i_0 = *iv0;
+  i_1 = *iv1;
+  i_0.step = step;
+  i_1.step = build_int_cst (step_type, 0);
+  if (!number_of_iterations_cond (loop, type, &i_0, code, &i_1, &num,
+				  only_exit, every_iteration))
+   return false;
+
+  affine_iv i0, i1;
+  class tree_niter_desc num_wrap;
+  i0 = *iv0;
+  i1 = *iv1;
+
+  /* Reset iv0 and iv1 to calculate the niter which cause overflow.  */
+  if (tree_int_cst_lt (i1.step, i0.step))
+   {
+ if (positive0 && positive1)
+   i0.step = build_int_cst (step_type, 0);
+ else if (!positive0 && !positive1)
+   i1.step = build_int_cst (step_type, 0);
+ if (code == NE_EXPR)
+   code = LT_EXPR;
+   }
+  else
+   {
+ if (positive0 && positive1)
+   i1.step = build_int_cst (step_type, 0);
+ else if (!positive0 && !positive1)
+   i0.step = build_int_cst (step_type, 0);
+ gcc_assert (code == NE_EXPR);
+ code = GT_EXPR;
+   }
+
+  /* Calculate the niter which cause overflow.  */
+  if (!number_of_iterations_cond (loop, type, &i0, code, &i1, &num_wrap,
+				  only_exit, every_iteration))
+   return false;
+
+  /* Make assumption there is no overflow. */
+  tree assum
+   = 

Re: [PATCH] Use fold_build2 instead fold_binary for TRUTH_AND

2021-10-19 Thread guojiufu via Gcc-patches

On 2021-10-20 10:44, Andrew Pinski wrote:

On Tue, Oct 19, 2021 at 7:30 PM Jiufu Guo via Gcc-patches
 wrote:


In tree_simplify_using_condition_1, there is code which computes the
logic "op0 || op1"/"op0 && op1".  When creating the expressions for
TRUTH_OR_EXPR and TRUTH_AND_EXPR, fold_build2 could be used instead of
fold_binary, which always returns NULL_TREE for this kind of expr.

Bootstrap and regtest pass on ppc and ppc64le.  Is this ok for trunk?


No, because I think it is the wrong thing to do as we will be throwing
away the result if the fold_binary is not an integer cst anyways so
creating an extra tree is a waste.


Hi Andrew,

Thanks for your great comments!  I understand your explanation now.  And
there are already non-nullness checks and zero/nonzero constant checks,
as you said.

I agree with you now :) because if "op0 && op1"/"op0 || op1" can be
folded (especially into a nonzero/zero constant), fold_binary is enough.
And when fold_build2 would have to create a new tree expr for
TRUTH_AND_EXPR/TRUTH_OR_EXPR, the result cannot be a constant anymore.
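
In short, the difference between the two (a minimal sketch):

  tree e1 = fold_binary (TRUTH_OR_EXPR, boolean_type_node, op0, op1);
  /* e1 is NULL_TREE unless the expression simplifies (e.g. to a
     constant); no new tree node is built.  */

  tree e2 = fold_build2 (TRUTH_OR_EXPR, boolean_type_node, op0, op1);
  /* e2 is never NULL_TREE: if folding fails, a new TRUTH_OR_EXPR node
     is built; that is wasted work when only a constant result is
     useful.  */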

BR,
Jiufu





BR,
Jiufu

gcc/ChangeLog:

* tree-ssa-loop-niter.c (tree_simplify_using_condition_1): 
Replace

fold_binary with fold_build2 fir logical OR/AND.

---
 gcc/tree-ssa-loop-niter.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 75109407124..27e11a29707 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -2290,12 +2290,12 @@ tree_simplify_using_condition_1 (tree cond, 
tree expr)


   /* Check whether COND ==> EXPR.  */
   notcond = invert_truthvalue (cond);
-  e = fold_binary (TRUTH_OR_EXPR, boolean_type_node, notcond, expr);
+  e = fold_build2 (TRUTH_OR_EXPR, boolean_type_node, notcond, expr);
   if (e && integer_nonzerop (e))
 return e;


We already check for non-nullness and we also check to see it is an
integer which is nonzero. So building a tree which will be thrown away
is just a waste and all.



   /* Check whether COND ==> not EXPR.  */
-  e = fold_binary (TRUTH_AND_EXPR, boolean_type_node, cond, expr);
+  e = fold_build2 (TRUTH_AND_EXPR, boolean_type_node, cond, expr);
   if (e && integer_zerop (e))
 return e;


Likewise.

Thanks,
Andrew Pinski



--
2.17.1



Re: [PATCH] testsuite: Fix gcc.dg/vect/pr101145* tests [PR101145]

2021-08-31 Thread guojiufu via Gcc-patches

On 2021-08-31 20:12, Jakub Jelinek wrote:

Hi!

I'm getting:
FAIL: gcc.dg/vect/pr101145.c scan-tree-dump-times vect "vectorized 1 
loops" 7
FAIL: gcc.dg/vect/pr101145_1.c scan-tree-dump-times vect "vectorized 1 
loops" 2
FAIL: gcc.dg/vect/pr101145_2.c scan-tree-dump-times vect "vectorized 1 
loops" 2
FAIL: gcc.dg/vect/pr101145_3.c scan-tree-dump-times vect "vectorized 1 
loops" 2

FAIL: gcc.dg/vect/pr101145.c -flto -ffat-lto-objects
scan-tree-dump-times vect "vectorized 1 loops" 7
FAIL: gcc.dg/vect/pr101145_1.c -flto -ffat-lto-objects
scan-tree-dump-times vect "vectorized 1 loops" 2
FAIL: gcc.dg/vect/pr101145_2.c -flto -ffat-lto-objects
scan-tree-dump-times vect "vectorized 1 loops" 2
FAIL: gcc.dg/vect/pr101145_3.c -flto -ffat-lto-objects
scan-tree-dump-times vect "vectorized 1 loops" 2
on i686-linux (or x86_64-linux with -m32/-mno-sse).
The problem is that those tests use dg-options, which in */vect/ 
testsuite
throws away all the carefully added default options to enable 
vectorization
on each target (and which e.g. vect_int etc. effective targets rely 
on).

The old way would be to name those tests gcc.dg/vect/O3-pr101145*,
but we can also use dg-additional-options (which doesn't throw away the
default options, just appends to them), which is IMO better so that we
don't have to rename the tests.

Tested on x86_64-linux and i686-linux, ok for trunk?

2021-08-31  Jakub Jelinek  

PR tree-optimization/102072
* gcc.dg/vect/pr101145.c: Use dg-additional-options with just -O3
instead of dg-options with -O3 -fdump-tree-vect-details.
* gcc.dg/vect/pr101145_1.c: Likewise.
* gcc.dg/vect/pr101145_2.c: Likewise.
* gcc.dg/vect/pr101145_3.c: Likewise.

--- gcc/testsuite/gcc.dg/vect/pr101145.c.jj	2021-08-30 
08:36:11.295515537 +0200
+++ gcc/testsuite/gcc.dg/vect/pr101145.c	2021-08-31 14:04:35.691964573 
+0200

@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-options "-O3 -fdump-tree-vect-details" } */
+/* { dg-additional-options "-O3" } */
 #include 

 unsigned __attribute__ ((noinline))
--- gcc/testsuite/gcc.dg/vect/pr101145_1.c.jj   2021-08-30
08:36:11.295515537 +0200
+++ gcc/testsuite/gcc.dg/vect/pr101145_1.c	2021-08-31 
14:04:55.083691474 +0200

@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-options "-O3 -fdump-tree-vect-details" } */
+/* { dg-additional-options "-O3" } */
 #define TYPE signed char
 #define MIN -128
 #define MAX 127
--- gcc/testsuite/gcc.dg/vect/pr101145_2.c.jj   2021-08-30
08:36:11.295515537 +0200
+++ gcc/testsuite/gcc.dg/vect/pr101145_2.c	2021-08-31 
14:05:05.868539591 +0200

@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-options "-O3 -fdump-tree-vect-details" } */
+/* { dg-additional-options "-O3" } */
 #define TYPE unsigned char
 #define MIN 0
 #define MAX 255
--- gcc/testsuite/gcc.dg/vect/pr101145_3.c.jj   2021-08-30
08:36:11.295515537 +0200
+++ gcc/testsuite/gcc.dg/vect/pr101145_3.c	2021-08-31 
14:05:17.903370103 +0200

@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-options "-O3 -fdump-tree-vect-details" } */
+/* { dg-additional-options "-O3" } */
 #define TYPE int *
 #define MIN ((TYPE)0)
 #define MAX ((TYPE)((long long)-1))

Jakub


Hi Jakub,

Thanks for pointing this out!
I just found that most of the cases in /vect/ use dg-additional-options
instead of dg-options.


BR.
Jiufu Guo


Re: [PATCH] Set bound/cmp/control for until wrap loop.

2021-08-30 Thread guojiufu via Gcc-patches

On 2021-08-30 20:02, Richard Biener wrote:

On Mon, 30 Aug 2021, guojiufu wrote:


On 2021-08-30 14:15, Jiufu Guo wrote:
> Hi,
>
> In patch r12-3136, niter->control, niter->bound and niter->cmp are
> derived from number_of_iterations_lt.  But for the 'until wrap'
> condition, the calculation in number_of_iterations_lt does not match
> the requirements on their definitions and the requirements in
> determine_exit_conditions.
>
> This patch calculates niter->control, niter->bound and niter->cmp in
> number_of_iterations_until_wrap.
>
> The ICEs in the PR are gone with this patch.
> Bootstrap and reg-tests pass on ppc64/ppc64le and x86.
> Is this ok for trunk?
>
> BR.
> Jiufu Guo
>
Add ChangeLog:
gcc/ChangeLog:

2021-08-30  Jiufu Guo  

PR tree-optimization/102087
* tree-ssa-loop-niter.c (number_of_iterations_until_wrap):
Set bound/cmp/control for niter.

gcc/testsuite/ChangeLog:

2021-08-30  Jiufu Guo  

PR tree-optimization/102087
* gcc.dg/vect/pr101145_3.c: Update tests.
* gcc.dg/pr102087.c: New test.

> ---
>  gcc/tree-ssa-loop-niter.c  | 14 +-
>  gcc/testsuite/gcc.dg/pr102087.c| 25 +
>  gcc/testsuite/gcc.dg/vect/pr101145_3.c |  4 +++-
>  3 files changed, 41 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr102087.c
>
> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index 7af92d1c893..747f04d3ce0 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -1482,7 +1482,7 @@ number_of_iterations_until_wrap (class loop *,
> tree type, affine_iv *iv0,
> affine_iv *iv1, class tree_niter_desc *niter)
>  {
>tree niter_type = unsigned_type_for (type);
> -  tree step, num, assumptions, may_be_zero;
> +  tree step, num, assumptions, may_be_zero, span;
>wide_int high, low, max, min;
>
>may_be_zero = fold_build2 (LE_EXPR, boolean_type_node, iv1->base,
> iv0->base);
> @@ -1513,6 +1513,8 @@ number_of_iterations_until_wrap (class loop *,
> tree type, affine_iv *iv0,
>   low = wi::to_wide (iv0->base);
> else
>low = min;
> +
> +  niter->control = *iv1;
>  }
>/* {base, -C} < n.  */
>else if (tree_int_cst_sign_bit (iv0->step) && integer_zerop
> (iv1->step))
> @@ -1533,6 +1535,8 @@ number_of_iterations_until_wrap (class loop *,
> tree type, affine_iv *iv0,
>   high = wi::to_wide (iv1->base);
> else
>high = max;
> +
> +  niter->control = *iv0;
>  }
>else
>  return false;


it looks like the above two should already be in effect from the
caller (guarding with integer_nozerop)?


I added them just to set all these fields in one function.
Yes, they have been set in the caller already; I can remove them here.




> @@ -1556,6 +1560,14 @@ number_of_iterations_until_wrap (class loop *,
> tree type, affine_iv *iv0,
>niter->assumptions, assumptions);
>
>niter->control.no_overflow = false;
> +  niter->control.base = fold_build2 (MINUS_EXPR, niter_type,
> +   niter->control.base,
> niter->control.step);


how do we know IVn - STEP doesn't already wrap?


The last IV value just crosses the max/min value of the type at the
last iteration, so IVn - STEP is the nearest value to max (or min) and
does not wrap.


A comment might be
good to explain you're turning the simplified exit condition into

   { IVbase - STEP, +, STEP } != niter * STEP + (IVbase - STEP)

which, when mathematically looking at it makes me wonder why there's
the seemingly redundant '- STEP' term?  Also is NE_EXPR really
correct since STEP might be not 1?  Only for non equality compares
the '- STEP' should matter?


I need to add a comment for this; it is a little tricky.
The last value of the original IV crosses max/min by at most one STEP,
and at that point the wrap has already happened.
Using "{IVbase, +, STEP} != niter * STEP + IVbase" would not be wrong
as an exit condition.

But it does not work well with existing code like
determine_exit_conditions, which converts NE_EXPR to LT_EXPR/GT_EXPR.
So the '- STEP' is added to adjust IV.base and the bound; with '- STEP'
the bound is the last value just before the wrap.
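
A concrete example (my own numbers, just for illustration): for an
unsigned char IV {250, +, 3} with niter = 2, the unadjusted bound would
be 250 + 3*2 = 256, which wraps to 0; with the '- STEP' adjustment the
control IV is {247, +, 3} and the bound is 247 + 3*2 = 253, the last
value before the wrap, which determine_exit_conditions can safely turn
into an LT/GT compare.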

Thanks again for your review!

BR.
Jiufu



Richard.


> +  span = fold_build2 (MULT_EXPR, niter_type, niter->niter,
> +fold_convert (niter_type, niter->control.step));
> +  niter->bound = fold_build2 (PLUS_EXPR, niter_type, span,
> +fold_convert (niter_type, niter->control.base));
> +  niter->bound = fold_convert (type, niter->bound);
> +  niter->cmp = NE_EXPR;
>
>return true;
> }
> diff --git a/gcc/testsuite/gcc.dg/pr102087.c
> b/gcc/testsuite/gcc.dg/pr102087.c
> new file mode 100644
> index 000..ef1f9f5cba9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr102087.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +
> +unsigned __attribute__ ((noinline))
> +foo (int 

Re: [PATCH] Set bound/cmp/control for until wrap loop.

2021-08-30 Thread guojiufu via Gcc-patches

On 2021-08-30 14:15, Jiufu Guo wrote:

Hi,

In patch r12-3136, niter->control, niter->bound and niter->cmp are
derived from number_of_iterations_lt.  But for the 'until wrap'
condition, the calculation in number_of_iterations_lt does not match
the requirements on their definitions and the requirements in
determine_exit_conditions.

This patch calculates niter->control, niter->bound and niter->cmp in
number_of_iterations_until_wrap.

The ICEs in the PR are gone with this patch.
Bootstrap and reg-tests pass on ppc64/ppc64le and x86.
Is this ok for trunk?

BR.
Jiufu Guo


Add ChangeLog:
gcc/ChangeLog:

2021-08-30  Jiufu Guo  

PR tree-optimization/102087
* tree-ssa-loop-niter.c (number_of_iterations_until_wrap):
Set bound/cmp/control for niter.

gcc/testsuite/ChangeLog:

2021-08-30  Jiufu Guo  

PR tree-optimization/102087
* gcc.dg/vect/pr101145_3.c: Update tests.
* gcc.dg/pr102087.c: New test.


---
 gcc/tree-ssa-loop-niter.c  | 14 +-
 gcc/testsuite/gcc.dg/pr102087.c| 25 +
 gcc/testsuite/gcc.dg/vect/pr101145_3.c |  4 +++-
 3 files changed, 41 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr102087.c

diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 7af92d1c893..747f04d3ce0 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1482,7 +1482,7 @@ number_of_iterations_until_wrap (class loop *,
tree type, affine_iv *iv0,
 affine_iv *iv1, class tree_niter_desc *niter)
 {
   tree niter_type = unsigned_type_for (type);
-  tree step, num, assumptions, may_be_zero;
+  tree step, num, assumptions, may_be_zero, span;
   wide_int high, low, max, min;

   may_be_zero = fold_build2 (LE_EXPR, boolean_type_node, iv1->base, 
iv0->base);

@@ -1513,6 +1513,8 @@ number_of_iterations_until_wrap (class loop *,
tree type, affine_iv *iv0,
low = wi::to_wide (iv0->base);
   else
low = min;
+
+  niter->control = *iv1;
 }
   /* {base, -C} < n.  */
   else if (tree_int_cst_sign_bit (iv0->step) && integer_zerop 
(iv1->step))

@@ -1533,6 +1535,8 @@ number_of_iterations_until_wrap (class loop *,
tree type, affine_iv *iv0,
high = wi::to_wide (iv1->base);
   else
high = max;
+
+  niter->control = *iv0;
 }
   else
 return false;
@@ -1556,6 +1560,14 @@ number_of_iterations_until_wrap (class loop *,
tree type, affine_iv *iv0,
  niter->assumptions, assumptions);

   niter->control.no_overflow = false;
+  niter->control.base = fold_build2 (MINUS_EXPR, niter_type,
+niter->control.base, niter->control.step);
+  span = fold_build2 (MULT_EXPR, niter_type, niter->niter,
+ fold_convert (niter_type, niter->control.step));
+  niter->bound = fold_build2 (PLUS_EXPR, niter_type, span,
+ fold_convert (niter_type, niter->control.base));
+  niter->bound = fold_convert (type, niter->bound);
+  niter->cmp = NE_EXPR;

   return true;
 }
diff --git a/gcc/testsuite/gcc.dg/pr102087.c 
b/gcc/testsuite/gcc.dg/pr102087.c

new file mode 100644
index 000..ef1f9f5cba9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr102087.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+unsigned __attribute__ ((noinline))
+foo (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+volatile int a[1];
+unsigned b;
+int c;
+
+int
+check ()
+{
+  int d;
+  for (; b > 1; b++)
+for (c = 0; c < 2; c++)
+  for (d = 0; d < 2; d++)
+   a[0];
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr101145_3.c
b/gcc/testsuite/gcc.dg/vect/pr101145_3.c
index 99289afec0b..40cb0240aaa 100644
--- a/gcc/testsuite/gcc.dg/vect/pr101145_3.c
+++ b/gcc/testsuite/gcc.dg/vect/pr101145_3.c
@@ -1,5 +1,6 @@
 /* { dg-require-effective-target vect_int } */
 /* { dg-options "-O3 -fdump-tree-vect-details" } */
+
 #define TYPE int *
 #define MIN ((TYPE)0)
 #define MAX ((TYPE)((long long)-1))
@@ -10,4 +11,5 @@

 #include "pr101145.inc"

-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
+/* pointer size may not be vectorized, checking niter is ok. */
+/* { dg-final { scan-tree-dump "Symbolic number of iterations is" "vect" } } */


Re: Ping: [PATCH v2] Analyze niter for until-wrap condition [PR101145]

2021-08-24 Thread guojiufu via Gcc-patches

On 2021-08-16 09:33, Bin.Cheng wrote:
On Wed, Aug 4, 2021 at 10:42 AM guojiufu  
wrote:



...

>> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.inc
>> b/gcc/testsuite/gcc.dg/vect/pr101145.inc
>> new file mode 100644
>> index 000..6eed3fa8aca
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/pr101145.inc
>> @@ -0,0 +1,63 @@
>> +TYPE __attribute__ ((noinline))
>> +foo_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
>> +{
>> +  for (l = L_BASE; n < l; l += C)
>> +*a++ = *b++ + 1;
>> +  return l;
>> +}
>> +
>> +TYPE __attribute__ ((noinline))
>> +bar_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
>> +{
>> +  for (l = L_BASE_DOWN; l < n; l -= C)

I noticed that both L_BASE and L_BASE_DOWN are defined as l, which
makes this test a bit confusing.  Could you clean the use of l, for
example, by using an auto var for the loop index variable?
Otherwise the patch looks good to me.  Thanks very much for the work.


Hi,

Sorry for bothering you here.
I take this as an approval (with the comment addressed) already :)

With the code changed to make it a little clearer, as in:
  TYPE i;
  for (i = l; n < i; i += C)

it may be ok to commit the patch to trunk, right?

BR,
Jiufu



Thanks,
bin

>> +*a++ = *b++ + 1;
>> +  return l;
>> +}
>> +
>> +int __attribute__ ((noinline)) neq (int a, int b) { return a != b; }
>> +
>> +int a[1000], b[1000];
>> +int fail;
>> +
>> +int

...

>> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145_1.c
>> b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
>> new file mode 100644
>> index 000..94f6b99b893
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
>> @@ -0,0 +1,15 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
>> +#define TYPE signed char
>> +#define MIN -128
>> +#define MAX 127
>> +#define N_BASE (MAX - 32)
>> +#define N_BASE_DOWN (MIN + 32)
>> +
>> +#define C 3
>> +#define L_BASE l
>> +#define L_BASE_DOWN l
>> +


Ping: [PATCH v2] Analyze niter for until-wrap condition [PR101145]

2021-08-03 Thread guojiufu via Gcc-patches

Hi,

I would like to have a ping on this.

https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574596.html

BR,
Jiufu

On 2021-07-15 08:17, guojiufu via Gcc-patches wrote:

Hi,

I would like to have an early ping on this with more mail addresses.

BR,
Jiufu.

On 2021-07-07 20:47, Jiufu Guo wrote:

Changes since v1:
* Update assumptions for niter, add more test cases check
* Use widest_int/wide_int instead mpz to do +-/
* Move some early check for quick return

For code like:
unsigned foo(unsigned val, unsigned start)
{
  unsigned cnt = 0;
  for (unsigned i = start; i > val; ++i)
cnt++;
  return cnt;
}

The number of iterations should be about UINT_MAX - start.
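(More precisely: for start > val, the body runs for i = start, start+1,
..., UINT_MAX, that is UINT_MAX - start + 1 times, after which i wraps
to 0 and the unsigned comparison i > val fails.)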

There is a function adjust_cond_for_loop_until_wrap which handles
similar work for constant bases.
Like adjust_cond_for_loop_until_wrap, this patch enhances
number_of_iterations_cond/number_of_iterations_lt to analyze the
number of iterations for this kind of loop.

Bootstrap and regtest pass on powerpc64le, x86_64 and aarch64.
Is this ok for trunk?

gcc/ChangeLog:

2021-07-07  Jiufu Guo  

PR tree-optimization/101145
* tree-ssa-loop-niter.c (number_of_iterations_until_wrap):
New function.
(number_of_iterations_lt): Invoke above function.
(adjust_cond_for_loop_until_wrap):
Merge to number_of_iterations_until_wrap.
(number_of_iterations_cond): Update invokes for
adjust_cond_for_loop_until_wrap and number_of_iterations_lt.

gcc/testsuite/ChangeLog:

2021-07-07  Jiufu Guo  

PR tree-optimization/101145
* gcc.dg/vect/pr101145.c: New test.
* gcc.dg/vect/pr101145.inc: New test.
* gcc.dg/vect/pr101145_1.c: New test.
* gcc.dg/vect/pr101145_2.c: New test.
* gcc.dg/vect/pr101145_3.c: New test.
* gcc.dg/vect/pr101145inf.c: New test.
* gcc.dg/vect/pr101145inf.inc: New test.
* gcc.dg/vect/pr101145inf_1.c: New test.
---
 gcc/testsuite/gcc.dg/vect/pr101145.c  | 187 
++

 gcc/testsuite/gcc.dg/vect/pr101145.inc|  63 
 gcc/testsuite/gcc.dg/vect/pr101145_1.c|  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_2.c|  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_3.c|  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145inf.c   |  25 +++
 gcc/testsuite/gcc.dg/vect/pr101145inf.inc |  28 
 gcc/testsuite/gcc.dg/vect/pr101145inf_1.c |  23 +++
 gcc/tree-ssa-loop-niter.c | 157 ++
 9 files changed, 463 insertions(+), 65 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf.inc
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf_1.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.c
b/gcc/testsuite/gcc.dg/vect/pr101145.c
new file mode 100644
index 000..74031b031cf
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr101145.c
@@ -0,0 +1,187 @@
+/* { dg-require-effective-target vect_int } */
+/* { dg-options "-O3 -fdump-tree-vect-details" } */
+#include 
+
+unsigned __attribute__ ((noinline))
+foo (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, 
unsigned)

+{
+  while (UINT_MAX - 64 < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  l = UINT_MAX - 32;
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_3 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  while (n <= ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_4 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{  // infinite
+  while (0 <= ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_5 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  //no loop
+  l = UINT_MAX;
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  while (--l < n)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, 
unsigned)

+{
+  while (--l < 64)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  l = 32;
+  while (--l < n)
+*a++ = *b++ + 1;

Re: [PATCH V3] Use preferred mode for doloop IV [PR61837]

2021-07-29 Thread guojiufu via Gcc-patches

On 2021-07-27 23:40, Jeff Law wrote:

On 7/27/2021 12:27 AM, Richard Biener wrote:

On Fri, 23 Jul 2021, Jeff Law wrote:



On 7/15/2021 4:08 AM, Jiufu Guo via Gcc-patches wrote:

Refine code for V2 according to review comments:
* Use if check instead assert, and refine assert
* Use better RE check for test case, e.g. (?n)/(?p)
* Use better wording for target.def

Currently, the doloop.xx variable uses the same type as niter, which
may be shorter than the word size.  For some targets, it would be
better to use a word-size type.  For example, on a 64-bit system,
accessing a 32-bit value may need a subreg, so a 64-bit type may be
better for niter if the value can be represented in both 32 and 64
bits.
it can be present in both 32bit and 64bit.

This patch add target hook for querg perferred mode for doloop IV.
And update mode accordingly.

Bootstrap and regtest pass on powerpc64le, is this ok for trunk?

BR.
Jiufu

gcc/ChangeLog:

2021-07-15  Jiufu Guo  

  PR target/61837
  * config/rs6000/rs6000.c (TARGET_PREFERRED_DOLOOP_MODE): New hook.
  (rs6000_preferred_doloop_mode): New hook.
  * doc/tm.texi: Regenerate.
  * doc/tm.texi.in: Add hook preferred_doloop_mode.
  * target.def (preferred_doloop_mode): New hook.
  * targhooks.c (default_preferred_doloop_mode): New hook.
  * targhooks.h (default_preferred_doloop_mode): New hook.
  * tree-ssa-loop-ivopts.c (compute_doloop_base_on_mode): New 
function.

  (add_iv_candidate_for_doloop): Call targetm.preferred_doloop_mode
  and compute_doloop_base_on_mode.

gcc/testsuite/ChangeLog:

2021-07-15  Jiufu Guo  

  PR target/61837
  * gcc.target/powerpc/pr61837.c: New test.
My first reaction was that whatever type corresponds to the target's 
word_mode
would be the right choice.  But then I remembered things like dbCC on 
m68k
which had a more limited range.  While I don't think m68k uses the 
doloop
bits, it's a clear example that the most desirable type may not 
correspond to

the word type for the target.

So my concern with this patch is that it introduces more target
dependencies into the gimple pipeline, which is generally considered
undesirable from a design standpoint.  Is there any way to lower from
whatever type is chosen by ivopts to the target's desired type at the
gimple->rtl border rather than doing it in ivopts?

I think that's difficult - after all we want to base other IV uses on
the doloop IV if possible.  So IMHO it's not different from IVOPTs
choosing different IVs based on RTL costing and target addressing mode
availability so I wasn't worried about those additional target
dependences at this point of the GIMPLE pipeline.

Yea, you're probably right on both accounts.   With that resolved I
think this is OK for the trunk.

Thanks for your patience Jiufu and thanks for chiming in Richi.


Thanks for all your help!

The patch was committed to r12-2585.

I noticed that I missed one guality case (gfortran.dg/guality/arg1.f90):
it went from 'pass' to 'unsupported'.  The issue can be reproduced on a
similar test case without this patch.  I just opened PR101669 for it.


BR,
Jiufu



jeff


Re: [RFC] more no-wrap conditions for IV analyzing and scev

2021-07-22 Thread guojiufu via Gcc-patches

On 2021-06-21 20:36, Richard Biener wrote:

On Mon, 21 Jun 2021, guojiufu wrote:


On 2021-06-21 14:19, guojiufu via Gcc-patches wrote:
> On 2021-06-09 19:18, guojiufu wrote:
>> On 2021-06-09 17:42, guojiufu via Gcc-patches wrote:
>>> On 2021-06-08 18:13, Richard Biener wrote:
>>>> On Fri, 4 Jun 2021, Jiufu Guo wrote:
>>>>
>>> cut...
> cut...
>>

Besides the method in the previous mails, 
I’m thinking of another way to split loops:

foo (int *a, int *b, unsigned k, unsigned n)
{   
 while (++k != n)
   a[k] = b[k] + 1;   
} 

We may split it into:
if (k < n) ...
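
A sketch of that split (my reconstruction; the original example was cut
off above), versioning the loop on the no-wrap condition:

void
foo (int *a, int *b, unsigned k, unsigned n)
{
  if (k < n)
    while (++k != n)   /* k only grows towards n: no wrap, IV is affine */
      a[k] = b[k] + 1;
  else
    while (++k != n)   /* k may wrap around UINT_MAX: keep original form */
      a[k] = b[k] + 1;
}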

That would be your original approach of versioning the loop.  I think
I suggested that for this scalar evolution and dataref analysis should
be enhanced to build up conditions under which IV evolutions are
affine (non-wrapping) and the versioning code in actual transforms
should then do the appropriate versioning (like the vectorizer already
does for niter analysis ->assumptions for example).


Hi Richi,

Thanks for your suggestion!

The original idea was to cover cases like multi-exit loops, but that
does not seem to bring much benefit.  The method you describe would
help for the common cases.


I'm thinking about how to implement this:
During scev analysis, add more wrap checking (especially for unsigned
types) in convert_affine_scev/scev_probably_wraps_p/chrec_convert_1;
introduce a no_wrap_assumption for the conditions under which a given
chrec/IV does not wrap, and use this assumption in
simple_iv_with_niters/dr_analyze_innermost.

One question is: where is the best place to add this assumption?
Would it be flexible to add no_wrap_assumption to affine_iv/loop, and
set the assumption when scev checks for wrapping?

Thanks for your suggestions!

BR.
Jiufu




Richard.


cut


Re: [PATCH V2] Use preferred mode for doloop iv [PR61837].

2021-07-15 Thread guojiufu via Gcc-patches

On 2021-07-15 14:06, Richard Biener wrote:

On Tue, 13 Jul 2021, Jiufu Guo wrote:


Major changes from v1:
* Add target hook to query preferred doloop mode.
* Recompute doloop iv base from niter under preferred mode.

Currently, the doloop.xx variable uses the same type as niter, which
may be shorter than the word size.  For some cases, it would be better
to use a word-size type.  For example, on a 64-bit system, accessing a
32-bit value may need a subreg, so a 64-bit type may be better for
niter if the value can be represented in both 32 and 64 bits.

This patch adds a target hook to query the preferred mode for the
doloop IV, and updates the doloop IV mode accordingly.
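
A sketch of the kind of loop affected (the actual pr61837.c test is
not shown in this mail, so this example is my assumption):

/* On a 64-bit target the 32-bit counter below gives an SImode doloop
   IV and leaves a zero_extend/subreg around; preferring word_mode
   (DImode) for the doloop IV avoids that.  */
void
foo (int *p, unsigned int n)
{
  for (unsigned int i = 0; i < n; i++)
    p[i] = 0;
}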

Bootstrap and regtest pass on powerpc64le, is this ok for trunk?

BR.
Jiufu

gcc/ChangeLog:

2021-07-13  Jiufu Guo  

PR target/61837
* config/rs6000/rs6000.c (TARGET_PREFERRED_DOLOOP_MODE): New hook.
(rs6000_preferred_doloop_mode): New hook.
* doc/tm.texi: Regenerated.
* doc/tm.texi.in: Add hook preferred_doloop_mode.
* target.def (preferred_doloop_mode): New hook.
* targhooks.c (default_preferred_doloop_mode): New hook.
* targhooks.h (default_preferred_doloop_mode): New hook.
* tree-ssa-loop-ivopts.c (compute_doloop_base_on_mode): New function.
(add_iv_candidate_for_doloop): Call targetm.preferred_doloop_mode
and compute_doloop_base_on_mode.

gcc/testsuite/ChangeLog:

2021-07-13  Jiufu Guo  

PR target/61837
* gcc.target/powerpc/pr61837.c: New test.
---
 gcc/config/rs6000/rs6000.c |  9 +++
 gcc/doc/tm.texi|  4 ++
 gcc/doc/tm.texi.in |  2 +
 gcc/target.def |  7 +++
 gcc/targhooks.c|  8 +++
 gcc/targhooks.h|  2 +
 gcc/testsuite/gcc.target/powerpc/pr61837.c | 16 ++
 gcc/tree-ssa-loop-ivopts.c | 66 
+-

 8 files changed, 112 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr61837.c

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 9a5db63d0ef..444f3c49288 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1700,6 +1700,9 @@ static const struct attribute_spec 
rs6000_attribute_table[] =

 #undef TARGET_DOLOOP_COST_FOR_ADDRESS
 #define TARGET_DOLOOP_COST_FOR_ADDRESS 10

+#undef TARGET_PREFERRED_DOLOOP_MODE
+#define TARGET_PREFERRED_DOLOOP_MODE rs6000_preferred_doloop_mode
+
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV 
rs6000_atomic_assign_expand_fenv


@@ -27867,6 +27870,12 @@ rs6000_predict_doloop_p (struct loop *loop)
   return true;
 }

+static machine_mode
+rs6000_preferred_doloop_mode (machine_mode)
+{
+  return word_mode;
+}
+
 /* Implement TARGET_CANNOT_SUBSTITUTE_MEM_EQUIV_P.  */

 static bool
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 2a41ae5fba1..3f5881220f8 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11984,6 +11984,10 @@ By default, the RTL loop optimizer does not 
use a present doloop pattern for

 loops containing function calls or branch on table instructions.
 @end deftypefn

+@deftypefn {Target Hook} machine_mode TARGET_PREFERRED_DOLOOP_MODE 
(machine_mode @var{mode})

+This hook returns a more preferred mode or the @var{mode} itself.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_LEGITIMATE_COMBINED_INSN 
(rtx_insn *@var{insn})
 Take an instruction in @var{insn} and return @code{false} if the 
instruction

 is not appropriate as a combination of two or more instructions.  The
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index f881cdabe9e..38215149a92 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7917,6 +7917,8 @@ to by @var{ce_info}.

 @hook TARGET_INVALID_WITHIN_DOLOOP

+@hook TARGET_PREFERRED_DOLOOP_MODE
+
 @hook TARGET_LEGITIMATE_COMBINED_INSN

 @hook TARGET_CAN_FOLLOW_JUMP
diff --git a/gcc/target.def b/gcc/target.def
index c009671c583..91a96150e50 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4454,6 +4454,13 @@ loops containing function calls or branch on 
table instructions.",

  const char *, (const rtx_insn *insn),
  default_invalid_within_doloop)

+DEFHOOK
+(preferred_doloop_mode,
+ "This hook returns a more preferred mode or the @var{mode} itself.",
+ machine_mode,
+ (machine_mode mode),
+ default_preferred_doloop_mode)
+
 /* Returns true for a legitimate combined insn.  */
 DEFHOOK
 (legitimate_combined_insn,
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 44a1facedcf..eb5190910dc 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -660,6 +660,14 @@ default_predict_doloop_p (class loop *loop 
ATTRIBUTE_UNUSED)

   return false;
 }

+/* By default, just use the input MODE itself.  */
+
+machine_mode
+default_preferred_doloop_mode (machine_mode mode)
+{
+  return mode;
+}
+
 /* NULL if INSN insn is valid within a low-overhead loop, otherwise 
returns

an error 

Re: [PATCH V2] Use preferred mode for doloop iv [PR61837].

2021-07-14 Thread guojiufu via Gcc-patches

On 2021-07-15 02:04, Segher Boessenkool wrote:

Hi!

On Wed, Jul 14, 2021 at 06:26:28PM +0800, guojiufu wrote:

PR target/61837


Wrong PR number?


There is already a patch that optimizes "add -1; zero_ext; add +1" to
"zero_ext".  Having this patch helps to avoid the remaining 'zero_ext',
so I reused that PR number.




+@deftypefn {Target Hook} machine_mode TARGET_PREFERRED_DOLOOP_MODE
(machine_mode @var{mode})
+This hook takes a @var{mode} which is the original mode of doloop IV.
+And if the target prefers other mode for doloop IV, this hook returns
the
+preferred mode.
+For example, on 64bit target, DImode may be preferred than SImode.
+This hook could return the original mode itself if the target prefer 
to

+keep the original mode.
+The origianl mode and return mode should be MODE_INT.
+@end deftypefn


(Typo, "original").  That has all the right contents, but needs someone
who is better at English than me to look at it / improve it.

+/* { dg-final {scan-rtl-dump-not "zero_extend.*doloop" "loop2_doloop"} } */
+/* { dg-final {scan-rtl-dump-not "reg:SI.*doloop" "loop2_doloop" { target lp64 } } } */


(Don't use format=flowed in your mails, or certainly not in those
containing patches -- it was rewrapped).



Oh, thanks for pointing this out!

If you use .* in scan REs, you should be aware that "." matches 
newlines

by default, so you can match "reg:SI" on one line and "doloop" on a
later one, in that second one.

You can write

/* { dg-final {scan-rtl-dump-not {(?p)reg:SI.*doloop} "loop2_doloop" {
target lp64 } } } */

(note: {} are much more convenient around most REs, you need a lot of
escaping without it) to get "partial newline-sensitive matching", which
is usually what you want (see "man re_syntax" for the details).


Thanks so much!  This helps me a lot with writing test cases, especially
on how to use the scan-xxx REs in test cases!




The generic changes look fine to me (but what do I know about Gimple!)
The rs6000 changes are fine if the rest is approved (and see the
testcase comments).  Thanks!


Thanks again!

BR,
Jiufu




Segher


Re: [PATCH v2] Analyze niter for until-wrap condition [PR101145]

2021-07-14 Thread guojiufu via Gcc-patches

Hi,

I would like to have an early ping on this with more mail addresses.

BR,
Jiufu.

On 2021-07-07 20:47, Jiufu Guo wrote:

Changes since v1:
* Update assumptions for niter, add more test cases check
* Use widest_int/wide_int instead mpz to do +-/
* Move some early check for quick return

For code like:
unsigned foo(unsigned val, unsigned start)
{
  unsigned cnt = 0;
  for (unsigned i = start; i > val; ++i)
cnt++;
  return cnt;
}

The number of iterations should be about UINT_MAX - start.

There is a function adjust_cond_for_loop_until_wrap which handles
similar work for constant bases.
Like adjust_cond_for_loop_until_wrap, this patch enhances
number_of_iterations_cond/number_of_iterations_lt to analyze the
number of iterations for this kind of loop.

Bootstrap and regtest pass on powerpc64le, x86_64 and aarch64.
Is this ok for trunk?

gcc/ChangeLog:

2021-07-07  Jiufu Guo  

PR tree-optimization/101145
* tree-ssa-loop-niter.c (number_of_iterations_until_wrap):
New function.
(number_of_iterations_lt): Invoke above function.
(adjust_cond_for_loop_until_wrap):
Merge to number_of_iterations_until_wrap.
(number_of_iterations_cond): Update invokes for
adjust_cond_for_loop_until_wrap and number_of_iterations_lt.

gcc/testsuite/ChangeLog:

2021-07-07  Jiufu Guo  

PR tree-optimization/101145
* gcc.dg/vect/pr101145.c: New test.
* gcc.dg/vect/pr101145.inc: New test.
* gcc.dg/vect/pr101145_1.c: New test.
* gcc.dg/vect/pr101145_2.c: New test.
* gcc.dg/vect/pr101145_3.c: New test.
* gcc.dg/vect/pr101145inf.c: New test.
* gcc.dg/vect/pr101145inf.inc: New test.
* gcc.dg/vect/pr101145inf_1.c: New test.
---
 gcc/testsuite/gcc.dg/vect/pr101145.c  | 187 ++
 gcc/testsuite/gcc.dg/vect/pr101145.inc|  63 
 gcc/testsuite/gcc.dg/vect/pr101145_1.c|  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_2.c|  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_3.c|  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145inf.c   |  25 +++
 gcc/testsuite/gcc.dg/vect/pr101145inf.inc |  28 
 gcc/testsuite/gcc.dg/vect/pr101145inf_1.c |  23 +++
 gcc/tree-ssa-loop-niter.c | 157 ++
 9 files changed, 463 insertions(+), 65 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf.inc
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf_1.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.c
b/gcc/testsuite/gcc.dg/vect/pr101145.c
new file mode 100644
index 000..74031b031cf
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr101145.c
@@ -0,0 +1,187 @@
+/* { dg-require-effective-target vect_int } */
+/* { dg-options "-O3 -fdump-tree-vect-details" } */
+#include 
+
+unsigned __attribute__ ((noinline))
+foo (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned)
+{
+  while (UINT_MAX - 64 < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  l = UINT_MAX - 32;
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_3 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  while (n <= ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_4 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{  // infinite
+  while (0 <= ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_5 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  //no loop
+  l = UINT_MAX;
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  while (--l < n)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned)
+{
+  while (--l < 64)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  l = 32;
+  while (--l < n)
+*a++ = *b++ + 1;
+  return l;
+}
+
+
+int a[3200], b[3200];
+int fail;
+
+int
+main ()
+{
+  unsigned l, n;
+  unsigned res;
+  /* l > n*/
+  n = UINT_MAX - 64;
+  l = n + 32;
+  res = foo (a, b, l, n);
+  if (res != 0)
+fail++;
+
+  l = n;

Re: [PATCH V2] Use preferred mode for doloop iv [PR61837].

2021-07-14 Thread guojiufu via Gcc-patches

On 2021-07-14 12:40, guojiufu via Gcc-patches wrote:
Updated the patch as below.
Thanks for the comments.

gcc/ChangeLog:

2021-07-13  Jiufu Guo  

PR target/61837
* config/rs6000/rs6000.c (TARGET_PREFERRED_DOLOOP_MODE): New hook.
(rs6000_preferred_doloop_mode): New hook.
* doc/tm.texi: Regenerate.
* doc/tm.texi.in: Add hook preferred_doloop_mode.
* target.def (preferred_doloop_mode): New hook.
* targhooks.c (default_preferred_doloop_mode): New hook.
* targhooks.h (default_preferred_doloop_mode): New hook.
* tree-ssa-loop-ivopts.c (compute_doloop_base_on_mode): New function.
(add_iv_candidate_for_doloop): Call targetm.preferred_doloop_mode
and compute_doloop_base_on_mode.

gcc/testsuite/ChangeLog:

2021-07-13  Jiufu Guo  

PR target/61837
* gcc.target/powerpc/pr61837.c: New test.
---
 gcc/config/rs6000/rs6000.c | 11 
 gcc/doc/tm.texi| 10 
 gcc/doc/tm.texi.in |  2 +
 gcc/target.def | 14 +
 gcc/targhooks.c|  8 +++
 gcc/targhooks.h|  1 +
 gcc/testsuite/gcc.target/powerpc/pr61837.c | 20 +++
 gcc/tree-ssa-loop-ivopts.c | 67 +-
 8 files changed, 131 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr61837.c

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 9a5db63d0ef..3bdf0cb97a3 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1700,6 +1700,9 @@ static const struct attribute_spec rs6000_attribute_table[] =

 #undef TARGET_DOLOOP_COST_FOR_ADDRESS
 #define TARGET_DOLOOP_COST_FOR_ADDRESS 10

+#undef TARGET_PREFERRED_DOLOOP_MODE
+#define TARGET_PREFERRED_DOLOOP_MODE rs6000_preferred_doloop_mode
+
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv


@@ -27867,6 +27870,14 @@ rs6000_predict_doloop_p (struct loop *loop)
   return true;
 }

+/* Implement TARGET_PREFERRED_DOLOOP_MODE. */
+
+static machine_mode
+rs6000_preferred_doloop_mode (machine_mode)
+{
+  return word_mode;
+}
+
 /* Implement TARGET_CANNOT_SUBSTITUTE_MEM_EQUIV_P.  */

 static bool
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 2a41ae5fba1..4fb516169dc 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11984,6 +11984,16 @@ By default, the RTL loop optimizer does not use a present doloop pattern for
 loops containing function calls or branch on table instructions.
 @end deftypefn

+@deftypefn {Target Hook} machine_mode TARGET_PREFERRED_DOLOOP_MODE (machine_mode @var{mode})
+This hook takes a @var{mode} which is the original mode of the doloop IV.
+If the target prefers another mode for the doloop IV, this hook returns the
+preferred mode.
+For example, on a 64-bit target, DImode may be preferred over SImode.
+This hook may return the original mode itself if the target prefers to
+keep the original mode.
+The original mode and returned mode should be MODE_INT.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_LEGITIMATE_COMBINED_INSN (rtx_insn *@var{insn})
 Take an instruction in @var{insn} and return @code{false} if the instruction
 is not appropriate as a combination of two or more instructions.  The
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index f881cdabe9e..38215149a92 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7917,6 +7917,8 @@ to by @var{ce_info}.

 @hook TARGET_INVALID_WITHIN_DOLOOP

+@hook TARGET_PREFERRED_DOLOOP_MODE
+
 @hook TARGET_LEGITIMATE_COMBINED_INSN

 @hook TARGET_CAN_FOLLOW_JUMP
diff --git a/gcc/target.def b/gcc/target.def
index c009671c583..1b6c9872807 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4454,6 +4454,20 @@ loops containing function calls or branch on table instructions.",

  const char *, (const rtx_insn *insn),
  default_invalid_within_doloop)

+/* Returns the machine mode which the target prefers for doloop IV.  */
+DEFHOOK
+(preferred_doloop_mode,
+ "This hook takes a @var{mode} which is the original mode of doloop 
IV.\n\
+And if the target prefers another mode for doloop IV, this hook returns 
the\n\

+preferred mode.\n\
+For example, on 64bit target, DImode may be preferred than SImode.\n\
+This hook could return the original mode itself if the target prefer 
to\n\

+keep the original mode.\n\
+The original mode and return mode should be MODE_INT.",
+ machine_mode,
+ (machine_mode mode),
+ default_preferred_doloop_mode)
+
 /* Returns true for a legitimate combined insn.  */
 DEFHOOK
 (legitimate_combined_insn,
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 44a1facedcf..eb5190910dc 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -660,6 +660,14 @@ default_predict_doloop_p (class loop *loop ATTRIBUTE_UNUSED)

   return false;
 }

+/* By default, just use the input MODE itself.  */

Re: [PATCH V2] Use preferred mode for doloop iv [PR61837].

2021-07-13 Thread guojiufu via Gcc-patches

On 2021-07-14 04:50, Segher Boessenkool wrote:

Hi!

On Tue, Jul 13, 2021 at 08:50:46PM +0800, Jiufu Guo wrote:

* doc/tm.texi: Regenerated.


Pet peeve: "Regenerate.", no "d".


Ok, yes.  Though both 'Regenerate' and 'Regenerated' have been used by
commits somewhere :)





+DEFHOOK
+(preferred_doloop_mode,
+ "This hook returns a more preferred mode or the @var{mode} itself.",
+ machine_mode,
+ (machine_mode mode),
+ default_preferred_doloop_mode)


You need a bit more description here.  What does the value it returns
mean?  If you want to say "a more preferred mode or the mode itself",
you should explain what the difference means, too.


Ok, thanks.



You also should say the hook does not need to test if things will fit,
since the generic code already does.

And say this should return a MODE_INT always -- you never test for that
as far as I can see, but you don't need to, as long as everyone does the
sane thing.  So just state every hook implementation should :-)


Yes, the preferred 'doloop iv mode' from targets should be a MODE_INT.
I will add comments, and update the gcc_assert you mentioned below
for this.

Thanks a lot for your comments and suggestions!

When writing documentation, I keep adding/deleting words and still find
it hard to get them perfect -:(




+extern machine_mode
+default_preferred_doloop_mode (machine_mode);


One line please (this is a declaration).


+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+void foo(int *p1, long *p2, int s)
+{
+  int n, v, i;
+
+  v = 0;
+  for (n = 0; n <= 100; n++) {
+ for (i = 0; i < s; i++)
+if (p2[i] == n)
+   p1[i] = v;
+ v += 88;
+  }
+}
+
+/* { dg-final { scan-assembler-not {\mrldicl\M} } } */


That is a pretty fragile thing to test for.  It also needs a line or two
of comment in the test case saying what it does, and what kind of thing it
does not want to see.


Thanks! I will update accordingly.  And I'm thinking of adding tests to
check the doloop.xx type: no zero_extend to access a subreg.  That is the
intention of this patch.




+/* If PREFERRED_MODE is suitable and profitable, use the preferred
+   PREFERRED_MODE to compute doloop iv base from niter: base = niter + 1.  */
+
+static tree
+compute_doloop_base_on_mode (machine_mode preferred_mode, tree niter,
+			     const widest_int &iter_max)
+{
+  tree ntype = TREE_TYPE (niter);
+  tree pref_type = lang_hooks.types.type_for_mode (preferred_mode, 1);
+
+  gcc_assert (pref_type && TYPE_UNSIGNED (ntype));


Should that be pref_type instead of ntype?  If not, write it as two
separate asserts please.


Ok, will separate as two asserts.




+static machine_mode
+rs6000_preferred_doloop_mode (machine_mode)
+{
+  return word_mode;
+}


This is fine if the generic code does the right thing if it passes say
TImode here, and if it never will pass some other mode class mode.


The generic code checks whether the returned mode works for the doloop IV
correctly; if the preferred mode is not suitable (e.g. preferred_doloop_mode
returns DImode, but niter is a large value in TImode), then the doloop.xx IV
will use the original mode.


When a target really prefers TImode, and TImode can represent the number of
iterations, this would still work.  In the current code, word_mode is
SImode/DImode on most targets, like Pmode.

On powerpc, they are DImode (for 64-bit) / SImode (for 32-bit).

Thanks again for your comments!

BR,
Jiufu




Segher


Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

2021-07-13 Thread guojiufu via Gcc-patches

On 2021-07-13 23:38, Segher Boessenkool wrote:

On Mon, Jul 12, 2021 at 08:20:14AM +0200, Richard Biener wrote:

On Fri, 9 Jul 2021, Segher Boessenkool wrote:
> Almost all targets just use Pmode, but there is no such guarantee I
> think, and esp. some targets that do not have machine insns for this
> (but want to generate different code for this anyway) can do pretty much
> anything.
>
> Maybe using just Pmode here is good enough though?

I think Pmode is a particularly bad choice and I'd prefer word_mode
if we go for any hardcoded mode.


In many important cases you use a pointer as iteration variable.

Is word_mode the register size on most current targets?
From a search of the implementation, word_mode is the MODE_INT mode whose
size is BITS_PER_WORD.  Actually, when targets define Pmode and
BITS_PER_WORD, these two macros are aligned -:), and it seems most targets
define both of these two macros.





s390x for example seems to handle
both SImode and DImode (but names the helper gen_doloop_si64
for SImode?!).


Yes, so Pmode will work fine for 390.  It would be nice if we could
allow multiple modes here, certainly.  Can we?


:), for other IVs, multiple modes are allowed as candidates,
while only one doloop IV is added.  Compared with supporting more
doloop IVs, changing the doloop IV's mode seems relatively easy
to me.  So, the patch updates the doloop IV.




But indeed it looks like somehow querying doloop_end
is going to be difficult since the expander doesn't have any mode,
so we'd have to actually try emit RTL here.


Or add a well-designed target macro for this.  "Which modes do we like
for IVs", perhaps?


In the new patch, a target hook preferred_doloop_mode is introduced,
though this hook is only for the doloop IV at this time.
Maybe we could have a preferred_iv_mode if needed.  In the current code,
IVs are free to be added in different types, and the cost model is applied
to determine which IV may be better.  The IV mode would be one factor in
the cost.



BR,
Jiufu




Segher


Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

2021-07-13 Thread guojiufu via Gcc-patches

On 2021-07-13 23:51, Segher Boessenkool wrote:

On Tue, Jul 13, 2021 at 10:09:25AM +0800, guojiufu wrote:

>For loop looks like:
>  do ;
>  while (n-- > 0); /* while  (n-- > low); */


(This whole loop as written will be optimised away, but :-) )

At -O2, the loop is optimized away.
At -O1, the loop is there.
.cfi_startproc
addi %r3,%r3,1
.L2:
addi %r9,%r3,-1
mr %r3,%r9
andi. %r9,%r9,0xff
bne %cr0,.L2
The v2 patch
(https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574596.html)
could turn it into:
.cfi_startproc
addi %r3,%r3,1
mtctr %r3
.L2:
addi %r3,%r3,-1
bdnz .L2




There is a patch that could mitigate the "-1 +1" pair in the RTL part:
https://gcc.gnu.org/g:8a15faa730f99100f6f3ed12663563356ec5a2c0


Does that solve PR67288 (and its many duplicates)?

I ran a test: yes, the "-1 +1" issue in PR67288 is fixed by that patch.

BR,
Jiufu.



Segher


Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

2021-07-13 Thread guojiufu via Gcc-patches

On 2021-07-13 15:09, Richard Biener wrote:

On Tue, 13 Jul 2021, guojiufu wrote:


On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:
> On 2021-07-12 22:46, Richard Biener wrote:
>> On Mon, 12 Jul 2021, guojiufu wrote:
>>
>>> On 2021-07-12 18:02, Richard Biener wrote:
>>> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >
>>> >> On 2021-07-12 16:57, Richard Biener wrote:
>>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >> >
>>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
>>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>>> >> >> >
>>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>>> >> >> >> > I wonder if there's a way to query the target what modes the
>>> >> >> >> > doloop
>>> >> >> >> > pattern can handle (not being too familiar with the doloop
>>> >> >> >> > code).
>>> >> >> >>
>>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>>> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so
>>> >> >> >> it
>>> >> >> >> is
>>> >> >> >> hard to introspect.
>>> >> >> >>
>>> >> >> >> > Why do you need to do any checks besides the new type being
>>> >> >> >> > able to
>>> >> >> >> > represent all IV values?  The original doloop IV will never
>>> >> >> >> > wrap
>>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>>> >> >> >> > become
>>> >> >> >> > zero ... I suppose the doloop might still do the correct thing
>>> >> >> >> > here
>>> >> >> >> > but it also still will with a IV with larger type).
>>> >> >>
>>> >> >> The issue comes from U*_MAX (original short MAX), as you said: on
>>> >> >> which
>>> >> >> niter + 1 becomes zero.  And because the step for doloop is -1;
>>> >> >> then, on
>>> >> >> larger type 'zero - 1' will be a very large number on larger type
>>> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a
>>> >> >> small
>>> >> >> value
>>> >> >> (e.g. "0xff").
>>> >> >
>>> >> > But for the larger type the small type MAX + 1 fits and does not
>>> >> > yield
>>> >> > zero so it should still work exactly as before, no?  Of course you
>>> >> > have to compute the + 1 in the larger type.
>>> >> >
>>> >> You are right, if compute the "+ 1" in the larger type it is ok, as
>>> >> below
>>> >> code:
>>> >> ```
>>> >>/* Use type in word size may fast.  */
>>> >> if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>>> >>   {
>>> >> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >> niter = fold_convert (ntype, niter);
>>> >>   }
>>> >>
>>> >> tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>> >>  build_int_cst (ntype, 1));
>>> >>
>>> >>
>>> >> add_candidate (data, base, build_int_cst (ntype, -1), true, NULL,
>>> >> NULL,
>>> >> true);
>>> >> ```
>>> >> The issue of this is, this code generates more stmt for doloop.xxx:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   _10 = _12 + 4294967295;
>>> >>   _24 = (long unsigned int) _10;
>>> >>   doloop.6_8 = _24 + 1;
>>> >>
>>> >> if use previous patch, "+ 1" on original type, then the stmts will
>>> >> looks
>>> >> like:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   doloop.6_8 = (long unsigned int) _12;
>>> >>
>>> >> This is the reason for checking
>>> >>wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>>> >
>>&

Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

2021-07-12 Thread guojiufu via Gcc-patches

On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:

On 2021-07-12 22:46, Richard Biener wrote:

On Mon, 12 Jul 2021, guojiufu wrote:


On 2021-07-12 18:02, Richard Biener wrote:
> On Mon, 12 Jul 2021, guojiufu wrote:
>
>> On 2021-07-12 16:57, Richard Biener wrote:
>> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >
>> >> On 2021-07-12 14:20, Richard Biener wrote:
>> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >> >
>> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >> >> > I wonder if there's a way to query the target what modes the doloop
>> >> >> > pattern can handle (not being too familiar with the doloop code).
>> >> >>
>> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it
>> >> >> is
>> >> >> hard to introspect.
>> >> >>
>> >> >> > Why do you need to do any checks besides the new type being able to
>> >> >> > represent all IV values?  The original doloop IV will never wrap
>> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>> >> >> > become
>> >> >> > zero ... I suppose the doloop might still do the correct thing here
>> >> >> > but it also still will with a IV with larger type).
>> >>
>> >> The issue comes from U*_MAX (original short MAX), as you said: on which
>> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
>> >> larger type 'zero - 1' will be a very large number on larger type
>> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>> >> value
>> >> (e.g. "0xff").
>> >
>> > But for the larger type the small type MAX + 1 fits and does not yield
>> > zero so it should still work exactly as before, no?  Of course you
>> > have to compute the + 1 in the larger type.
>> >
>> You are right, if compute the "+ 1" in the larger type it is ok, as below
>> code:
>> ```
>>/* Use type in word size may fast.  */
>> if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>>   {
>> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> niter = fold_convert (ntype, niter);
>>   }
>>
>> tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>  build_int_cst (ntype, 1));
>>
>>
>> add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
>> true);
>> ```
>> The issue of this is, this code generates more stmt for doloop.xxx:
>>   _12 = (unsigned int) xx(D);
>>   _10 = _12 + 4294967295;
>>   _24 = (long unsigned int) _10;
>>   doloop.6_8 = _24 + 1;
>>
>> if use previous patch, "+ 1" on original type, then the stmts will looks
>> like:
>>   _12 = (unsigned int) xx(D);
>>   doloop.6_8 = (long unsigned int) _12;
>>
>> This is the reason for checking
>>wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>
> But this then only works when there's an upper bound on the number
> of iterations.  Note you should not use TYPE_MAX_VALUE here but
> you can instead use
>
>  wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
> (TYPE_PRECISION (ntype), TYPE_SIGN (ntype;

Ok, Thanks!
I remember you mentioned that:
widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN (ntype)),
                  TYPE_SIGN (ntype))
would be better than
wi::to_widest (TYPE_MAX_VALUE (ntype)).

It seems that:
"TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK (NODE)->type_non_common.maxval",
which does a numerical check and returns the maxval field, and then calls
wi::to_widest.

The other code, "widest_int::from (wi::max_value (..,..),..)", calls
wi::max_value and widest_int::from.

I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?


TYPE_MAX_VALUE can be "surprising"; it does not necessarily match the
underlying modes precision.  At some point we've tried to eliminate
most of its uses, not sure what the situation/position is right now.

Ok, get it, thanks.
I will use "widest_int::from (wi::max_value (..,..),..)".




> I think the -1 above comes from number of latch iterations vs. header
> entries - it's a common source for this kind of issues.  range analysis
> might be able to prove that we can still merge the two adds even with
> 

Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

2021-07-12 Thread guojiufu via Gcc-patches

On 2021-07-12 22:46, Richard Biener wrote:

On Mon, 12 Jul 2021, guojiufu wrote:


On 2021-07-12 18:02, Richard Biener wrote:
> On Mon, 12 Jul 2021, guojiufu wrote:
>
>> On 2021-07-12 16:57, Richard Biener wrote:
>> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >
>> >> On 2021-07-12 14:20, Richard Biener wrote:
>> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >> >
>> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >> >> > I wonder if there's a way to query the target what modes the doloop
>> >> >> > pattern can handle (not being too familiar with the doloop code).
>> >> >>
>> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it
>> >> >> is
>> >> >> hard to introspect.
>> >> >>
>> >> >> > Why do you need to do any checks besides the new type being able to
>> >> >> > represent all IV values?  The original doloop IV will never wrap
>> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>> >> >> > become
>> >> >> > zero ... I suppose the doloop might still do the correct thing here
>> >> >> > but it also still will with a IV with larger type).
>> >>
>> >> The issue comes from U*_MAX (original short MAX), as you said: on which
>> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
>> >> larger type 'zero - 1' will be a very large number on larger type
>> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>> >> value
>> >> (e.g. "0xff").
>> >
>> > But for the larger type the small type MAX + 1 fits and does not yield
>> > zero so it should still work exactly as before, no?  Of course you
>> > have to compute the + 1 in the larger type.
>> >
>> You are right, if compute the "+ 1" in the larger type it is ok, as below
>> code:
>> ```
>>/* Use type in word size may fast.  */
>> if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>>   {
>> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> niter = fold_convert (ntype, niter);
>>   }
>>
>> tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>  build_int_cst (ntype, 1));
>>
>>
>> add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
>> true);
>> ```
>> The issue of this is, this code generates more stmt for doloop.xxx:
>>   _12 = (unsigned int) xx(D);
>>   _10 = _12 + 4294967295;
>>   _24 = (long unsigned int) _10;
>>   doloop.6_8 = _24 + 1;
>>
>> if use previous patch, "+ 1" on original type, then the stmts will looks
>> like:
>>   _12 = (unsigned int) xx(D);
>>   doloop.6_8 = (long unsigned int) _12;
>>
>> This is the reason for checking
>>wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>
> But this then only works when there's an upper bound on the number
> of iterations.  Note you should not use TYPE_MAX_VALUE here but
> you can instead use
>
>  wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
> (TYPE_PRECISION (ntype), TYPE_SIGN (ntype;

Ok, Thanks!
I remember you mentioned that:
widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN (ntype)),
                  TYPE_SIGN (ntype))
would be better than
wi::to_widest (TYPE_MAX_VALUE (ntype)).

It seems that:
"TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK (NODE)->type_non_common.maxval",
which does a numerical check and returns the maxval field, and then calls
wi::to_widest.

The other code, "widest_int::from (wi::max_value (..,..),..)", calls
wi::max_value and widest_int::from.

I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?


TYPE_MAX_VALUE can be "surprising"; it does not necessarily match the
underlying modes precision.  At some point we've tried to eliminate
most of its uses, not sure what the situation/position is right now.

Ok, get it, thanks.
I will use "widest_int::from (wi::max_value (..,..),..)".




> I think the -1 above comes from number of latch iterations vs. header
> entries - it's a common source for this kind of issues.  range analysis
> might be able to prove that we can still merge the two adds even with
> the intermediate extension.
Yes, as you mentioned here, it relates to the number of latch iterations.
For loops that look like: while (l < n) or for (i = 0; i < n; i++),
the niter is usually 'n - 1' after the loop is transformed
into 'do-while' form.

For this kind of loop, the max value of the number of iterations, "n - 1",
would be "max_value_type(n) - 1", which is wi::ltu than max_value_type.
This kind of loop is already common, and we could use
wi::ltu (max, max_value_type) to check.

For a loop that looks like:
  do ;
  while (n-- > 0); /* while  (n-- > low); */

the niter_desc->max will be wi::eq to max_value_type, niter would be "n",
and then doloop.xx is 'n+1'.

I would look at how to merge these two adds safely at this point
when generating the doloop IV (maybe with range info).  Thanks!
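
The wrap itself is easy to see in a small standalone example (illustrative
only, not from the patch): when n is the narrow type's maximum, 'n + 1' is
zero in the narrow type but exact in a wider one, which is why the '+ 1'
for doloop.xx must be computed in the wider type in this case.

```
#include <limits.h>
#include <stdio.h>

int
main (void)
{
  unsigned n = UINT_MAX;        /* the niter_desc->max == type max case */
  unsigned narrow = n + 1;      /* wraps to 0 in the narrow type */
  unsigned long long wide
    = (unsigned long long) n + 1; /* exact: 4294967296 with 32-bit int */
  printf ("narrow: %u, wide: %llu\n", narrow, wide);
  return 0;
}
```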

>
> Is this pre-loop extra add really offsetting the in-loop 

Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

2021-07-12 Thread guojiufu via Gcc-patches

On 2021-07-12 18:02, Richard Biener wrote:

On Mon, 12 Jul 2021, guojiufu wrote:


On 2021-07-12 16:57, Richard Biener wrote:
> On Mon, 12 Jul 2021, guojiufu wrote:
>
>> On 2021-07-12 14:20, Richard Biener wrote:
>> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >
>> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >> > I wonder if there's a way to query the target what modes the doloop
>> >> > pattern can handle (not being too familiar with the doloop code).
>> >>
>> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >> perhaps?  Although that is a define_expand, not a define_insn, so it is
>> >> hard to introspect.
>> >>
>> >> > Why do you need to do any checks besides the new type being able to
>> >> > represent all IV values?  The original doloop IV will never wrap
>> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will become
>> >> > zero ... I suppose the doloop might still do the correct thing here
>> >> > but it also still will with a IV with larger type).
>>
>> The issue comes from U*_MAX (original short MAX), as you said: on which
>> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
>> larger type 'zero - 1' will be a very large number on larger type
>> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>> value
>> (e.g. "0xff").
>
> But for the larger type the small type MAX + 1 fits and does not yield
> zero so it should still work exactly as before, no?  Of course you
> have to compute the + 1 in the larger type.
>
You are right; if the "+ 1" is computed in the larger type it is OK, as
in the code below:
```
/* Using a type of word size may be faster.  */
if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
  {
    ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
    niter = fold_convert (ntype, niter);
  }

tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
                         build_int_cst (ntype, 1));

add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
               true);
```
The issue with this is that it generates more stmts for doloop.xxx:
  _12 = (unsigned int) xx(D);
  _10 = _12 + 4294967295;
  _24 = (long unsigned int) _10;
  doloop.6_8 = _24 + 1;

With the previous patch ("+ 1" in the original type), the stmts look like:
  _12 = (unsigned int) xx(D);
  doloop.6_8 = (long unsigned int) _12;
This is the reason for checking
   wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
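
A compact sketch of the idea (an illustration only, assuming niter and
niter_desc are in scope as in add_iv_candidate_for_doloop; not the final
patch):

```
tree ntype = TREE_TYPE (niter);
if (TYPE_PRECISION (ntype) < BITS_PER_WORD
    && wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype))))
  {
    /* The recorded upper bound shows "+ 1" cannot wrap in the original
       type, so add first ...  */
    niter = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
                         build_int_cst (ntype, 1));
    /* ... and only then widen; this folds to a single conversion.  */
    ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
    niter = fold_convert (ntype, niter);
  }
```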


But this then only works when there's an upper bound on the number
of iterations.  Note you should not use TYPE_MAX_VALUE here but
you can instead use

 wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
(TYPE_PRECISION (ntype), TYPE_SIGN (ntype;


Ok, Thanks!
I remember you mentioned that:
widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN (ntype)),
                  TYPE_SIGN (ntype))
would be better than
wi::to_widest (TYPE_MAX_VALUE (ntype)).

It seems that:
"TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK (NODE)->type_non_common.maxval",
which does a numerical check and returns the maxval field, and then calls
wi::to_widest.

The other code, "widest_int::from (wi::max_value (..,..),..)", calls
wi::max_value and widest_int::from.

I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?



I think the -1 above comes from number of latch iterations vs. header
entries - it's a common source for this kind of issues.  range analysis
might be able to prove that we can still merge the two adds even with
the intermediate extension.

Yes, as you mentioned here, it relates to the number of latch iterations.
For loops that look like: while (l < n) or for (i = 0; i < n; i++),
the niter is usually 'n - 1' after the loop is transformed
into 'do-while' form.
I would look at how to merge these two adds safely at this point
when generating the doloop IV (maybe with range info).  Thanks!



Is this pre-loop extra add really offsetting the in-loop doloop
improvements?

I'm not quite catching this question, sorry.  I guess your concern
is whether the "+1" is an offset: it may not be; the "+1" may just be
because doloop.xx decrements niter down to 0 (all numbers > 0).
If I misunderstand, thanks for pointing it out.




>> >>
>> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>> >>
>> >> > I'd have expected sth like
>> >> >
>> >> >ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>> >> > (ntype));
>> >> >
>> >> > thus the decision made using a mode - which is also why I wonder
>> >> > if there's a way to query the target for this.  As you say,
>> >> > it _may_ be fast, so better check (somehow).
>>
>>
>> I was also thinking of using hooks like type_for_size/type_for_mode.
>> /* Use type in word size may fast.  */
>> if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>> && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>> (ntype
>>   {
>> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);

Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

2021-07-12 Thread guojiufu via Gcc-patches

On 2021-07-12 16:57, Richard Biener wrote:

On Mon, 12 Jul 2021, guojiufu wrote:


On 2021-07-12 14:20, Richard Biener wrote:
> On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>
>> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> > I wonder if there's a way to query the target what modes the doloop
>> > pattern can handle (not being too familiar with the doloop code).
>>
>> You can look what modes are allowed for operand 0 of doloop_end,
>> perhaps?  Although that is a define_expand, not a define_insn, so it is
>> hard to introspect.
>>
>> > Why do you need to do any checks besides the new type being able to
>> > represent all IV values?  The original doloop IV will never wrap
>> > (OTOH if niter is U*_MAX then we compute niter + 1 which will become
>> > zero ... I suppose the doloop might still do the correct thing here
>> > but it also still will with a IV with larger type).

The issue comes from U*_MAX (the original short type's MAX), as you said:
there niter + 1 becomes zero.  And because the step for doloop is -1,
in the larger type 'zero - 1' will be a very large number
(e.g. 0xff...ff), while in the original short type 'zero - 1' is a small
value (e.g. "0xff").


But for the larger type the small type MAX + 1 fits and does not yield
zero so it should still work exactly as before, no?  Of course you
have to compute the + 1 in the larger type.

You are right; if the "+ 1" is computed in the larger type it is OK, as
in the code below:

```
/* Using a type of word size may be faster.  */
if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
  {
    ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
    niter = fold_convert (ntype, niter);
  }

tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
                         build_int_cst (ntype, 1));

add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
               true);
```
The issue with this is that it generates more stmts for doloop.xxx:
  _12 = (unsigned int) xx(D);
  _10 = _12 + 4294967295;
  _24 = (long unsigned int) _10;
  doloop.6_8 = _24 + 1;

With the previous patch ("+ 1" in the original type), the stmts look like:
  _12 = (unsigned int) xx(D);
  doloop.6_8 = (long unsigned int) _12;

This is the reason for checking
   wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))


>>
>> doloop_valid_p guarantees it is simple and doesn't wrap.
>>
>> > I'd have expected sth like
>> >
>> >ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>> > (ntype));
>> >
>> > thus the decision made using a mode - which is also why I wonder
>> > if there's a way to query the target for this.  As you say,
>> > it _may_ be fast, so better check (somehow).


I was also thinking of using hooks like type_for_size/type_for_mode.
/* Using a type of word size may be faster.  */
if (TYPE_PRECISION (ntype) < BITS_PER_WORD
    && wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype))))
  {
    ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
    base = fold_convert (ntype, base);
  }

As you pointed out, this does not query the mode from targets.
As Segher pointed out, "doloop_end" checks for unsupported modes, but it
seems not easy to use it in tree-ssa-loop-ivopts.c.
Among implementations of doloop_end, targets like rs6000/aarch64/ia64
require Pmode/DImode, while other targets work on other modes (e.g.
SImode).


In doloop_optimize, there is code:

```
mode = desc->mode;
.
doloop_reg = gen_reg_rtx (mode);
rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);

word_mode_size = GET_MODE_PRECISION (word_mode);
word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
if (! doloop_seq
    && mode != word_mode
    /* Before trying mode different from the one in that # of iterations is
       computed, we must be sure that the number of iterations fits into
       the new mode.  */
    && (word_mode_size >= GET_MODE_PRECISION (mode)
        || wi::leu_p (iterations_max, word_mode_max)))
  {
    if (word_mode_size > GET_MODE_PRECISION (mode))
      count = simplify_gen_unary (ZERO_EXTEND, word_mode, count, mode);
    else
      count = lowpart_subreg (word_mode, count, mode);
    PUT_MODE (doloop_reg, word_mode);
    doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
  }
if (! doloop_seq)
  {
    if (dump_file)
      fprintf (dump_file,
               "Doloop: Target unwilling to use doloop pattern!\n");
    return false;
  }
```
The above code first tries the mode of niter_desc by calling
targetm.gen_doloop_end to see if the target can generate doloop insns;
if that fails, it then tries 'word_mode' against gen_doloop_end.


>>
>> Almost all targets just use Pmode, but there is no such guarantee I
>> think, and esp. some targets that do not have machine insns for this
>> 

Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

2021-07-12 Thread guojiufu via Gcc-patches

On 2021-07-12 14:20, Richard Biener wrote:

On Fri, 9 Jul 2021, Segher Boessenkool wrote:


On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
> I wonder if there's a way to query the target what modes the doloop
> pattern can handle (not being too familiar with the doloop code).

You can look what modes are allowed for operand 0 of doloop_end,
perhaps?  Although that is a define_expand, not a define_insn, so it 
is

hard to introspect.

> Why do you need to do any checks besides the new type being able to
> represent all IV values?  The original doloop IV will never wrap
> (OTOH if niter is U*_MAX then we compute niter + 1 which will become
> zero ... I suppose the doloop might still do the correct thing here
> but it also still will with a IV with larger type).


The issue comes from U*_MAX (the original short type's MAX), as you said:
there niter + 1 becomes zero.  And because the step for doloop is -1,
in the larger type 'zero - 1' will be a very large number
(e.g. 0xff...ff), while in the original short type 'zero - 1' is a small
value (e.g. "0xff").



doloop_valid_p guarantees it is simple and doesn't wrap.

> I'd have expected sth like
>
>ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
> (ntype));
>
> thus the decision made using a mode - which is also why I wonder
> if there's a way to query the target for this.  As you say,
> it _may_ be fast, so better check (somehow).



I was also thinking of using hooks like type_for_size/type_for_mode.
/* Using a type of word size may be faster.  */
if (TYPE_PRECISION (ntype) < BITS_PER_WORD
    && wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype))))
  {
    ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
    base = fold_convert (ntype, base);
  }

As you pointed out, this does not query the mode from targets.
As Segher pointed out, "doloop_end" checks for unsupported modes, but it
seems not easy to use it in tree-ssa-loop-ivopts.c.
Among implementations of doloop_end, targets like rs6000/aarch64/ia64
require Pmode/DImode, while other targets work on other modes (e.g.
SImode).



In doloop_optimize, there is code:

```
mode = desc->mode;
.
doloop_reg = gen_reg_rtx (mode);
rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);

word_mode_size = GET_MODE_PRECISION (word_mode);
word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
if (! doloop_seq
    && mode != word_mode
    /* Before trying mode different from the one in that # of iterations is
       computed, we must be sure that the number of iterations fits into
       the new mode.  */
    && (word_mode_size >= GET_MODE_PRECISION (mode)
        || wi::leu_p (iterations_max, word_mode_max)))
  {
    if (word_mode_size > GET_MODE_PRECISION (mode))
      count = simplify_gen_unary (ZERO_EXTEND, word_mode, count, mode);
    else
      count = lowpart_subreg (word_mode, count, mode);
    PUT_MODE (doloop_reg, word_mode);
    doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
  }
if (! doloop_seq)
  {
    if (dump_file)
      fprintf (dump_file,
               "Doloop: Target unwilling to use doloop pattern!\n");
    return false;
  }
```
The above code first tries the mode of niter_desc by calling
targetm.gen_doloop_end to see if the target can generate doloop insns;
if that fails, it then tries 'word_mode' against gen_doloop_end.




Almost all targets just use Pmode, but there is no such guarantee I
think, and esp. some targets that do not have machine insns for this
(but want to generate different code for this anyway) can do pretty much
anything.

Maybe using just Pmode here is good enough though?


I think Pmode is a particularly bad choice and I'd prefer word_mode
if we go for any hardcoded mode.  s390x for example seems to handle
both SImode and DImode (but names the helper gen_doloop_si64
for SImode?!).  But indeed it looks like somehow querying doloop_end
is going to be difficult since the expander doesn't have any mode,
so we'd have to actually try emit RTL here.


Instead of using a hardcoded mode, maybe we could add a hook for targets
to return the preferred mode.


Thanks for those valuable comments!

Jiufu Guo





Richard.


Re: Ping: [PATCH 1/2] correct BB frequencies after loop changed

2021-07-04 Thread guojiufu via Gcc-patches

Hi Honza and All,

After more checks, I'm thinking these patches may still be useful.
For patch 1:

https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555871.html

This patch recalculates the loop's BB-count and could correct
some BB-count mismatches for loops which have a single exit.
From the test results, we could say it reduces mismatched BB-counts
slightly.

For patch 2:
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555872.html
I updated it as below:
It resets the loop's probability when the loop count becomes unrealistically
small.  In theory, this seems to be the right direction.

Bootstrap/regtest on powerpc64le passes with no new regressions.  I'm
wondering if this is acceptable for trunk?

BR,
Jiufu Guo

Subject: Reset edge probability and BB-count for peeled/unrolled loop

This patch handles the case where unrolling with an unreliable count
can cause a loop to no longer look hot and therefore not get aligned.
It scales by profile_probability::likely () if the unrolled count gets
unrealistically small.  And it fixes the COUNT/PROB of the peeled loop.


gcc/ChangeLog:
2021-07-01  Jiufu Guo   
Pat Haugen  

PR rtl-optimization/68212
* cfgloopmanip.c (duplicate_loop_to_header_edge): Reset probablity
of unrolled/peeled loop.

testsuite/ChangeLog:
2021-07-01  Jiufu Guo   
Pat Haugen  
PR rtl-optimization/68212
* gcc.dg/pr68212.c: New test.


---
 gcc/cfgloopmanip.c | 20 ++--
 gcc/testsuite/gcc.dg/pr68212.c | 13 +
 2 files changed, 31 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr68212.c

diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c
index 4a9ab74642c..29d858c878a 100644
--- a/gcc/cfgloopmanip.c
+++ b/gcc/cfgloopmanip.c
@@ -1258,14 +1258,30 @@ duplicate_loop_to_header_edge (class loop *loop, 
edge e,

  /* If original loop is executed COUNT_IN times, the unrolled
 loop will account SCALE_MAIN_DEN times.  */
  scale_main = count_in.probability_in (scale_main_den);
+
+	 /* If we are guessing at the number of iterations and count_in
+	    becomes unrealistically small, reset probability.  */
+ if (!(count_in.reliable_p () || loop->any_estimate))
+   {
+	  profile_count new_count_in = count_in.apply_probability (scale_main);
+	  profile_count preheader_count = loop_preheader_edge (loop)->count ();
+ if (new_count_in.apply_scale (1, 10) < preheader_count)
+   scale_main = profile_probability::likely ();
+   }
+
  scale_act = scale_main * prob_pass_main;
}
   else
{
+ profile_count new_loop_count;
  profile_count preheader_count = e->count ();
- for (i = 0; i < ndupl; i++)
-   scale_main = scale_main * scale_step[i];
  scale_act = preheader_count.probability_in (count_in);
+ /* Compute final preheader count after peeling NDUPL copies.  */
+ for (i = 0; i < ndupl; i++)
+	preheader_count = preheader_count.apply_probability (scale_step[i]);
+ /* Subtract out exit(s) from peeled copies.  */
+ new_loop_count = count_in - (e->count () - preheader_count);
+ scale_main = new_loop_count.probability_in (count_in);
}
 }

diff --git a/gcc/testsuite/gcc.dg/pr68212.c b/gcc/testsuite/gcc.dg/pr68212.c
new file mode 100644
index 000..e0cf71d5202
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr68212.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-vectorize -funroll-loops --param max-unroll-times=4 -fdump-rtl-alignments" } */
+
+void foo(long int *a, long int *b, long int n)
+{
+  long int i;
+
+  for (i = 0; i < n; i++)
+a[i] = *b;
+}
+
+/* { dg-final { scan-rtl-dump-times "internal loop alignment added" 1 "alignments"} } */
+
--
2.17.1



On 2021-06-18 16:24, guojiufu via Gcc-patches wrote:

On 2021-06-15 12:57, guojiufu via Gcc-patches wrote:

On 2021-06-14 17:16, Jan Hubicka wrote:



On 5/6/2021 8:36 PM, guojiufu via Gcc-patches wrote:
> Gentle ping.
>
> Original message:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555871.html
I think you need a more aggressive ping  :-)

OK for the trunk.  Sorry for the long delay.  I kept hoping someone else
would step in and look at it.

Sorry, the patch was on my todo list to think through for a while :(
It seems to me that both the old and new code need a bit more work.  First,
the exit loop frequency is set to

 prob = profile_probability::always ().apply_scale (1, new_est_niter + 1);

which is only correct if the estimated number of iterations is accurate.
If we do not have profile feedback and the trip count is not known precisely,
in most cases it won't be.  We estimate loops to iterate about 3 times
and then niter_for_unrolled_loop will apply the capping to 5 i

Re: [PATCH] Analyze niter for until-wrap condition [PR101145]

2021-07-02 Thread guojiufu via Gcc-patches

On 2021-07-01 20:35, Richard Biener wrote:

On Thu, 1 Jul 2021, Jiufu Guo wrote:


For code like:
unsigned foo(unsigned val, unsigned start)
{
  unsigned cnt = 0;
  for (unsigned i = start; i > val; ++i)
cnt++;
  return cnt;
}

The number of iterations should be about UINT_MAX - start.


For

unsigned foo(unsigned val, unsigned start)
{
  unsigned cnt = 0;
  for (unsigned i = start; i >= val; ++i)
cnt++;
  return cnt;
}

and val == 0 the loop never terminates.  I don't see anywhere
in the patch that you disregard GE_EXPR and I remember
the code handles GE as well as GT?  From a quick look this is
also not covered by a testcase you add - not exactly sure
how it would materialize in a miscompilation.


I found a similar issue with the code below on trunk.
The code below should run forever (with i stepping by 16 from start 8,
i stays congruent to 8 mod 16 and never lands in [UINT_MAX-6, UINT_MAX],
so "i <= val" never fails), but it exits quickly.

#include 
__attribute__ ((noinline))
unsigned foo(unsigned val, unsigned start)
{
  unsigned cnt = 0;
  for (unsigned i = start; i <= val; i+=16)
cnt++;
  return cnt;
}

int main()
{
  return foo (UINT_MAX-7, 8);
}

Just opened https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101291

BR,
Jiufu Guo.


There is function adjust_cond_for_loop_until_wrap which
handles similar work for const bases.
Like adjust_cond_for_loop_until_wrap, this patch enhance
function number_of_iterations_cond/number_of_iterations_lt
to analyze number of iterations for this kind of loop.

Bootstrap and regtest pass on powerpc64le, is this ok for trunk?

gcc/ChangeLog:

PR tree-optimization/101145
* tree-ssa-loop-niter.c
(number_of_iterations_until_wrap): New function.
(number_of_iterations_lt): Invoke above function.
(adjust_cond_for_loop_until_wrap):
Merge to number_of_iterations_until_wrap.
(number_of_iterations_cond): Update invokes for
adjust_cond_for_loop_until_wrap and number_of_iterations_lt.

gcc/testsuite/ChangeLog:

PR tree-optimization/101145
* gcc.dg/vect/pr101145.c: New test.
* gcc.dg/vect/pr101145.inc: New test.
* gcc.dg/vect/pr101145_1.c: New test.
* gcc.dg/vect/pr101145_2.c: New test.
* gcc.dg/vect/pr101145_3.c: New test.
---
 gcc/testsuite/gcc.dg/vect/pr101145.c   | 187 +
 gcc/testsuite/gcc.dg/vect/pr101145.inc |  63 +
 gcc/testsuite/gcc.dg/vect/pr101145_1.c |  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_2.c |  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_3.c |  15 ++
 gcc/tree-ssa-loop-niter.c  | 150 +++-
 6 files changed, 380 insertions(+), 65 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.c b/gcc/testsuite/gcc.dg/vect/pr101145.c
new file mode 100644
index 000..74031b031cf
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr101145.c
@@ -0,0 +1,187 @@
+/* { dg-require-effective-target vect_int } */
+/* { dg-options "-O3 -fdump-tree-vect-details" } */
+#include 
+
+unsigned __attribute__ ((noinline))
+foo (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned)
+{
+  while (UINT_MAX - 64 < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  l = UINT_MAX - 32;
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_3 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  while (n <= ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_4 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{  // infinite
+  while (0 <= ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_5 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  //no loop
+  l = UINT_MAX;
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  while (--l < n)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned)
+{
+  while (--l < 64)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
+{
+  l = 32;
+  while (--l < n)
+*a++ = *b++ + 1;
+  return l;
+}
+
+
+int a[3200], b[3200];
+int fail;
+
+int
+main ()
+{
+  unsigned l, n;
+  unsigned res;
+  /* l > n*/
+  n = UINT_MAX - 64;
+  l = n + 

Re: [PATCH] Analyze niter for until-wrap condition [PR101145]

2021-07-01 Thread guojiufu via Gcc-patches

On 2021-07-02 08:51, Bin.Cheng wrote:

On Thu, Jul 1, 2021 at 10:15 PM guojiufu via Gcc-patches
 wrote:


On 2021-07-01 20:35, Richard Biener wrote:
> On Thu, 1 Jul 2021, Jiufu Guo wrote:
>
>> For code like:
>> unsigned foo(unsigned val, unsigned start)
>> {
>>   unsigned cnt = 0;
>>   for (unsigned i = start; i > val; ++i)
>> cnt++;
>>   return cnt;
>> }
>>
>> The number of iterations should be about UINT_MAX - start.
>
> For
>
> unsigned foo(unsigned val, unsigned start)
> {
>   unsigned cnt = 0;
>   for (unsigned i = start; i >= val; ++i)
> cnt++;
>   return cnt;
> }
>
> and val == 0 the loop never terminates.  I don't see anywhere
> in the patch that you disregard GE_EXPR and I remember
> the code handles GE as well as GT?  From a quick look this is
> also not covered by a testcase you add - not exactly sure
> how it would materialize in a miscompilation.

In number_of_iterations_cond, there is code:
  if (code == GE_EXPR || code == GT_EXPR
      || (code == NE_EXPR && integer_zerop (iv0->step)))
    {
      std::swap (iv0, iv1);
      code = swap_tree_comparison (code);
    }
It converts "GT/GE" (i >= val) to "LT/LE" (val <= i),
and LE (val <= i) is converted to LT (val - 1 < i).
So, the code is added to number_of_iterations_lt.

But this patch leads to miscompilation for unsigned "i >= val" under the
above transforms: converting LE (val <= i) to LT (val - 1 < i)
is not appropriate (e.g. where val == 0).
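
To see why, a minimal standalone check (illustrative, not from the patch):
for unsigned val == 0, 'val - 1' wraps to UINT_MAX, so an always-true
"val <= i" becomes an almost-always-false "val - 1 < i".

```
#include <stdio.h>

int
main (void)
{
  unsigned val = 0, i = 5;
  int le = (val <= i);     /* 1: 0 <= i holds for every unsigned i */
  int lt = (val - 1 < i);  /* 0: val - 1 wraps to UINT_MAX */
  printf ("le=%d lt=%d (val - 1 == %u)\n", le, lt, val - 1);
  return 0;
}
```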

I don't know where the exact code is, but IIRC, number_of_iteration
handles boundary conditions when transforming <= into <.  You may
check it out.

Yes, in number_of_iterations_le, there is code to check MAX/MIN:
  if (integer_nonzerop (iv0->step))
    assumption = fold_build2 (NE_EXPR, boolean_type_node,
                              iv1->base, TYPE_MAX_VALUE (type));
  else
    assumption = fold_build2 (NE_EXPR, boolean_type_node,
                              iv0->base, TYPE_MIN_VALUE (type));

I am checking why this code does not help.



Thanks for pointing out this!!!

I would investigate a way to handle this correctly.
A possible way may be just to return false for this kind of LE.

IIRC, it checks the boundary conditions, either returns false or
simply introduces more assumptions.

Thanks! Adding more assumptions would help.
The code below also runs into an infinite loop; more assumptions may help
this code too.


__attribute__ ((noinline))
unsigned foo(unsigned val, unsigned start)
{
  unsigned cnt = 0;
  for (unsigned i = start; val <= i; i+=16)
cnt++;
  return cnt;
}

foo (4, 8);

Thanks again!


BR,
Jiufu Guo


Any suggestions?

>
>> There is function adjust_cond_for_loop_until_wrap which
>> handles similar work for const bases.
>> Like adjust_cond_for_loop_until_wrap, this patch enhance
>> function number_of_iterations_cond/number_of_iterations_lt
>> to analyze number of iterations for this kind of loop.
>>
>> Bootstrap and regtest pass on powerpc64le, is this ok for trunk?
>>
>> gcc/ChangeLog:
>>
>>  PR tree-optimization/101145
>>  * tree-ssa-loop-niter.c
>>  (number_of_iterations_until_wrap): New function.
>>  (number_of_iterations_lt): Invoke above function.
>>  (adjust_cond_for_loop_until_wrap):
>>  Merge to number_of_iterations_until_wrap.
>>  (number_of_iterations_cond): Update invokes for
>>  adjust_cond_for_loop_until_wrap and number_of_iterations_lt.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  PR tree-optimization/101145
>>  * gcc.dg/vect/pr101145.c: New test.
>>  * gcc.dg/vect/pr101145.inc: New test.
>>  * gcc.dg/vect/pr101145_1.c: New test.
>>  * gcc.dg/vect/pr101145_2.c: New test.
>>  * gcc.dg/vect/pr101145_3.c: New test.
>> ---
>>  gcc/testsuite/gcc.dg/vect/pr101145.c   | 187
>> +
>>  gcc/testsuite/gcc.dg/vect/pr101145.inc |  63 +
>>  gcc/testsuite/gcc.dg/vect/pr101145_1.c |  15 ++
>>  gcc/testsuite/gcc.dg/vect/pr101145_2.c |  15 ++
>>  gcc/testsuite/gcc.dg/vect/pr101145_3.c |  15 ++
>>  gcc/tree-ssa-loop-niter.c  | 150 +++-
>>  6 files changed, 380 insertions(+), 65 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c
>>
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.c
>> b/gcc/testsuite/gcc.dg/ve

Re: [PATCH] Analyze niter for until-wrap condition [PR101145]

2021-07-01 Thread guojiufu via Gcc-patches

On 2021-07-01 20:35, Richard Biener wrote:

On Thu, 1 Jul 2021, Jiufu Guo wrote:


For code like:
unsigned foo(unsigned val, unsigned start)
{
  unsigned cnt = 0;
  for (unsigned i = start; i > val; ++i)
cnt++;
  return cnt;
}

The number of iterations should be about UINT_MAX - start.


For

unsigned foo(unsigned val, unsigned start)
{
  unsigned cnt = 0;
  for (unsigned i = start; i >= val; ++i)
cnt++;
  return cnt;
}

and val == 0 the loop never terminates.  I don't see anywhere
in the patch that you disregard GE_EXPR and I remember
the code handles GE as well as GT?  From a quick look this is
also not covered by a testcase you add - not exactly sure
how it would materialize in a miscompilation.


In number_of_iterations_cond, there is code:
   if (code == GE_EXPR || code == GT_EXPR
|| (code == NE_EXPR && integer_zerop (iv0->step)))
  {
std::swap (iv0, iv1);
code = swap_tree_comparison (code);
  }
It converts "GT/GE" (i >= val) to "LT/LE" (val <= i),
and LE (val <= i) is converted to LT (val - 1 < i).
So, the code is added to number_of_iterations_lt.

But this patch leads to miscompilation for unsigned "i >= val" under the
above transform: converting LE (val <= i) to LT (val - 1 < i)
is not appropriate (e.g. where val == 0).
Thanks for pointing out this!!!

I would investigate a way to handle this correctly.
A possible way may be just to return false for this kind of LE.

Any suggestions?




There is function adjust_cond_for_loop_until_wrap which
handles similar work for const bases.
Like adjust_cond_for_loop_until_wrap, this patch enhances
function number_of_iterations_cond/number_of_iterations_lt
to analyze number of iterations for this kind of loop.

Bootstrap and regtest pass on powerpc64le, is this ok for trunk?

gcc/ChangeLog:

PR tree-optimization/101145
* tree-ssa-loop-niter.c
(number_of_iterations_until_wrap): New function.
(number_of_iterations_lt): Invoke above function.
(adjust_cond_for_loop_until_wrap):
Merge to number_of_iterations_until_wrap.
(number_of_iterations_cond): Update invokes for
adjust_cond_for_loop_until_wrap and number_of_iterations_lt.

gcc/testsuite/ChangeLog:

PR tree-optimization/101145
* gcc.dg/vect/pr101145.c: New test.
* gcc.dg/vect/pr101145.inc: New test.
* gcc.dg/vect/pr101145_1.c: New test.
* gcc.dg/vect/pr101145_2.c: New test.
* gcc.dg/vect/pr101145_3.c: New test.
---
 gcc/testsuite/gcc.dg/vect/pr101145.c   | 187 
+

 gcc/testsuite/gcc.dg/vect/pr101145.inc |  63 +
 gcc/testsuite/gcc.dg/vect/pr101145_1.c |  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_2.c |  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_3.c |  15 ++
 gcc/tree-ssa-loop-niter.c  | 150 +++-
 6 files changed, 380 insertions(+), 65 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.c 
b/gcc/testsuite/gcc.dg/vect/pr101145.c

new file mode 100644
index 000..74031b031cf
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr101145.c
@@ -0,0 +1,187 @@
+/* { dg-require-effective-target vect_int } */
+/* { dg-options "-O3 -fdump-tree-vect-details" } */
+#include 
+
+unsigned __attribute__ ((noinline))
+foo (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, 
unsigned)

+{
+  while (UINT_MAX - 64 < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  l = UINT_MAX - 32;
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_3 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  while (n <= ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_4 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{  // infinite
+  while (0 <= ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+foo_5 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  //no loop
+  l = UINT_MAX;
+  while (n < ++l)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned 
n)

+{
+  while (--l < n)
+*a++ = *b++ + 1;
+  return l;
+}
+
+unsigned __attribute__ ((noinline))
+bar_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, 
unsigned)

+{
+  while (--l < 64)
+*a++ = *b++ + 1;
+  return l;
+}
+

Re: [PATCH] Analyze niter for until-wrap condition [PR101145]

2021-07-01 Thread guojiufu via Gcc-patches

On 2021-07-01 15:22, Bin.Cheng wrote:

On Thu, Jul 1, 2021 at 10:06 AM Jiufu Guo via Gcc-patches
 wrote:


For code like:
unsigned foo(unsigned val, unsigned start)
{
  unsigned cnt = 0;
  for (unsigned i = start; i > val; ++i)
cnt++;
  return cnt;
}

The number of iterations should be about UINT_MAX - start.

There is function adjust_cond_for_loop_until_wrap which
handles similar work for const bases.
Like adjust_cond_for_loop_until_wrap, this patch enhances
function number_of_iterations_cond/number_of_iterations_lt
to analyze number of iterations for this kind of loop.

Bootstrap and regtest pass on powerpc64le, is this ok for trunk?

gcc/ChangeLog:

PR tree-optimization/101145
* tree-ssa-loop-niter.c
(number_of_iterations_until_wrap): New function.
(number_of_iterations_lt): Invoke above function.
(adjust_cond_for_loop_until_wrap):
Merge to number_of_iterations_until_wrap.
(number_of_iterations_cond): Update invokes for
adjust_cond_for_loop_until_wrap and number_of_iterations_lt.

gcc/testsuite/ChangeLog:

PR tree-optimization/101145
* gcc.dg/vect/pr101145.c: New test.
* gcc.dg/vect/pr101145.inc: New test.
* gcc.dg/vect/pr101145_1.c: New test.
* gcc.dg/vect/pr101145_2.c: New test.
* gcc.dg/vect/pr101145_3.c: New test.
---
 gcc/testsuite/gcc.dg/vect/pr101145.c   | 187 
+

 gcc/testsuite/gcc.dg/vect/pr101145.inc |  63 +
 gcc/testsuite/gcc.dg/vect/pr101145_1.c |  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_2.c |  15 ++
 gcc/testsuite/gcc.dg/vect/pr101145_3.c |  15 ++
 gcc/tree-ssa-loop-niter.c  | 150 +++-
 6 files changed, 380 insertions(+), 65 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c




diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index b5add827018..06db6a36ef8 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1473,6 +1473,86 @@ assert_loop_rolls_lt (tree type, affine_iv 
*iv0, affine_iv *iv1,

 }
 }

+/* Determines number of iterations of loop whose ending condition
+   is IV0 < IV1, which is like:  {base, -C} < n,  or n < {base, C}.
+   The number of iterations is stored to NITER.  */
+
+static bool
+number_of_iterations_until_wrap (class loop *, tree type, affine_iv 
*iv0,
+affine_iv *iv1, class tree_niter_desc 
*niter)

+{
+  tree niter_type = unsigned_type_for (type);
+  tree max, min;
+
+  if (POINTER_TYPE_P (type))
+{
+  max = fold_convert (type, TYPE_MAX_VALUE (niter_type));
+  min = fold_convert (type, TYPE_MIN_VALUE (niter_type));
+}
+  else
+{
+  max = TYPE_MAX_VALUE (type);
+  min = TYPE_MIN_VALUE (type);
+}
+
+  tree high = max, low = min, one = build_int_cst (niter_type, 1);
+  tree step;
+
+  /* n < {base, C}. */
+  if (integer_zerop (iv0->step) && TREE_CODE (iv1->step) == 
INTEGER_CST

+  && !tree_int_cst_sign_bit (iv1->step))
+{
+  step = iv1->step;
+  niter->niter = fold_build2 (MINUS_EXPR, niter_type, max, 
iv1->base);
max/iv1->base could be of pointer type, not sure if this is canonical 
though.
Thanks.  Pointers need careful attention.  I added case pr101145_3.c as a
test for pointers; the iteration number is 7: 0xffe4 - 0x,
where the pointer type is pointer to int: "int *".  It works as expected.
I notice in number_of_iterations_lt there is code like:
delta = fold_build2 (MINUS_EXPR, niter_type,
 fold_convert (niter_type, iv1->base),
 fold_convert (niter_type, iv0->base));
This would also be ok.




+  if (TREE_CODE (iv1->base) == INTEGER_CST)
+   low = fold_build2 (MINUS_EXPR, type, iv1->base, one);
+  else if (TREE_CODE (iv0->base) == INTEGER_CST)
+   low = iv0->base;
+}
+  /* {base, -C} < n. */
+  else if (TREE_CODE (iv0->step) == INTEGER_CST
+  && tree_int_cst_sign_bit (iv0->step) && integer_zerop 
(iv1->step))

+{
+  step = fold_build1 (NEGATE_EXPR, TREE_TYPE (iv0->step), 
iv0->step);
+  niter->niter = fold_build2 (MINUS_EXPR, niter_type, iv0->base, 
min);

+  if (TREE_CODE (iv0->base) == INTEGER_CST)
+   high = fold_build2 (PLUS_EXPR, type, iv0->base, one);
+  else if (TREE_CODE (iv1->base) == INTEGER_CST)
+   high = iv1->base;
+}
+  else
+return false;
+
+  /* (delta + step - 1) / step */
+  step = fold_convert (niter_type, step);
+  niter->niter = fold_convert (niter_type, niter->niter);
+  niter->niter = fold_build2 (PLUS_EXPR, niter_type, niter->niter, 
step);
+  niter->niter = fold_build2 (FLOOR_DIV_EXPR, niter_type, 
niter->niter, step);

+
+  tree m = fold_build2 

Re: [PATCH V3] Split loop for NE condition.

2021-06-21 Thread guojiufu via Gcc-patches

On 2021-06-21 16:51, Richard Biener wrote:

On Wed, 9 Jun 2021, guojiufu wrote:


On 2021-06-09 17:42, guojiufu via Gcc-patches wrote:
> On 2021-06-08 18:13, Richard Biener wrote:
>> On Fri, 4 Jun 2021, Jiufu Guo wrote:
>>
> cut...
>>> +  gcond *cond = as_a (last);
>>> +  enum tree_code code = gimple_cond_code (cond);
>>> +  if (!(code == NE_EXPR
>>> +  || (code == EQ_EXPR && (e->flags & EDGE_TRUE_VALUE
>>
>> The NE_EXPR check misses a corresponding && (e->flags & EDGE_FALSE_VALUE)
>> check.
>>
> Thanks, check (e->flags & EDGE_FALSE_VALUE) would be safer.
>
>>> +  continue;
>>> +
>>> +  /* Check if bound is invariant.  */
>>> +  tree idx = gimple_cond_lhs (cond);
>>> +  tree bnd = gimple_cond_rhs (cond);
>>> +  if (expr_invariant_in_loop_p (loop, idx))
>>> +  std::swap (idx, bnd);
>>> +  else if (!expr_invariant_in_loop_p (loop, bnd))
>>> +  continue;
>>> +
>>> +  /* Only unsigned type conversion could cause wrap.  */
>>> +  tree type = TREE_TYPE (idx);
>>> +  if (!INTEGRAL_TYPE_P (type) || TREE_CODE (idx) != SSA_NAME
>>> +|| !TYPE_UNSIGNED (type))
>>> +  continue;
>>> +
>>> +  /* Avoid splitting if bound is MAX/MIN val.  */
>>> +  tree bound_type = TREE_TYPE (bnd);
>>> +  if (TREE_CODE (bnd) == INTEGER_CST && INTEGRAL_TYPE_P (bound_type)
>>> +&& (tree_int_cst_equal (bnd, TYPE_MAX_VALUE (bound_type))
>>> +|| tree_int_cst_equal (bnd, TYPE_MIN_VALUE (bound_type
>>> +  continue;
>>
>> Note you do not require 'bnd' to be constant and thus at runtime those
>> cases still need to be handled correctly.
> Yes, bnd is not required to be constant.  The above code is filtering the
> case
> where bnd is const max/min value of the type.  So, the code could be updated
> as:
>   if (tree_int_cst_equal (bnd, TYPE_MAX_VALUE (bound_type))
>   || tree_int_cst_equal (bnd, TYPE_MIN_VALUE (bound_type)))


Yes, and the comment is adjusted to "if bound is known to be MAX/MIN val."


>>
>>> +  /* Check if there is possible wrap.  */
>>> +  class tree_niter_desc niter;
>>> +  if (!number_of_iterations_exit (loop, e, , false, false))
> cut...
>>> +
>>> +  /* Change if (i != n) to LOOP1:if (i > n) and LOOP2:if (i < n) */
>>
>> It now occurs to me that we nowhere check the evolution of IDX
>> (split_at_bb_p uses simple_iv for this for example).  The transform
>> assumes that we will actually hit i == n and that i increments, but
>> while you check the control IV from number_of_iterations_exit
>> for NE_EXPR that does not guarantee a positive evolution.
>>
> If I do not reply to your question correctly, please point it out:
> like simple_iv, number_of_iterations_exit invokes simple_iv_with_niters,
> which checks the evolution; and number_of_iterations_exit checks
> number_of_iterations_cond, which checks no_overflow more accurately.
> This is one reason I use this function.
>
> This transform assumes that the last iteration hits i == n.
> Otherwise, the loop may run infinitely, wrapping again and again.
> For safety, if the step is 1 or -1, this assumption holds: the IV then
> visits every value of the type, so it must eventually hit n.  I
> would add this check.


OK.


> Thanks so much for pointing out I missed the negative step!
>
>> Your testcases do not include any negative step examples, but I guess
>> the conditions need to be swapped in this case?
>
> I would add cases and code to support step 1/-1.
>
>>
>> I think you also have to consider the order we split, say with
>>
>>   for (i = start; i != end; ++i)
>> {
>>   push (i);
>>   if (a[i] != b[i])
>> break;
>> }
>>
>> push (i) calls need to be in the same order for all cases of
>> start < end, start == end and start > end (and also cover
>> runtime testcases with end == 0 or end == UINT_MAX, likewise
>> for start).
> I added tests for the above cases. If something is missing, please point it out, thanks!
>
>>
>>> +  bool inv = expr_invariant_in_loop_p (loop, gimple_cond_lhs (gc));
>>> +  enum tree_code up_code = inv ? LT_EXPR : GT_EXPR;
>>> +  enum tree_code down_code = inv ? GT_EXPR : LT_EXPR;
> cut
>
> Thanks again for the very helpful review!
>
> BR,
> Jiufu Guo.

Here is the updated patch, thanks for your time!

diff --git a/gcc/testsuite/gcc.dg/loop-split1.c
b/gcc/testsuite/gcc.dg/loop-split1.c
new file mode 100644
index 000..dd2d03a7b96

[RFC] New idea to split loop based on no-wrap conditions

2021-06-21 Thread guojiufu via Gcc-patches

On 2021-06-21 14:19, guojiufu via Gcc-patches wrote:

On 2021-06-09 19:18, guojiufu wrote:

On 2021-06-09 17:42, guojiufu via Gcc-patches wrote:

On 2021-06-08 18:13, Richard Biener wrote:

On Fri, 4 Jun 2021, Jiufu Guo wrote:


cut...

cut...




Besides the method in the previous mails, 
I’m thinking of another way to split loops:

foo (int *a, int *b, unsigned k, unsigned n)
{   
 while (++k != n)
   a[k] = b[k] + 1;   
} 

We may split it into two loops guarded by "if (k < n)": loop1 for the
case where k does not wrap before reaching n, and loop2 for the wrapping
case (see the sketch below).
In most cases, loop1 would be hit; the overhead of this method is only
checking "if (k < n)",
which would be smaller than the previous method.
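
A minimal sketch of the guarded split (my reconstruction of the idea;
assuming the "k < n" guard selects the loop that cannot wrap):

void foo_split (int *a, int *b, unsigned k, unsigned n)
{
  if (k < n)
    /* loop1: k never wraps; about n - k iterations.  */
    while (++k != n)
      a[k] = b[k] + 1;
  else
    /* loop2: k first wraps through UINT_MAX, then counts up to n.  */
    while (++k != n)
      a[k] = b[k] + 1;
}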

And this method would be easier to extend to nested loops like:
 unsigned int l_n = 0;
 unsigned int l_m = 0;
 unsigned int l_k = 0;
 for (l_n = 0; l_n != n; l_n++)
   for (l_k = 0; l_k != k; l_k++)
 for (l_m = 0; l_m != m; l_m++)
 xxx;

Do you think this method is more valuable to implement? 
Below is a quick patch.  This patch does not support nested loops yet.

diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index 3a09bbc39e5..c9d161565e4 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfghooks.h"
 #include "gimple-fold.h"
 #include "gimplify-me.h"
+#include "tree-ssa-loop-ivopts.h"

 /* This file implements two kinds of loop splitting.

@@ -1593,6 +1594,468 @@ split_loop_on_cond (struct loop *loop)
   return do_split;
 }

+/* Filter out type conversions on IDX.
+   Store the shortest type during conversion to SMALL_TYPE.
+   Store the longest type during conversion to LARGE_TYPE.  */
+
+static gimple *
+filter_conversions (class loop *loop, tree idx, tree *small_type = 
NULL,

+   tree *large_type = NULL)
+{
+  gcc_assert (TREE_CODE (idx) == SSA_NAME);
+  gimple *stmt = SSA_NAME_DEF_STMT (idx);
+  while (is_gimple_assign (stmt)
+&& flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+{
+  if (CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))
+   {
+ idx = gimple_assign_rhs1 (stmt);
+ if (small_type)
+   {
+ tree type = TREE_TYPE (idx);
+ if (TYPE_PRECISION (*small_type) > TYPE_PRECISION (type)
+ || (TYPE_PRECISION (*small_type) == TYPE_PRECISION (type)
+ && TYPE_UNSIGNED (*small_type) && !TYPE_UNSIGNED (type)))
+   *small_type = type;
+   }
+ if (large_type)
+   {
+ tree type = TREE_TYPE (idx);
+ if (TYPE_PRECISION (*large_type) < TYPE_PRECISION (type)
+ || (TYPE_PRECISION (*large_type) == TYPE_PRECISION (type)
+ && !TYPE_UNSIGNED (*large_type) && TYPE_UNSIGNED (type)))
+   *large_type = type;
+   }
+   }
+  else
+   break;
+
+  if (TREE_CODE (idx) != SSA_NAME)
+   break;
+  stmt = SSA_NAME_DEF_STMT (idx);
+}
+  return stmt;
+}
+
+/* Collection of loop index related elements.  */
+struct idx_elements
+{
+  gcond *gc;
+  gphi *phi;
+  gimple *inc_stmt;
+  tree idx;
+  tree bnd;
+  tree step;
+  tree large_type;
+  tree small_type;
+  bool cmp_on_next;
+};
+
+/*  Analyze and get the idx related elements: bnd,
+phi, increase stmt from exit edge E, etc.
+
+    i = phi (b, n)
+    ...
+    n0 = i + 1
+    n1 = (type)n0
+    ...
+    if (i != bnd) or if (n != bnd)
+    ...
+    n = (type)n1
+
+   IDX is the 'i' or 'n'.  */
+
+bool
+analyze_idx_elements (class loop *loop, edge e, idx_elements )
+{
+  /* Avoid complicated edge.  */
+  if (e->flags & EDGE_FAKE)
+return false;
+  if (e->src != loop->header && e->src != single_pred (loop->latch))
+return false;
+  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, e->src))
+return false;
+
+  /* Check gcond.  */
+  gimple *last = last_stmt (e->src);
+  if (!last || gimple_code (last) != GIMPLE_COND)
+return false;
+
+  /* Get idx and bnd from gcond. */
+  gcond *gc = as_a (last);
+  tree bnd = gimple_cond_rhs (gc);
+  tree idx = gimple_cond_lhs (gc);
+  if (expr_invariant_in_loop_p (loop, idx))
+std::swap (idx, bnd);
+  else if (!expr_invariant_in_loop_p (loop, bnd))
+return false;
+  if (TREE_CODE (idx) != SSA_NAME)
+return false;
+
+  gimple *inc_stmt = NULL;
+  bool cmp_next = false;
+  tree small_type = TREE_TYPE (idx);
+  tree large_type = small_type;
+  gimple *stmt = filter_conversions (loop, idx, _type, 
_type);

+  /* If the idx on the gcond is not a PHI, it would be the 'next' value. */
+  if (is_gimple_assign (stmt))
+{
+  tree rhs = gimple_assign_rhs1 (stmt);
+  if (TREE_CODE (rhs) != SSA_NAME)
+   return false;
+
+  cmp_next = true;
+  inc_stmt = stmt;
+  stmt = filter_conversions (loop, rhs, _type, _type);
+}
+
+  /* Get phi an

Re: [PATCH V3] Split loop for NE condition.

2021-06-21 Thread guojiufu via Gcc-patches

On 2021-06-09 19:18, guojiufu wrote:

On 2021-06-09 17:42, guojiufu via Gcc-patches wrote:

On 2021-06-08 18:13, Richard Biener wrote:

On Fri, 4 Jun 2021, Jiufu Guo wrote:


cut...

cut...


Here is the updated patch, thanks for your time!


Updates:
. Enhance code to support negative step.
. Check step +-1 to make sure it hits the loop condition !=.
. Enhance runtime cases to check more boundary cases and run-order cases.
. Refine for compile time: check the loop's number of insns and
can_copy_bbs_p later.




diff --git a/gcc/testsuite/gcc.dg/loop-split1.c
b/gcc/testsuite/gcc.dg/loop-split1.c
new file mode 100644
index 000..dd2d03a7b96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split1.c
@@ -0,0 +1,101 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
+
+void
+foo (int *a, int *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+a[l] = b[l] + 1;
+}
+void
+foo_1 (int *a, int *b, unsigned n)
+{
+  unsigned l = 0;
+  while (++l != n)
+a[l] = b[l] + 1;
+}
+
+void
+foo1 (int *a, int *b, unsigned l, unsigned n)
+{
+  while (l++ != n)
+a[l] = b[l] + 1;
+}
+
+/* No wrap.  */
+void
+foo1_1 (int *a, int *b, unsigned n)
+{
+  unsigned l = 0;
+  while (l++ != n)
+a[l] = b[l] + 1;
+}
+
+unsigned
+foo2 (char *a, char *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+unsigned
+foo2_1 (char *a, char *b, unsigned l, unsigned n)
+{
+  l = 0;
+  while (++l != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+unsigned
+foo3 (char *a, char *b, unsigned l, unsigned n)
+{
+  while (l++ != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+/* No wrap.  */
+unsigned
+foo3_1 (char *a, char *b, unsigned l, unsigned n)
+{
+  l = 0;
+  while (l++ != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+void
+bar ();
+void
+foo4 (unsigned n, unsigned i)
+{
+  do
+{
+  if (i == n)
+   return;
+  bar ();
+  ++i;
+}
+  while (1);
+}
+
+unsigned
+find_skip_diff (char *p, char *q, unsigned n, unsigned i)
+{
+  while (p[i] == q[i] && ++i != n)
+p++, q++;
+
+  return i;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop split" 8 "lsplit" } } */
diff --git a/gcc/testsuite/gcc.dg/loop-split2.c
b/gcc/testsuite/gcc.dg/loop-split2.c
new file mode 100644
index 000..56377e2f2f5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split2.c
@@ -0,0 +1,155 @@
+/* { dg-do run } */
+/* { dg-options "-O3" } */
+
+extern void
+abort (void);
+extern void
+exit (int);
+void
+push (int);
+
+#define NI __attribute__ ((noinline))
+
+void NI
+foo (int *a, int *b, unsigned char l, unsigned char n)
+{
+  while (++l != n)
+a[l] = b[l] + 1;
+}
+
+unsigned NI
+bar (int *a, int *b, unsigned char l, unsigned char n)
+{
+  while (l++ != n)
+{
+  push (l);
+  if (a[l] != b[l])
+   break;
+  push (l + 1);
+}
+  return l;
+}
+
+void NI
+foo_1 (int *a, int *b, unsigned char l, unsigned char n)
+{
+  while (--l != n)
+a[l] = b[l] + 1;
+}
+
+unsigned NI
+bar_1 (int *a, int *b, unsigned char l, unsigned char n)
+{
+  while (l-- != n)
+{
+  push (l);
+  if (a[l] != b[l])
+   break;
+  push (l + 1);
+}
+
+  return l;
+}
+
+int a[258];
+int b[258];
+int c[1024];
+static int top = 0;
+void
+push (int e)
+{
+  c[top++] = e;
+}
+
+void
+reset ()
+{
+  top = 0;
+  __builtin_memset (c, 0, sizeof (c));
+}
+
+#define check(a, b) (a == b)
+
+int
+check_c (int *c, int a0, int a1, int a2, int a3, int a4, int a5)
+{
+  return check (c[0], a0) && check (c[1], a1) && check (c[2], a2)
+&& check (c[3], a3) && check (c[4], a4) && check (c[5], a5);
+}
+
+int
+main ()
+{
+  __builtin_memcpy (b, a, sizeof (a));
+  reset ();
+  if (bar (a, b, 6, 8) != 9 || !check_c (c, 7, 8, 8, 9, 0, 0))
+abort ();
+
+  reset ();
+  if (bar (a, b, 5, 3) != 4 || !check_c (c, 6, 7, 7, 8, 8, 9)
+  || !check_c (c + 496, 254, 255, 255, 256, 0, 1))
+abort ();
+
+  reset ();
+  if (bar (a, b, 6, 6) != 7 || !check_c (c, 0, 0, 0, 0, 0, 0))
+abort ();
+
+  reset ();
+  if (bar (a, b, 253, 255) != 0 || !check_c (c, 254, 255, 255, 256, 0, 
0))

+abort ();
+
+  reset ();
+  if (bar (a, b, 253, 0) != 1 || !check_c (c, 254, 255, 255, 256, 0, 
1))

+abort ();
+
+  reset ();
+  if (bar_1 (a, b, 6, 8) != 7 || !check_c (c, 5, 6, 4, 5, 3, 4))
+abort ();
+
+  reset ();
+  if (bar_1 (a, b, 5, 3) != 2 || !check_c (c, 4, 5, 3, 4, 0, 0))
+abort ();
+
+  reset ();
+  if (bar_1 (a, b, 6, 6) != 5)
+abort ();
+
+  reset ();
+  if (bar_1 (a, b, 2, 255) != 254 || !check_c (c, 1, 2, 0, 1, 255, 
256))

+abort ();
+
+  reset ();
+  if (bar_1 (a, b, 2, 0) != 255 || !check_c (c, 1, 2, 0, 1, 0, 0))
+abort ();
+
+  b[100] += 1;
+  reset ();
+  if (bar (a, b, 90, 110) != 100)
+abort ();
+
+  reset ();
+  if (bar (a, b, 110, 105) != 100)
+abort ();
+
+  reset ();
+  if (bar_1

Re: Ping: [PATCH 1/2] correct BB frequencies after loop changed

2021-06-18 Thread guojiufu via Gcc-patches

On 2021-06-15 12:57, guojiufu via Gcc-patches wrote:

On 2021-06-14 17:16, Jan Hubicka wrote:



On 5/6/2021 8:36 PM, guojiufu via Gcc-patches wrote:
> Gentle ping.
>
> Original message:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555871.html
I think you need a more aggressive ping  :-)

OK for the trunk.  Sorry for the long delay.  I kept hoping someone 
else

would step in and look at it.

Sorry, the patch was on my todo list to think through for a while :(
It seems to me that both the old and new code need a bit more work.  First
the exit loop frequency is set to

 prob = profile_probability::always ().apply_scale (1, new_est_niter + 
1);


which is only correct if the estimated number of iterations is 
accurate.
If we do not have profile feedback and trip count is not known 
precisely

in most cases it won't be.  We estimate loops to iterate about 3 times,
and then niter_for_unrolled_loop will apply the cap of 5 iterations,
which is completely arbitrary.

Forcing the exit probability to precise may then disable further loop
optimizations, since after the change we will think we know the loop
iterates 5 times and thus it is not worth loop optimization (which is
quite the opposite of the fact that we are just unrolling it thinking it
is hot).


Thanks, I understand your concern: both the new and old code assume
the number of iterations is accurate.
Maybe we could add code to reset the exit probability for the case
where "!count_in.reliable_p ()".
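
Something like this minimal sketch (illustrative only; it reuses the
names from the quoted code, so treat it as a direction rather than a
tested change):

  /* Force a precise exit probability only when the incoming profile
     count is reliable; otherwise keep the edge's existing (guessed)
     probability.  */
  if (count_in.reliable_p ())
    prob = profile_probability::always ().apply_scale (1, new_est_niter + 1);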



Old code does
 1) scale body down so only one iteration is done
 2) set exit edge probability to be 1/(new_est_iter+1)
precisely
 3) scale up according to the 1/new_nonexit_prob
which would be correct if the nonexit probability was updated to
1-exit_probability but that does not seem to happen.

New code does

Yes, this is intended: we know that the enter-count should be
equal to the exit-count of one loop, and then the
"loop-body-count * exit-probability = exit-count".
Also, the entry count of the loop would not be changed before and after
one optimization (or would change only slightly, e.g. by the peeling count).

Based on this, we could adjust the loop body count according to
exit-count (or say enter-count) and exit-probability, when the
exit-probability is easy to estimate.
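
As a worked example with made-up numbers (illustrative only, not real
profile data):

#include <stdio.h>
int main (void)
{
  double exit_count = 1000.0;    /* equals the loop entry count       */
  double exit_prob  = 1.0 / 8.0; /* estimated per-iteration exit prob */
  /* From "loop-body-count * exit-probability = exit-count":          */
  double body_count = exit_count / exit_prob;
  printf ("adjusted body count = %.0f\n", body_count); /* 8000 */
  return 0;
}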


 1) give up when there are multiple exits.
I wonder how common this is - we do outer loop vectorization


Hi Honza, and guys:

I just gathered statistics from bootstrap/regtest and the spec2017 build
and found this code is hit ~1700 times by single-exit loops; in the
spec2017 build, it hits 226 single-exit loops, and multi-exit loops are
not hit.

I also ran a test with profile-report to look at the "mismatch count";
with these patches the mismatch count is mitigated slightly, but not
very aggressively:

150 mismatch counts are reduced.
But 119 mismatch counts are increased.

Any comments about this patch? Is it acceptable for the trunk? Thanks.


BR,
Jiufu Guo.




The computation in the new code is based on a single exit. This is
also a requirement of old code, and it would be true when run to here.


 2) adjust loop body count according to the exit
 3) update profile of BB after the exit edge.





Why do you need:
+  if (current_ir_type () != IR_GIMPLE)
+update_br_prob_note (exit->src);

It is tree_transform_and_unroll_loop, so I think we should always have
IR_GIMPLE?


These two lines are added to "recompute_loop_frequencies", which can be
used in RTL, like the second patch of this:
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555872.html
Oh, maybe these two lines of code should be put into
tree_transform_and_unroll_loop
instead of the common code recompute_loop_frequencies.

Thanks a lot for the review in your busy time!

BR.
Jiufu Guo


Honza


jeff


Re: Ping: [PATCH 1/2] correct BB frequencies after loop changed

2021-06-16 Thread guojiufu via Gcc-patches

On 2021-06-15 12:57, guojiufu via Gcc-patches wrote:

On 2021-06-14 17:16, Jan Hubicka wrote:



On 5/6/2021 8:36 PM, guojiufu via Gcc-patches wrote:
> Gentle ping.
>
> Original message:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555871.html
I think you need a more aggressive ping  :-)

OK for the trunk.  Sorry for the long delay.  I kept hoping someone 
else

would step in and look at it.

Sorry, the patch was on my todo list to think through for a while :(
It seems to me that both the old and new code need a bit more work.  First
the exit loop frequency is set to

 prob = profile_probability::always ().apply_scale (1, new_est_niter + 
1);


which is only correct if the estimated number of iterations is 
accurate.
If we do not have profile feedback and trip count is not known 
precisely

in most cases it won't be.  We estimate loops to iterate about 3 times,
and then niter_for_unrolled_loop will apply the cap of 5 iterations,
which is completely arbitrary.

Forcing the exit probability to precise may then disable further loop
optimizations, since after the change we will think we know the loop
iterates 5 times and thus it is not worth loop optimization (which is
quite the opposite of the fact that we are just unrolling it thinking it
is hot).


Thanks, I understand your concern: both the new and old code assume
the number of iterations is accurate.
Maybe we could add code to reset exit probability for the case
where "!count_in.reliable_p ()".



Old code does
 1) scale body down so only one iteration is done
 2) set exit edge probability to be 1/(new_est_iter+1)
precisely
 3) scale up according to the 1/new_nonexit_prob
which would be correct if the nonexit probability was updated to
1-exit_probability but that does not seem to happen.

New code does

Yes, this is intended: we know that the enter-count should be
equal to the exit-count of one loop, and then the
"loop-body-count * exit-probability = exit-count".
Also, the entry count of the loop would not be changed before and after
one optimization (or would change only slightly, e.g. by the peeling count).

Based on this, we could adjust the loop body count according to
exit-count (or say enter-count) and exit-probability, when the
exit-probability is easy to estimate.


 1) give up when there are multiple exits.
I wonder how common this is - we do outer loop vectorization


The computation in the new code is based on a single exit. This is
also a requirement of old code, and it would be true when run to here.


To support multiple exits, I'm thinking about how to calculate the
count/probability for each basic_block and each exit edge.  But it seems
the count/prob may not scale up by the same ratio.  This is another
reason I give up on these cases with multiple exits.

Any suggestions about supporting these cases?


BR,
Jiufu Guo




 2) adjust loop body count according to the exit
 3) update profile of BB after the exit edge.





Why do you need:
+  if (current_ir_type () != IR_GIMPLE)
+update_br_prob_note (exit->src);

It is tree_transform_and_unroll_loop, so I think we should always have
IR_GIMPLE?


These two lines are added to "recompute_loop_frequencies", which can be
used in RTL, like the second patch of this:
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555872.html
Oh, maybe these two lines of code should be put into
tree_transform_and_unroll_loop
instead of the common code recompute_loop_frequencies.

Thanks a lot for the review in your busy time!

BR.
Jiufu Guo


Honza


jeff


Re: Ping: [PATCH 1/2] correct BB frequencies after loop changed

2021-06-14 Thread guojiufu via Gcc-patches

On 2021-06-14 17:16, Jan Hubicka wrote:



On 5/6/2021 8:36 PM, guojiufu via Gcc-patches wrote:
> Gentle ping.
>
> Original message:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555871.html
I think you need a more aggressive ping  :-)

OK for the trunk.  Sorry for the long delay.  I kept hoping someone 
else

would step in and look at it.

Sorry, the patch was on my todo list to think through for a while :(
It seems to me that both the old and new code need a bit more work.  First
the exit loop frequency is set to

 prob = profile_probability::always ().apply_scale (1, new_est_niter + 
1);


which is only correct if the estimated number of iterations is 
accurate.
If we do not have profile feedback and trip count is not known 
precisely

in most cases it won't be.  We estimate loops to iterate about 3 times,
and then niter_for_unrolled_loop will apply the cap of 5 iterations,
which is completely arbitrary.

Forcing the exit probability to precise may then disable further loop
optimizations, since after the change we will think we know the loop
iterates 5 times and thus it is not worth loop optimization (which is
quite the opposite of the fact that we are just unrolling it thinking it
is hot).


Thanks, I understand your concern: both the new and old code assume
the number of iterations is accurate.
Maybe we could add code to reset exit probability for the case
where "!count_in.reliable_p ()".



Old code does
 1) scale body down so only one iteration is done
 2) set exit edge probability to be 1/(new_est_iter+1)
precisely
 3) scale up according to the 1/new_nonexit_prob
which would be correct if the nonexit probability was updated to
1-exit_probability but that does not seem to happen.

New code does

Yes, this is intended: we know that the enter-count should be
equal to the exit-count of one loop, and then the
"loop-body-count * exit-probability = exit-count".
Also, the entry count of the loop would not be changed before and after
one optimization (or would change only slightly, e.g. by the peeling count).

Based on this, we could adjust the loop body count according to
exit-count (or say enter-count) and exit-probability, when the
exit-probability is easy to estimate.


 1) give up when there are multiple exits.
I wonder how common this is - we do outer loop vectorization


The computation in the new code is based on a single exit. This is
also a requirement of old code, and it would be true when run to here.


 2) adjust loop body count according to the exit
 3) update profile of BB after the exit edge.





Why do you need:
+  if (current_ir_type () != IR_GIMPLE)
+update_br_prob_note (exit->src);

It is tree_transform_and_unroll_loop, so I think we should always have
IR_GIMPLE?


These two lines are added to "recompute_loop_frequencies", which can be
used in RTL, like the second patch of this:
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555872.html
Oh, maybe these two lines of code should be put into
tree_transform_and_unroll_loop
instead of the common code recompute_loop_frequencies.

Thanks a lot for the review in your busy time!

BR.
Jiufu Guo


Honza


jeff


Re: [PATCH V3] Split loop for NE condition.

2021-06-09 Thread guojiufu via Gcc-patches

On 2021-06-09 17:42, guojiufu via Gcc-patches wrote:

On 2021-06-08 18:13, Richard Biener wrote:

On Fri, 4 Jun 2021, Jiufu Guo wrote:


cut...

+  gcond *cond = as_a (last);
+  enum tree_code code = gimple_cond_code (cond);
+  if (!(code == NE_EXPR
+   || (code == EQ_EXPR && (e->flags & EDGE_TRUE_VALUE


The NE_EXPR check misses a corresponding && (e->flags & 
EDGE_FALSE_VALUE)

check.


Thanks, checking (e->flags & EDGE_FALSE_VALUE) would be safer.


+   continue;
+
+  /* Check if bound is invariant.  */
+  tree idx = gimple_cond_lhs (cond);
+  tree bnd = gimple_cond_rhs (cond);
+  if (expr_invariant_in_loop_p (loop, idx))
+   std::swap (idx, bnd);
+  else if (!expr_invariant_in_loop_p (loop, bnd))
+   continue;
+
+  /* Only unsigned type conversion could cause wrap.  */
+  tree type = TREE_TYPE (idx);
+  if (!INTEGRAL_TYPE_P (type) || TREE_CODE (idx) != SSA_NAME
+ || !TYPE_UNSIGNED (type))
+   continue;
+
+  /* Avoid splitting if bound is MAX/MIN val.  */
+  tree bound_type = TREE_TYPE (bnd);
+  if (TREE_CODE (bnd) == INTEGER_CST && INTEGRAL_TYPE_P 
(bound_type)

+ && (tree_int_cst_equal (bnd, TYPE_MAX_VALUE (bound_type))
+ || tree_int_cst_equal (bnd, TYPE_MIN_VALUE (bound_type
+   continue;


Note you do not require 'bnd' to be constant and thus at runtime those
cases still need to be handled correctly.
Yes, bnd is not required to be constant.  The above code is filtering
the case where bnd is the constant max/min value of the type.  So, the
code could be updated as:

  if (tree_int_cst_equal (bnd, TYPE_MAX_VALUE (bound_type))
  || tree_int_cst_equal (bnd, TYPE_MIN_VALUE (bound_type)))




+  /* Check if there is possible wrap.  */
+  class tree_niter_desc niter;
+  if (!number_of_iterations_exit (loop, e, , false, 
false))

cut...

+
+  /* Change if (i != n) to LOOP1:if (i > n) and LOOP2:if (i < n) */


It now occurs to me that we nowhere check the evolution of IDX
(split_at_bb_p uses simple_iv for this for example).  The transform
assumes that we will actually hit i == n and that i increments, but
while you check the control IV from number_of_iterations_exit
for NE_EXPR that does not guarantee a positive evolution.


If I do not reply to your question correctly, please point it out:
like simple_iv, number_of_iterations_exit invokes simple_iv_with_niters,
which checks the evolution; and number_of_iterations_exit checks
number_of_iterations_cond, which checks no_overflow more accurately.
This is one reason I use this function.


This transform assumes that the last iteration hits i == n.
Otherwise, the loop may run infinitely, wrapping again and again.
For safety, if the step is 1 or -1, this assumption holds: the IV then
visits every value of the type, so it must eventually hit n.  I
would add this check.

Thanks so much for pointing out I missed the negative step!


Your testcases do not include any negative step examples, but I guess
the conditions need to be swapped in this case?


I would add cases and code to support step 1/-1.



I think you also have to consider the order we split, say with

  for (i = start; i != end; ++i)
{
  push (i);
  if (a[i] != b[i])
break;
}

push (i) calls need to be in the same order for all cases of
start < end, start == end and start > end (and also cover
runtime testcases with end == 0 or end == UINT_MAX, likewise
for start).
I added tests for the above cases. If something is missing, please
point it out, thanks!





+  bool inv = expr_invariant_in_loop_p (loop, gimple_cond_lhs (gc));
+  enum tree_code up_code = inv ? LT_EXPR : GT_EXPR;
+  enum tree_code down_code = inv ? GT_EXPR : LT_EXPR;

cut

Thanks again for the very helpful review!

BR,
Jiufu Guo.


Here is the updated patch, thanks for your time!

diff --git a/gcc/testsuite/gcc.dg/loop-split1.c 
b/gcc/testsuite/gcc.dg/loop-split1.c

new file mode 100644
index 000..dd2d03a7b96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split1.c
@@ -0,0 +1,101 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
+
+void
+foo (int *a, int *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+a[l] = b[l] + 1;
+}
+void
+foo_1 (int *a, int *b, unsigned n)
+{
+  unsigned l = 0;
+  while (++l != n)
+a[l] = b[l] + 1;
+}
+
+void
+foo1 (int *a, int *b, unsigned l, unsigned n)
+{
+  while (l++ != n)
+a[l] = b[l] + 1;
+}
+
+/* No wrap.  */
+void
+foo1_1 (int *a, int *b, unsigned n)
+{
+  unsigned l = 0;
+  while (l++ != n)
+a[l] = b[l] + 1;
+}
+
+unsigned
+foo2 (char *a, char *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+unsigned
+foo2_1 (char *a, char *b, unsigned l, unsigned n)
+{
+  l = 0;
+  while (++l != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+unsigned
+foo3 (char *a, char *b, unsigned l, unsigned n)
+{
+  while (l++ != n)
+if (a[l] !=

Re: [PATCH V3] Split loop for NE condition.

2021-06-09 Thread guojiufu via Gcc-patches

On 2021-06-08 18:13, Richard Biener wrote:

On Fri, 4 Jun 2021, Jiufu Guo wrote:


cut...

+  gcond *cond = as_a (last);
+  enum tree_code code = gimple_cond_code (cond);
+  if (!(code == NE_EXPR
+   || (code == EQ_EXPR && (e->flags & EDGE_TRUE_VALUE


The NE_EXPR check misses a corresponding && (e->flags & 
EDGE_FALSE_VALUE)

check.


Thanks, checking (e->flags & EDGE_FALSE_VALUE) would be safer.


+   continue;
+
+  /* Check if bound is invariant.  */
+  tree idx = gimple_cond_lhs (cond);
+  tree bnd = gimple_cond_rhs (cond);
+  if (expr_invariant_in_loop_p (loop, idx))
+   std::swap (idx, bnd);
+  else if (!expr_invariant_in_loop_p (loop, bnd))
+   continue;
+
+  /* Only unsigned type conversion could cause wrap.  */
+  tree type = TREE_TYPE (idx);
+  if (!INTEGRAL_TYPE_P (type) || TREE_CODE (idx) != SSA_NAME
+ || !TYPE_UNSIGNED (type))
+   continue;
+
+  /* Avoid splitting if bound is MAX/MIN val.  */
+  tree bound_type = TREE_TYPE (bnd);
+  if (TREE_CODE (bnd) == INTEGER_CST && INTEGRAL_TYPE_P 
(bound_type)

+ && (tree_int_cst_equal (bnd, TYPE_MAX_VALUE (bound_type))
+ || tree_int_cst_equal (bnd, TYPE_MIN_VALUE (bound_type
+   continue;


Note you do not require 'bnd' to be constant and thus at runtime those
cases still need to be handled correctly.
Yes, bnd is not required to be constant.  The above code is filtering
the case where bnd is the constant max/min value of the type.  So, the
code could be updated as:

  if (tree_int_cst_equal (bnd, TYPE_MAX_VALUE (bound_type))
  || tree_int_cst_equal (bnd, TYPE_MIN_VALUE (bound_type)))




+  /* Check if there is possible wrap.  */
+  class tree_niter_desc niter;
+  if (!number_of_iterations_exit (loop, e, , false, false))

cut...

+
+  /* Change if (i != n) to LOOP1:if (i > n) and LOOP2:if (i < n) */


It now occurs to me that we nowhere check the evolution of IDX
(split_at_bb_p uses simple_iv for this for example).  The transform
assumes that we will actually hit i == n and that i increments, but
while you check the control IV from number_of_iterations_exit
for NE_EXPR that does not guarantee a positive evolution.


If I do not reply to your question correctly, please point it out:
like simple_iv, number_of_iterations_exit invokes simple_iv_with_niters,
which checks the evolution; and number_of_iterations_exit checks
number_of_iterations_cond, which checks no_overflow more accurately.
This is one reason I use this function.


This transform assumes that the last iteration hits i == n.
Otherwise, the loop may run infinitely, wrapping again and again.
For safety, if the step is 1 or -1, this assumption holds: the IV then
visits every value of the type, so it must eventually hit n.  I
would add this check.


Thanks so much for pointing out I missed the negative step!


Your testcases do not include any negative step examples, but I guess
the conditions need to be swapped in this case?


I would add cases and code to support step 1/-1.



I think you also have to consider the order we split, say with

  for (i = start; i != end; ++i)
{
  push (i);
  if (a[i] != b[i])
break;
}

push (i) calls need to be in the same order for all cases of
start < end, start == end and start > end (and also cover
runtime testcases with end == 0 or end == UINT_MAX, likewise
for start).
I added tests for the above cases. If something is missing, please
point it out, thanks!





+  bool inv = expr_invariant_in_loop_p (loop, gimple_cond_lhs (gc));
+  enum tree_code up_code = inv ? LT_EXPR : GT_EXPR;
+  enum tree_code down_code = inv ? GT_EXPR : LT_EXPR;

cut

Thanks again for the very helpful review!

BR,
Jiufu Guo.




Re: Ping^2: [PATCH 1/2] correct BB frequencies after loop changed

2021-06-06 Thread guojiufu via Gcc-patches

Gentle ping ;)

BR.
Jiufu Guo
On 2021-05-20 15:19, guojiufu via Gcc-patches wrote:

Gentle ping^.

On 2021-05-07 10:36, guojiufu via Gcc-patches wrote:

Gentle ping.

Original message:
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555871.html


Thanks,
Jiufu Guo.


Re: [PATCH V2] Split loop for NE condition.

2021-06-01 Thread guojiufu via Gcc-patches

On 2021-06-01 11:28, guojiufu via Gcc-patches wrote:

On 2021-05-26 17:50, Richard Biener wrote:

On Mon, 17 May 2021, Jiufu Guo wrote:





Or relax all this, of course.
It would be easy to handle the above cases: e->src before the latch, or
a simple header.

To relax this, we may need to peel (partially peel) one loop between
the first loop and the second loop, or jump into the middle of the
second loop.  I had a quick try at implementing this, but did not find
a good way.
Thanks for any suggestions!


Previously, I ran the GCC bootstrap to collect statistics on this kind
of loop.  The 'ch' pass (tree-ssa-loop-ch.c) already peels part of the
loop and transforms loops into do-while form.  This would be one reason
that this kind of loop is rare.
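
For reference, a conceptual sketch of that transform (my paraphrase of
the effect of the ch pass, not its actual implementation):

/* Before loop-header copying: the exit test runs first.  */
void f_before (int *a, unsigned n)
{
  unsigned i = 0;
  while (i != n)
    a[i++] = 0;
}

/* After (conceptually): one copied test guards a do-while body.  */
void f_after (int *a, unsigned n)
{
  unsigned i = 0;
  if (i != n)
    do
      a[i++] = 0;
    while (i != n);
}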

BR.
Jiufu Guo.






+ gsi_next ();
+ if (!gsi_end_p (gsi) && gsi_stmt (gsi) == gc)
+   return e;
+   }
+}
+
+  return NULL;
+}
+


Below is an updated patch.  Thanks again for your comments!


diff --git a/gcc/testsuite/g++.dg/vect/pr98064.cc
b/gcc/testsuite/g++.dg/vect/pr98064.cc
index 74043ce7725..dcb2985d05a 100644
--- a/gcc/testsuite/g++.dg/vect/pr98064.cc
+++ b/gcc/testsuite/g++.dg/vect/pr98064.cc
@@ -1,5 +1,7 @@
 // { dg-do compile }
-// { dg-additional-options "-O3" }
+// { dg-additional-options "-O3 -Wno-stringop-overflow" }
+/* There is a warning message when "short g = var_8; g; g++"
+   is optimized/analyzed as a string operation, e.g. memset.  */

 const long long (const long long &__a, long long &__b) {
   if (__b < __a)
diff --git a/gcc/testsuite/gcc.dg/loop-split1.c
b/gcc/testsuite/gcc.dg/loop-split1.c
new file mode 100644
index 000..dd2d03a7b96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split1.c
@@ -0,0 +1,101 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
+
+void
+foo (int *a, int *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+a[l] = b[l] + 1;
+}
+void
+foo_1 (int *a, int *b, unsigned n)
+{
+  unsigned l = 0;
+  while (++l != n)
+a[l] = b[l] + 1;
+}
+
+void
+foo1 (int *a, int *b, unsigned l, unsigned n)
+{
+  while (l++ != n)
+a[l] = b[l] + 1;
+}
+
+/* No wrap.  */
+void
+foo1_1 (int *a, int *b, unsigned n)
+{
+  unsigned l = 0;
+  while (l++ != n)
+a[l] = b[l] + 1;
+}
+
+unsigned
+foo2 (char *a, char *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+unsigned
+foo2_1 (char *a, char *b, unsigned l, unsigned n)
+{
+  l = 0;
+  while (++l != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+unsigned
+foo3 (char *a, char *b, unsigned l, unsigned n)
+{
+  while (l++ != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+/* No wrap.  */
+unsigned
+foo3_1 (char *a, char *b, unsigned l, unsigned n)
+{
+  l = 0;
+  while (l++ != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+void
+bar ();
+void
+foo4 (unsigned n, unsigned i)
+{
+  do
+{
+  if (i == n)
+   return;
+  bar ();
+  ++i;
+}
+  while (1);
+}
+
+unsigned
+find_skip_diff (char *p, char *q, unsigned n, unsigned i)
+{
+  while (p[i] == q[i] && ++i != n)
+p++, q++;
+
+  return i;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop split" 8 "lsplit" } } */
diff --git a/gcc/testsuite/gcc.dg/loop-split2.c
b/gcc/testsuite/gcc.dg/loop-split2.c
new file mode 100644
index 000..0d3fded3f61
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split2.c
@@ -0,0 +1,54 @@
+/* { dg-do run } */
+/* { dg-options "-O3" } */
+
+extern void abort (void);
+extern void exit (int);
+
+#define NI __attribute__ ((noinline))
+
+void NI
+foo (int *a, int *b, unsigned char l, unsigned char n)
+{
+  while (++l != n)
+a[l] = b[l] + 1;
+}
+
+unsigned NI
+bar (int *a, int *b, unsigned char l, unsigned char n)
+{
+  while (l++ != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+int a[258];
+int b[258];
+
+int main()
+{
+  __builtin_memcpy (b, a, sizeof (a));
+
+  if (bar (a, b, 3, 8) != 9)
+abort ();
+
+  if (bar (a, b, 8, 3) != 4)
+abort ();
+
+  b[100] += 1;
+  if (bar (a, b, 90, 110) != 100)
+abort ();
+
+  if (bar (a, b, 110, 105) != 100)
+abort ();
+
+  foo (a, b, 99, 99);
+  a[99] = b[99] + 1;
+  for (int i = 0; i < 256; i++)
+if (a[i] != b[i] + 1)
+  abort();
+
+  exit (0);
+}
+
diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index 3a09bbc39e5..0428b0abea6 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfghooks.h"
 #include "gimple-fold.h"
 #include "gimplify-me.h"
+#include "tree-ssa-loop-ivopts.h"

 /* This file implements two kinds of loop splitting.

@@ -229,11 +230,14 @@ easy_exit_values (class loop *loop)
conditional).  I.e. the second loop can now be entered either
via the original entry or via NEW_E, 

Re: [PATCH V2] Split loop for NE condition.

2021-05-31 Thread guojiufu via Gcc-patches

On 2021-05-26 17:50, Richard Biener wrote:

On Mon, 17 May 2021, Jiufu Guo wrote:


...


  while (++k > n)
a[k] = b[k]  + 1;

then for the second loop, it could be optimized.


Btw, I think even the first loop should be vectorized.  I see we do
not handle it in niter analysis:

Analyzing loop at t.c:3
t.c:3:14: note:  === analyze_loop_nest ===
t.c:3:14: note:   === vect_analyze_loop_form ===
t.c:3:14: note:=== get_loop_niters ===
t.c:3:14: missed:   not vectorized: number of iterations cannot be
computed.

but the number of iterations should be UINT_MAX - k (unless I'm
missing sth), may_be_zero would be sth like k < n.  It would be
nice to not split this into loops that niter analysis cannot handle ...


For this case, the first loop is not vectorized by trunk gcc.
But since we know the types of 'k' and 'n' are unsigned, the number of
iterations can be computed.  I'm wondering whether enhancing
'number_of_iterations_cond' may be able to handle this.
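
For intuition, a brute-force check on a narrow type (unsigned char
standing in for unsigned; my illustration, not part of any patch) agrees
with the UINT_MAX - k style estimate:

#include <stdio.h>
#include <limits.h>
int main (void)
{
  unsigned char k = 10, n = 3, i = k;
  unsigned trips = 0;
  while (++i > n)   /* models "while (++k > n)" */
    trips++;
  printf ("trips = %u, UCHAR_MAX - k = %u\n",
          trips, (unsigned) (UCHAR_MAX - 10));  /* both 245 */
  return 0;
}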

...

+
+/* { dg-final { scan-tree-dump-times "Loop split" 9 "lsplit" } } */


Please consider making the testcase an execute one, verifying
computation results.

Thanks!



diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index 3a09bbc39e5..5c1742b5ff4 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfghooks.h"
 #include "gimple-fold.h"
 #include "gimplify-me.h"
+#include "tree-ssa-loop-ivopts.h"

 /* This file implements two kinds of loop splitting.

@@ -233,7 +234,8 @@ easy_exit_values (class loop *loop)
this.  The loops need to fulfill easy_exit_values().  */



please document use_prev

Thanks.



 static void
-connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e)
+connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e,
+  bool use_prev = false)
 {
   basic_block rest = loop_preheader_edge (loop2)->src;
   gcc_assert (new_e->dest == rest);
@@ -279,7 +281,8 @@ connect_loop_phis (class loop *loop1, class loop 
*loop2, edge new_e)


   gphi * newphi = create_phi_node (new_init, rest);
   add_phi_arg (newphi, init, skip_first, UNKNOWN_LOCATION);
-  add_phi_arg (newphi, next, new_e, UNKNOWN_LOCATION);
+  add_phi_arg (newphi, use_prev ? PHI_RESULT (phi_first) : next, 
new_e,

+  UNKNOWN_LOCATION);
   SET_USE (op, new_init);
 }
 }
@@ -1593,6 +1596,184 @@ split_loop_on_cond (struct loop *loop)
   return do_split;
 }

+/* Check if the LOOP exit branch likes "if (idx != bound)",


is like

Thanks.



+   Return the branch edge which exits the loop, if overflow/wrap
+   may happen on "idx".  */


I think we only want to handle wrapping (thus not undefined overflow).

Yes, you are right.
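
To spell that out with a tiny example (my illustration): unsigned
wrap-around is well-defined, while signed overflow is undefined
behaviour, so only the unsigned case needs this handling:

#include <limits.h>
int main (void)
{
  unsigned u = UINT_MAX;
  u = u + 1;          /* well-defined: wraps to 0 */
  int s = INT_MAX;
  /* s = s + 1; */    /* undefined behaviour: never rely on wrapping */
  (void) u; (void) s;
  return 0;
}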



+
+static edge
+get_ne_cond_branch (struct loop *loop)
+{
+  int i;
+  edge e;
+
+  auto_vec edges = get_loop_exit_edges (loop);
+  FOR_EACH_VEC_ELT (edges, i, e)
+{
+  /* Check if there is possible wrap/overflow.  */
+  class tree_niter_desc niter;
+  if (!number_of_iterations_exit (loop, e, , false, false))
+   continue;
+  if (niter.control.no_overflow)
+   return NULL;
+  if (niter.cmp != NE_EXPR)
+   continue;
+
+  /* Check loop is simple to split.  */


it seems like the following and the below condition mean "simple"
is either all code before or all code after the exit in question;
please improve the comment to explain the two cases.


+  if (single_pred_p (loop->latch)
+ && single_pred_edge (loop->latch)->src == e->src
+ && (gsi_end_p (gsi_start_nondebug_bb (loop->latch


split_loop uses empty_block_p (loop->latch).

yes, empty_block_p also filters label/debug insts.



+   return e;
+
+  /* Simple header.  */
+  if (e->src == loop->header)
+   {
+ if (get_virtual_phi (e->src))
+   continue;


So this disqualifies all loops which store to memory?
We do not need to check this condition since here we allow only the
i++/++i header.



+
+ /* Only one phi.  */
+ gphi_iterator psi = gsi_start_phis (e->src);
+ if (gsi_end_p (psi))
+   continue;
+ gsi_next ();
+ if (!gsi_end_p (psi))
+   continue;
+
+ /* ++i or ++i */
+ gimple_stmt_iterator gsi = gsi_start_bb (e->src);


I think you want gsi_start_nondebug_after_labels_bb (e->src)

Thanks.



+
+ gimple *gc = last_stmt (e->src);
+ tree idx = gimple_cond_lhs (gc);


you have to check the last stmt is a GIMPLE_COND, we have
recorded exits that exit on EH for example.


The "number_of_iterations_exit" check above ensures
it is a GIMPLE_COND.  I would add a gcc_assert to check this, as
sketched below.
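
Something like this sketch (names follow the quoted patch; illustrative
only):

  gimple *last = last_stmt (e->src);
  /* number_of_iterations_exit already succeeded, so this exit must end
     in a condition; assert instead of re-checking.  */
  gcc_assert (last && gimple_code (last) == GIMPLE_COND);
  gcond *gc = as_a <gcond *> (last);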



+ if (expr_invariant_in_loop_p (loop, idx))
+   idx = gimple_cond_rhs (gc);
+
+ gimple *s1 = gsi_stmt (gsi);
+ if (!(is_gimple_assign (s1) && idx
+   && (idx == gimple_assign_lhs (s1)
+   || idx == 

Re: [PATCH] go/100537 - Bootstrap-O3 and bootstrap-debug fail

2021-05-20 Thread guojiufu via Gcc-patches

On 2021-05-18 14:58, Richard Biener wrote:

On Mon, 17 May 2021, Ian Lance Taylor wrote:


On Mon, May 17, 2021 at 1:17 AM Richard Biener via Gcc-patches
 wrote:
>
> On Fri, May 14, 2021 at 11:19 AM guojiufu via Gcc-patches
>  wrote:
> >
> > On 2021-05-14 15:39, guojiufu via Gcc-patches wrote:
> > > On 2021-05-14 15:15, Richard Biener wrote:
> > >> On May 14, 2021 4:52:56 AM GMT+02:00, Jiufu Guo
> > >>  wrote:
> > >>> As discussed in the PR, Richard mentioned the method to
> > >>> figure out which VAR was not set TREE_ADDRESSABLE, and
> > >>> then cause this failure.  It is address_expression which
> > >>> build addr_expr (build_fold_addr_expr_loc), but not set
> > >>> TREE_ADDRESSABLE.
> > >>>
> > >>> I drafted this patch with reference the comments from Richard
> > >>> in this PR, while I'm not quite sure if more thing need to do.
> > >>> So, please have review, thanks!
> > >>>
> > >>> Bootstrap and regtest pass on ppc64le. Is this ok for trunk?
> > >>
> > >> I suggest to use mark_addresssable unless we're sure expr is always an
> > >> entity where TREE_ADDRESSABLE has the desired meaning.
> >
> > Thanks, Richard!
> > You point out the root concern, I'm not sure ;)
> >
> > With looking at code "mark_addresssable" and code around
> > tree-ssa.c:1013,
> > VAR_P, PARM_DECL, and RESULT_DECL are checked before accessing
> > TREE_ADDRESSABLE.
> > So, just wondering if these entities need to be marked as
> > TREE_ADDRESSABLE?
> >
> > diff --git a/gcc/go/go-gcc.cc b/gcc/go/go-gcc.cc
> > index 5d9dbb5d068..85d324a92cc 100644
> > --- a/gcc/go/go-gcc.cc
> > +++ b/gcc/go/go-gcc.cc
> > @@ -1680,6 +1680,11 @@ Gcc_backend::address_expression(Bexpression*
> > bexpr, Location location)
> > if (expr == error_mark_node)
> >   return this->error_expression();
> >
> > +  if ((VAR_P(expr)
> > +   || TREE_CODE(expr) == PARM_DECL
> > +   || TREE_CODE(expr) == RESULT_DECL)
> > +TREE_ADDRESSABLE (expr) = 1;
> > +
>
> The root concern is that mark_addressable does
>
>   while (handled_component_p (x))
> x = TREE_OPERAND (x, 0);
>
> and I do not know the constraints on 'expr' as passed to
> Gcc_backend::address_expression.
>
> I think we need input from Ian here.  Most FEs have their own 
*_mark_addressable
> function where they also emit diagnostics (guess this is handled in
> the actual Go frontend).
> Since Gcc_backend does lowering to GENERIC using a middle-end is probably OK.

I doubt I understand all the issues here.

In general the Go frontend only takes the addresses of VAR_DECLs or
PARM_DECLs.  It doesn't bother to set TREE_ADDRESSABLE for global
variables for which TREE_STATIC or DECL_EXTERNAL is true.  For local
variables it sets TREE_ADDRESSABLE based on the is_address_taken
parameter to Gcc_backend::local_variable, and similarly for PARM_DECLs
and Gcc_backend::parameter_variable.

The name in the bug report is for a string initializer, which should
be TREE_STATIC == 1 and TREE_PUBLIC == 0.  Perhaps the fix is simply
to set TREE_ADDRESSABLE in Gcc_backend::immutable_struct and
Gcc_backend::implicit_variable.  I can't see how it would hurt to set
TREE_ADDRESSABLE unnecessarily for a TREE_STATIC variable.

But, again, I doubt I understand all the issues here.


GENERIC requires TREE_ADDRESSABLE to be set on all address-taken
VAR_DECLs, PARM_DECLs and RESULT_DECLs - the gimplifier is the
first to require this for correctness.  Setting TREE_ADDRESSABLE
when the address is not taken is harmless and at most results in
missed optimizations (on most entities we are able to clear the
flag later).

We're currently quite forgiving with this though (still the
gimplifier can generate wrong-code).  The trigger of the current
failure removed one "forgiveness", I do plan to remove a few more.

guojiufu's patch works for me but as said I'm not sure if there's
a better place to set TREE_ADDRESSABLE for entities that have
their address taken - definitely catching the places where
you build an ADDR_EXPR are the most obvious ones.

Richard.


I tested the below patch as Ian suggested.  Bootstrap passes.


diff --git a/gcc/go/go-gcc.cc b/gcc/go/go-gcc.cc
index 5d9dbb5d068..529f657598a 100644
--- a/gcc/go/go-gcc.cc
+++ b/gcc/go/go-gcc.cc
@@ -2943,6 +2943,7 @@ Gcc_backend::implicit_variable(const std::string& name,

   TREE_STATIC(decl) = 1;
   TREE_USED(decl) = 1;
   DECL_ARTIFICIAL(decl) = 1;
+  TREE_ADDRESSABLE(decl) = 1;
   if (is_common)
 {
   DECL_COMMON(decl) = 1;
@@ -3053,6 +3054,7 @@ Gcc_backend::immutable_s

Ping^1: [PATCH 1/2] correct BB frequencies after loop changed

2021-05-20 Thread guojiufu via Gcc-patches

Gentle ping^.

On 2021-05-07 10:36, guojiufu via Gcc-patches wrote:

Gentle ping.

Original message:
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555871.html


Thanks,
Jiufu Guo.


Re: [PATCH] go/100537 - Bootstrap-O3 and bootstrap-debug fail

2021-05-18 Thread guojiufu via Gcc-patches

On 2021-05-18 14:58, Richard Biener wrote:

On Mon, 17 May 2021, Ian Lance Taylor wrote:


On Mon, May 17, 2021 at 1:17 AM Richard Biener via Gcc-patches
 wrote:
>
> On Fri, May 14, 2021 at 11:19 AM guojiufu via Gcc-patches
>  wrote:
> >
> > On 2021-05-14 15:39, guojiufu via Gcc-patches wrote:
> > > On 2021-05-14 15:15, Richard Biener wrote:
> > >> On May 14, 2021 4:52:56 AM GMT+02:00, Jiufu Guo
> > >>  wrote:
> > >>> As discussed in the PR, Richard mentioned the method to
> > >>> figure out which VAR was not set TREE_ADDRESSABLE, and
> > >>> then causes this failure.  It is address_expression which
> > >>> builds the addr_expr (build_fold_addr_expr_loc) but does not set
> > >>> TREE_ADDRESSABLE.
> > >>>
> > >>> I drafted this patch with reference to the comments from Richard
> > >>> in this PR, though I'm not quite sure whether more needs to be done.
> > >>> So, please review, thanks!
> > >>>
> > >>> Bootstrap and regtest pass on ppc64le. Is this ok for trunk?
> > >>
> > >> I suggest using mark_addressable unless we're sure expr is always an
> > >> entity where TREE_ADDRESSABLE has the desired meaning.
> >
> > Thanks, Richard!
> > You point out the root concern, I'm not sure ;)
> >
> > Looking at the "mark_addressable" code and the code around
> > tree-ssa.c:1013,
> > VAR_P, PARM_DECL, and RESULT_DECL are checked before accessing
> > TREE_ADDRESSABLE.
> > So, just wondering whether these entities need to be marked as
> > TREE_ADDRESSABLE?
> >
> > diff --git a/gcc/go/go-gcc.cc b/gcc/go/go-gcc.cc
> > index 5d9dbb5d068..85d324a92cc 100644
> > --- a/gcc/go/go-gcc.cc
> > +++ b/gcc/go/go-gcc.cc
> > @@ -1680,6 +1680,11 @@ Gcc_backend::address_expression(Bexpression*
> > bexpr, Location location)
> > if (expr == error_mark_node)
> >   return this->error_expression();
> >
> > +  if (VAR_P(expr)
> > +      || TREE_CODE(expr) == PARM_DECL
> > +      || TREE_CODE(expr) == RESULT_DECL)
> > +    TREE_ADDRESSABLE (expr) = 1;
> > +
>
> The root concern is that mark_addressable does
>
>   while (handled_component_p (x))
> x = TREE_OPERAND (x, 0);
>
> and I do not know the constraints on 'expr' as passed to
> Gcc_backend::address_expression.
>
> I think we need input from Ian here.  Most FEs have their own 
*_mark_addressable
> function where they also emit diagnostics (guess this is handled in
> the actual Go frontend).
> Since Gcc_backend does lowering to GENERIC, using the middle-end
> mark_addressable is probably OK.

I doubt I understand all the issues here.

In general the Go frontend only takes the addresses of VAR_DECLs or
PARM_DECLs.  It doesn't bother to set TREE_ADDRESSABLE for global
variables for which TREE_STATIC or DECL_EXTERNAL is true.  For local
variables it sets TREE_ADDRESSABLE based on the is_address_taken
parameter to Gcc_backend::local_variable, and similarly for PARM_DECLs
and Gcc_backend::parameter_variable.

The name in the bug report is for a string initializer, which should
be TREE_STATIC == 1 and TREE_PUBLIC == 0.  Perhaps the fix is simply
to set TREE_ADDRESSABLE in Gcc_backend::immutable_struct and
Gcc_backend::implicit_variable.  I can't see how it would hurt to set
TREE_ADDRESSABLE unnecessarily for a TREE_STATIC variable.


One more finding:

Gcc_backend::implicit_variable -> build_decl is called
for "" at
Unary_expression::do_get_backend (expressions.cc:5322).

And this code (below) is from Unary_expression::do_get_backend
(expressions.cc:5322):
  gogo->backend()->implicit_variable(var_name, "", btype, true, true, false, 0);

where var_name is go..C479


This code is under **"case OPERATOR_AND:"** of a switch statement.
A Unary_expression with OPERATOR_AND is an "&" expression, I guess,
so it may count as taking an address.

And as the log mentioned: "PHI ".
In this PHI, (59) would be the Unary_expression with OPERATOR_AND
on "go..C479".

So, I guess, we may be able to treat "Unary_expression with OPERATOR_AND"
as an address-taken operation.  Then it would be necessary to mark the
operand addressable.


address_expression: "ret = gogo->backend()->address_expression(bexpr, 
loc);"
(expressions.cc:5330) is already called under "Unary_expression with 
OPERATOR_AND".


Does this make sense?  If so, we may set "TREE_ADDRESSABLE" just before 
expressions.cc:5330?


Hope this finding is helpful.

BR.
Jiufu Guo.



But, again, I doubt I understand all the issues here.


GENERIC requires TREE_ADDRESSABLE to be set on all address-taken
VAR_D

Re: [PATCH V2] Split loop for NE condition.

2021-05-18 Thread guojiufu via Gcc-patches

On 2021-05-18 18:32, guojiufu wrote:

On 2021-05-18 17:28, guojiufu via Gcc-patches wrote:

On 2021-05-18 14:36, Bernd Edlinger wrote:

On 5/17/21 4:01 AM, Jiufu Guo via Gcc-patches wrote:
When there is the possibility that overflow/wrap may happen on the
loop index, a few optimizations would not happen.  For example, this code:

foo (int *a, int *b, unsigned k, unsigned n)
{
  while (++k != n)
a[k] = b[k]  + 1;
}

For this code, if "k > n", k would wrap.  if "k < n" at begining,
it could be optimized (e.g. vectorization).

We can split the loop into two loops:

  while (++k > n)
    a[k] = b[k] + 1;
  while (k++ < n)
    a[k] = b[k] + 1;

then for the second loop, it could be optimized.
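
To see why the wrap case blocks trip-count analysis, consider a concrete
run (a standalone illustration, not from the patch): with k = 10 and
n = 3 the loop still terminates, but only after k wraps past UINT_MAX:

  unsigned k = 10, n = 3, iters = 0;
  while (++k != n)   /* k runs 11, 12, ..., UINT_MAX, 0, 1, 2, 3 */
    ++iters;
  /* iters == UINT_MAX - 7 here: the trip count depends on modular
     wrap-around, which is why niter analysis must stay conservative.  */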

This patch is splitting this kind of small loop to achieve better 
performance.


Bootstrap and regtest pass on ppc64le.  Is this ok for trunk?

Thanks!

Jiufu Guo.

gcc/ChangeLog:

2021-05-15  Jiufu Guo  

* tree-ssa-loop-split.c (connect_loop_phis): Add new param.
(get_ne_cond_branch): New function.
(split_ne_loop): New function.
(split_loop_on_ne_cond): New function.
(tree_ssa_split_loops): Use split_loop_on_ne_cond.

gcc/testsuite/ChangeLog:

2021-05-15  Jiufu Guo  

* gcc.dg/loop-split1.c: New test.
* g++.dg/vect/pr98064.cc: Suppress warning.
---
 gcc/testsuite/g++.dg/vect/pr98064.cc |   4 +-
 gcc/testsuite/gcc.dg/loop-split1.c   | 108 +++
 gcc/tree-ssa-loop-split.c| 188 
++-

 3 files changed, 296 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/loop-split1.c

diff --git a/gcc/testsuite/g++.dg/vect/pr98064.cc 
b/gcc/testsuite/g++.dg/vect/pr98064.cc

index 74043ce7725..dcb2985d05a 100644
--- a/gcc/testsuite/g++.dg/vect/pr98064.cc
+++ b/gcc/testsuite/g++.dg/vect/pr98064.cc
@@ -1,5 +1,7 @@
 // { dg-do compile }
-// { dg-additional-options "-O3" }
+// { dg-additional-options "-O3 -Wno-stringop-overflow" }
+/* There is a warning message when "short g = var_8; g; g++"
+   is optimized/analyzed as a string operation, e.g. memset.  */

 const long long (const long long &__a, long long &__b) {
   if (__b < __a)
diff --git a/gcc/testsuite/gcc.dg/loop-split1.c 
b/gcc/testsuite/gcc.dg/loop-split1.c

new file mode 100644
index 000..30b006b1b14
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split1.c
@@ -0,0 +1,108 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
+
+void
+foo (int *a, int *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+a[l] = b[l]  + 1;
+}
+void
+foo_1 (int *a, int *b, unsigned n)
+{
+  unsigned l = 0;
+  while (++l != n)
+a[l] = b[l]  + 1;
+}
+
+void
+foo1 (int *a, int *b, unsigned l, unsigned n)
+{
+  while (l++ != n)
+a[l] = b[l]  + 1;
+}
+
+/* No wrap.  */
+void
+foo1_1 (int *a, int *b, unsigned n)
+{
+  unsigned l = 0;
+  while (l++ != n)
+a[l] = b[l]  + 1;
+}
+
+unsigned
+foo2 (char *a, char *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+unsigned
+foo2_1 (char *a, char *b, unsigned l, unsigned n)
+{
+  l = 0;
+  while (++l != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+unsigned
+foo3 (char *a, char *b, unsigned l, unsigned n)
+{
+  while (l++ != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+/* No wrap.  */
+unsigned
+foo3_1 (char *a, char *b, unsigned l, unsigned n)
+{
+  l = 0;
+  while (l++ != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+void bar();
+void foo4(unsigned n,  unsigned i)
+{
+  do
+{
+  if (i == n)
+return;
+  bar();
+  ++i;
+}
+  while (1);
+}
+
+unsigned
+foo5 (double *a, unsigned n, unsigned i)
+{
+  while (a[i] > 0 && i != n)
+i++;
+
+  return i;
+}
+
+unsigned
+find_skip_diff (char *p, char *q, unsigned n, unsigned i)
+{
+  while (p[i] == q[i] && ++i != n)
+p++,q++;
+
+  return i;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop split" 9 "lsplit" } } */
diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index 3a09bbc39e5..5c1742b5ff4 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfghooks.h"
 #include "gimple-fold.h"
 #include "gimplify-me.h"
+#include "tree-ssa-loop-ivopts.h"

 /* This file implements two kinds of loop splitting.

@@ -233,7 +234,8 @@ easy_exit_values (class loop *loop)
this.  The loops need to fulfill easy_exit_values().  */

 static void
-connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e)
+connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e,
+                   bool use_prev = false)
 {
   basic_block rest = loop_preheader_edge (loop2)->src;
   gcc_assert (new_e->dest == rest);
@@ -279,7 +281,8 @@ conne


Re: [PATCH] go/100537 - Bootstrap-O3 and bootstrap-debug fail

2021-05-17 Thread guojiufu via Gcc-patches

On 2021-05-17 16:17, Richard Biener wrote:

On Fri, May 14, 2021 at 11:19 AM guojiufu via Gcc-patches
 wrote:


On 2021-05-14 15:39, guojiufu via Gcc-patches wrote:
> On 2021-05-14 15:15, Richard Biener wrote:
>> On May 14, 2021 4:52:56 AM GMT+02:00, Jiufu Guo
>>  wrote:
>>> As discussed in the PR, Richard mentioned the method to
>>> figure out which VAR was not set TREE_ADDRESSABLE, and
>>> then causes this failure.  It is address_expression which
>>> builds the addr_expr (build_fold_addr_expr_loc) but does not set
>>> TREE_ADDRESSABLE.
>>>
>>> I drafted this patch with reference to the comments from Richard
>>> in this PR, though I'm not quite sure whether more needs to be done.
>>> So, please review, thanks!
>>>
>>> Bootstrap and regtest pass on ppc64le. Is this ok for trunk?
>>
>> I suggest using mark_addressable unless we're sure expr is always an
>> entity where TREE_ADDRESSABLE has the desired meaning.

Thanks, Richard!
You point out the root concern, I'm not sure ;)

Looking at the "mark_addressable" code and the code around
tree-ssa.c:1013,
VAR_P, PARM_DECL, and RESULT_DECL are checked before accessing
TREE_ADDRESSABLE.
So, just wondering whether these entities need to be marked as
TREE_ADDRESSABLE?

diff --git a/gcc/go/go-gcc.cc b/gcc/go/go-gcc.cc
index 5d9dbb5d068..85d324a92cc 100644
--- a/gcc/go/go-gcc.cc
+++ b/gcc/go/go-gcc.cc
@@ -1680,6 +1680,11 @@ Gcc_backend::address_expression(Bexpression*
bexpr, Location location)
if (expr == error_mark_node)
  return this->error_expression();

+  if (VAR_P(expr)
+      || TREE_CODE(expr) == PARM_DECL
+      || TREE_CODE(expr) == RESULT_DECL)
+    TREE_ADDRESSABLE (expr) = 1;
+


The root concern is that mark_addressable does

  while (handled_component_p (x))
x = TREE_OPERAND (x, 0);

and I do not know the constraints on 'expr' as passed to
Gcc_backend::address_expression.

I think we need input from Ian here.  Most FEs have their own 
*_mark_addressable

function where they also emit diagnostics (guess this is handled in
the actual Go frontend).
Since Gcc_backend does lowering to GENERIC, using the middle-end
mark_addressable is probably OK.


Yeap.  Hope this patch is OK, so the bootstrap can pass.
Otherwise, we may need more help from Ian and others ;)

Jiufu Guo.



tree ret = build_fold_addr_expr_loc(location.gcc_location(), expr);
return this->make_expression(ret);
  }


Or call mark_addressable, and update mark_addressable to avoid NULL
pointer ICE:
The patch below also passes bootstrap-debug.

diff --git a/gcc/gimple-expr.c b/gcc/gimple-expr.c
index b8c732b632a..f682841391b 100644
--- a/gcc/gimple-expr.c
+++ b/gcc/gimple-expr.c
@@ -915,6 +915,7 @@ mark_addressable (tree x)
if (TREE_CODE (x) == VAR_DECL
&& !DECL_EXTERNAL (x)
&& !TREE_STATIC (x)
+  && cfun != NULL


I'd be OK with this hunk of course.


&& cfun->gimple_df != NULL
&& cfun->gimple_df->decls_to_pointers != NULL)
  {
diff --git a/gcc/go/go-gcc.cc b/gcc/go/go-gcc.cc
index 5d9dbb5d068..fe9dfaf8579 100644
--- a/gcc/go/go-gcc.cc
+++ b/gcc/go/go-gcc.cc
@@ -1680,6 +1680,7 @@ Gcc_backend::address_expression(Bexpression*
bexpr, Location location)
if (expr == error_mark_node)
  return this->error_expression();

+  mark_addressable(expr);
tree ret = build_fold_addr_expr_loc(location.gcc_location(), expr);
return this->make_expression(ret);
  }


>
> I notice you mentioned "mark_addressable" in the PR.
> I had tried it yesterday; it causes new ICEs at gimple-expr.c:918,
> at the line below:
>
>   && cfun->gimple_df != NULL
>
>
>
>>
>> Richard.
>>
>>> Jiufu Guo.
>>>
>>> 2021-05-14  Richard Biener  
>>> Jiufu Guo 
>>>
>>> PR go/100537
>>> * go-gcc.cc
>>> (Gcc_backend::address_expression): Set TREE_ADDRESSABLE.
>>>
>>> ---
>>> gcc/go/go-gcc.cc | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/gcc/go/go-gcc.cc b/gcc/go/go-gcc.cc
>>> index 5d9dbb5d068..8ed20a3b479 100644
>>> --- a/gcc/go/go-gcc.cc
>>> +++ b/gcc/go/go-gcc.cc
>>> @@ -1680,6 +1680,7 @@ Gcc_backend::address_expression(Bexpression*
>>> bexpr, Location location)
>>>   if (expr == error_mark_node)
>>> return this->error_expression();
>>>
>>> +  TREE_ADDRESSABLE (expr) = 1;
>>>   tree ret = build_fold_addr_expr_loc(location.gcc_location(), expr);
>>>   return this->make_expression(ret);
>>> }


[RFC] split loop for NE condition.

2021-05-14 Thread guojiufu via Gcc-patches

I've refined the patch as below.
This patch checks "unsigned type" and iv.no_overflow.
I'm also thinking of using "number_of_iterations_exit (loop, e, &niter,
false, false, NULL)"

and "niter.control.no_overflow" to check overflow/wrap, which may be
more accurate but relatively "expensive".

"nowrap_type_p and scev_probably_wraps_p" may be a little cheaper,
but "number_of_iterations_exit" would be more accurate.

Is this right?
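
For concreteness, a minimal sketch of the niter-based alternative
mentioned above (assuming the exit edge "e" and a local "niter" as in
the call quoted above; a sketch only, not what the posted patch does):

  class tree_niter_desc niter;
  if (number_of_iterations_exit (loop, e, &niter, false, false, NULL)
      && niter.control.no_overflow)
    continue;  /* control IV provably does not wrap: no need to split */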

BR,
Jiufu Guo.


diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index 3a09bbc39e5..425593ca70f 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfghooks.h"
 #include "gimple-fold.h"
 #include "gimplify-me.h"
+#include "tree-ssa-loop-ivopts.h"

 /* This file implements two kinds of loop splitting.

@@ -233,7 +234,8 @@ easy_exit_values (class loop *loop)
this.  The loops need to fulfill easy_exit_values().  */

 static void
-connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e)
+connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e,
+  bool use_prev = false)
 {
   basic_block rest = loop_preheader_edge (loop2)->src;
   gcc_assert (new_e->dest == rest);
@@ -279,7 +281,8 @@ connect_loop_phis (class loop *loop1, class loop 
*loop2, edge new_e)


   gphi * newphi = create_phi_node (new_init, rest);
   add_phi_arg (newphi, init, skip_first, UNKNOWN_LOCATION);
-  add_phi_arg (newphi, next, new_e, UNKNOWN_LOCATION);
+  add_phi_arg (newphi, use_prev ? PHI_RESULT (phi_first) : next, new_e,
+               UNKNOWN_LOCATION);
   SET_USE (op, new_init);
 }
 }
@@ -1593,6 +1596,229 @@ split_loop_on_cond (struct loop *loop)
   return do_split;
 }

+/* Check if the LOOP exit branch looks like "if (idx != bound)".
+   Return the branch edge which exits the loop, if overflow/wrap
+   may happen on "idx".  */
+
+static edge
+get_ne_cond_branch (struct loop *loop)
+{
+  int i;
+  edge e;
+
+  auto_vec<edge> edges = get_loop_exit_edges (loop);
+  FOR_EACH_VEC_ELT (edges, i, e)
+{
+  basic_block bb = e->src;
+
+  /* Check gcond.  */
+  gimple *last = last_stmt (bb);
+  if (!last || gimple_code (last) != GIMPLE_COND)
+   continue;
+  gcond *cond = as_a <gcond *> (last);
+  enum tree_code code = gimple_cond_code (cond);
+      if (!(code == NE_EXPR
+            || (code == EQ_EXPR && (e->flags & EDGE_TRUE_VALUE))))
+        continue;
+
+      /* Check if the bound is invariant.  */
+  tree idx = gimple_cond_lhs (cond);
+  tree bnd = gimple_cond_rhs (cond);
+  if (expr_invariant_in_loop_p (loop, idx))
+   std::swap (idx, bnd);
+  else if (!expr_invariant_in_loop_p (loop, bnd))
+   continue;
+
+  /* By default, unsigned type conversion could cause overflow.  */
+  tree type = TREE_TYPE (idx);
+  if (!INTEGRAL_TYPE_P (type) || TREE_CODE (idx) != SSA_NAME
+ || !TYPE_UNSIGNED (type)
+ || TYPE_PRECISION (type) == TYPE_PRECISION (sizetype))
+   continue;
+
+      /* Avoid splitting if the bound is the MAX/MIN value.  */
+  tree bound_type = TREE_TYPE (bnd);
+      if (TREE_CODE (bnd) == INTEGER_CST && INTEGRAL_TYPE_P (bound_type)
+          && (bnd == TYPE_MAX_VALUE (bound_type)
+              || bnd == TYPE_MIN_VALUE (bound_type)))
+        continue;
+
+  /* Extract conversion from idx.  */
+  if (TREE_CODE (idx) == SSA_NAME)
+   {
+ gimple *stmt = SSA_NAME_DEF_STMT (idx);
+ if (is_gimple_assign (stmt)
+ && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt))
+ && flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+   idx = gimple_assign_rhs1 (stmt);
+   }
+
+  /* Check if idx is simple iv with possible overflow/wrap.  */
+  class loop *useloop = loop_containing_stmt (cond);
+  affine_iv iv;
+  if (!simple_iv (loop, useloop, idx, &iv, false))
+   continue;
+  if (iv.no_overflow)
+   return NULL;
+
+      /* If the base is a known value (especially 0/1), other optimizations
+         may be able to analyze "idx != bnd" as "idx < bnd" or "idx > bnd".  */
+  if (TREE_CODE (iv.base) == INTEGER_CST)
+   continue;
+
+  /* Check loop is simple to split.  */
+  gcc_assert (bb != loop->latch);
+
+  if (single_pred_p (loop->latch)
+ && single_pred_edge (loop->latch)->src == bb
+ && (gsi_end_p (gsi_start_nondebug_bb (loop->latch
+   return e;
+
+  /* Cheap header.  */
+  if (bb == loop->header)
+   {
+ if (get_virtual_phi (bb))
+   continue;
+
+ /* Only one phi.  */
+ gphi_iterator psi = gsi_start_phis (bb);
+ if (gsi_end_p (psi))
+   continue;
+ gsi_next (&psi);
+ if (!gsi_end_p (psi))
+   continue;
+
+ /* ++i or i++ */
+ gimple_stmt_iterator gsi = gsi_start_bb (bb);
+ if (gsi_end_p (gsi))
+   continue;
+
+ gimple *s1 = gsi_stmt (gsi);
+ 

Re: [PATCH] go/100537 - Bootstrap-O3 and bootstrap-debug fail

2021-05-14 Thread guojiufu via Gcc-patches

On 2021-05-14 15:39, guojiufu via Gcc-patches wrote:

On 2021-05-14 15:15, Richard Biener wrote:
On May 14, 2021 4:52:56 AM GMT+02:00, Jiufu Guo 
 wrote:

As discussed in the PR, Richard mentioned the method to
figure out which VAR was not set TREE_ADDRESSABLE, and
then causes this failure.  It is address_expression which
builds the addr_expr (build_fold_addr_expr_loc) but does not set
TREE_ADDRESSABLE.

I drafted this patch with reference to the comments from Richard
in this PR, though I'm not quite sure whether more needs to be done.
So, please review, thanks!

Bootstrap and regtest pass on ppc64le. Is this ok for trunk?


I suggest using mark_addressable unless we're sure expr is always an
entity where TREE_ADDRESSABLE has the desired meaning.


Thanks, Richard!
You point out the root concern, I'm not sure ;)

Looking at the "mark_addressable" code and the code around
tree-ssa.c:1013,
VAR_P, PARM_DECL, and RESULT_DECL are checked before accessing
TREE_ADDRESSABLE.
So, just wondering whether these entities need to be marked as
TREE_ADDRESSABLE?


diff --git a/gcc/go/go-gcc.cc b/gcc/go/go-gcc.cc
index 5d9dbb5d068..85d324a92cc 100644
--- a/gcc/go/go-gcc.cc
+++ b/gcc/go/go-gcc.cc
@@ -1680,6 +1680,11 @@ Gcc_backend::address_expression(Bexpression* 
bexpr, Location location)

   if (expr == error_mark_node)
 return this->error_expression();

+  if (VAR_P(expr)
+      || TREE_CODE(expr) == PARM_DECL
+      || TREE_CODE(expr) == RESULT_DECL)
+    TREE_ADDRESSABLE (expr) = 1;
+
   tree ret = build_fold_addr_expr_loc(location.gcc_location(), expr);
   return this->make_expression(ret);
 }


Or call mark_addressable, and update mark_addressable to avoid NULL 
pointer ICE:

The patch below also passes bootstrap-debug.

diff --git a/gcc/gimple-expr.c b/gcc/gimple-expr.c
index b8c732b632a..f682841391b 100644
--- a/gcc/gimple-expr.c
+++ b/gcc/gimple-expr.c
@@ -915,6 +915,7 @@ mark_addressable (tree x)
   if (TREE_CODE (x) == VAR_DECL
   && !DECL_EXTERNAL (x)
   && !TREE_STATIC (x)
+  && cfun != NULL
   && cfun->gimple_df != NULL
   && cfun->gimple_df->decls_to_pointers != NULL)
 {
diff --git a/gcc/go/go-gcc.cc b/gcc/go/go-gcc.cc
index 5d9dbb5d068..fe9dfaf8579 100644
--- a/gcc/go/go-gcc.cc
+++ b/gcc/go/go-gcc.cc
@@ -1680,6 +1680,7 @@ Gcc_backend::address_expression(Bexpression* 
bexpr, Location location)

   if (expr == error_mark_node)
 return this->error_expression();

+  mark_addressable(expr);
   tree ret = build_fold_addr_expr_loc(location.gcc_location(), expr);
   return this->make_expression(ret);
 }




I notice you mentioned "mark_addressable" in the PR.
I had tried it yesterday; it causes new ICEs at gimple-expr.c:918,
at the line below:

  && cfun->gimple_df != NULL





Richard.


Jiufu Guo.

2021-05-14  Richard Biener  
Jiufu Guo 

PR go/100537
* go-gcc.cc
(Gcc_backend::address_expression): Set TREE_ADDRESSABLE.

---
gcc/go/go-gcc.cc | 1 +
1 file changed, 1 insertion(+)

diff --git a/gcc/go/go-gcc.cc b/gcc/go/go-gcc.cc
index 5d9dbb5d068..8ed20a3b479 100644
--- a/gcc/go/go-gcc.cc
+++ b/gcc/go/go-gcc.cc
@@ -1680,6 +1680,7 @@ Gcc_backend::address_expression(Bexpression*
bexpr, Location location)
  if (expr == error_mark_node)
return this->error_expression();

+  TREE_ADDRESSABLE (expr) = 1;
  tree ret = build_fold_addr_expr_loc(location.gcc_location(), expr);
  return this->make_expression(ret);
}



Re: [PATCH] split loop for NE condition.

2021-05-07 Thread guojiufu via Gcc-patches

On 2021-05-06 16:27, Richard Biener wrote:

On Thu, 6 May 2021, guojiufu wrote:


On 2021-05-03 20:18, Richard Biener wrote:
> On Thu, 29 Apr 2021, Jiufu Guo wrote:
>
>> When there is the possibility that overflow may happen on the loop index,
>> a few optimizations would not happen.  For example, this code:
>>
>> foo (int *a, int *b, unsigned k, unsigned n)
>> {
>>   while (++k != n)
>> a[k] = b[k]  + 1;
>> }
>>
>> For this code, if "l > n", overflow may happen.  if "l < n" at begining,
>> it could be optimized (e.g. vectorization).
>>
>> We can split the loop into two loops:
>>
>>   while (++k > n)
>>     a[k] = b[k] + 1;
>>   while (k++ < n)
>>     a[k] = b[k] + 1;
>>
>> then for the second loop, it could be optimized.
>>
>> This patch is splitting this kind of small loop to achieve better
>> performance.
>>
>> Bootstrap and regtest pass on ppc64le.  Is this ok for trunk?
>
> Do you have any statistics on how often this splits a loop during
> bootstrap (use --with-build-config=bootstrap-O3)?  Or alternatively
> on SPEC?

In SPEC2017, ~240 loops are split, and I saw some performance
improvement on xz.
I tried bootstrap-O3 (and encountered an ICE).
Without this patch, the ICE is also there when building with
bootstrap-O3 on ppc64le.




>
> Actual comments on the patch inline.
>
>> Thanks!
>>
>> Jiufu Guo.
>>
>> gcc/ChangeLog:
>>
>> 2021-04-29  Jiufu Guo  
>>
>>  * params.opt (max-insns-ne-cond-split): New.
>>  * tree-ssa-loop-split.c (connect_loop_phis): Add new param.
>>  (get_ne_cond_branch): New function.
>>  (split_ne_loop): New function.
>>  (split_loop_on_ne_cond): New function.
>>  (tree_ssa_split_loops): Use split_loop_on_ne_cond.
>>
>> gcc/testsuite/ChangeLog:
>> 2021-04-29  Jiufu Guo  
>>
>>  * gcc.dg/loop-split1.c: New test.
>>
>> ---
>>  gcc/params.opt |   4 +
>>  gcc/testsuite/gcc.dg/loop-split1.c |  28 
>>  gcc/tree-ssa-loop-split.c  | 219
>> -
>>  3 files changed, 247 insertions(+), 4 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/loop-split1.c
>>
>> diff --git a/gcc/params.opt b/gcc/params.opt
>> index 2e4cbdd7a71..900b59b5136 100644
>> --- a/gcc/params.opt
>> +++ b/gcc/params.opt
>> @@ -766,6 +766,10 @@ Min. ratio of insns to prefetches to enable
>> prefetching for a loop with an unkno
>> Common Joined UInteger Var(param_min_loop_cond_split_prob) Init(30)
>> IntegerRange(0, 100) Param Optimization
>> The minimum threshold for probability of semi-invariant condition statement
>> to trigger loop split.
>>
>> +-param=max-insns-ne-cond-split=
>> +Common Joined UInteger Var(param_max_insn_ne_cond_split) Init(64) Param
>> Optimization
>> +The maximum threshold for insnstructions number of a loop with ne
>> condition to split.
>> +
>>  -param=min-nondebug-insn-uid=
>>  Common Joined UInteger Var(param_min_nondebug_insn_uid) Param
>>  The minimum UID to be used for a nondebug insn.
>> diff --git a/gcc/testsuite/gcc.dg/loop-split1.c
>> b/gcc/testsuite/gcc.dg/loop-split1.c
>> new file mode 100644
>> index 000..4c466aa9f54
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/loop-split1.c
>> @@ -0,0 +1,28 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
>> +
>> +void
>> +foo (int *a, int *b, unsigned l, unsigned n)
>> +{
>> +  while (++l != n)
>> +a[l] = b[l]  + 1;
>> +}
>> +
>> +void
>> +foo1 (int *a, int *b, unsigned l, unsigned n)
>> +{
>> +  while (l++ != n)
>> +a[l] = b[l]  + 1;
>> +}
>> +
>> +unsigned
>> +foo2 (char *a, char *b, unsigned l, unsigned n)
>> +{
>> +  while (++l != n)
>> +if (a[l] != b[l])
>> +  break;
>> +
>> +  return l;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Loop split" 3 "lsplit" } } */
>> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
>> index b80b6a75e62..a6d28078e5e 100644
>> --- a/gcc/tree-ssa-loop-split.c
>> +++ b/gcc/tree-ssa-loop-split.c
>> @@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "cfghooks.h"
>>  #include "gimple-fold.h"
>>  #include "gimplify-me.h"
>> +#include "tree-ssa-loop-ivopts.h"
>>
>>  /* This file implements two kinds of loop splitting.
>>
>> @@ -233,7 +234,8 @@ easy_exit_values (class loop *loop)
>> this.  The loops need to fulfill easy_exit_values().  */
>>
>> static void
>> -connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e)
>> +connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e,
>> + bool use_prev = false)
>>  {
>>basic_block rest = loop_preheader_edge (loop2)->src;
>>gcc_assert (new_e->dest == rest);
>> @@ -248,13 +250,14 @@ connect_loop_phis (class loop *loop1, class loop
>> *loop2, edge new_e)
>> !gsi_end_p (psi_first);
>> gsi_next (&psi_first), gsi_next (&psi_second))
>>  {
>> -  tree init, next, new_init;
>> +  tree init, next, new_init, prev;
>>use_operand_p op;
>>gphi *phi_first = psi_first.phi ();
>>gphi *phi_second = 

Ping: [PATCH 1/2] correct BB frequencies after loop changed

2021-05-06 Thread guojiufu via Gcc-patches

Gentle ping.

Original message: 
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555871.html



Thanks,
Jiufu Guo.


Re: [PATCH] split loop for NE condition.

2021-05-06 Thread guojiufu via Gcc-patches

On 2021-05-03 20:18, Richard Biener wrote:

On Thu, 29 Apr 2021, Jiufu Guo wrote:

When there is the possibility that overflow may happen on the loop
index, a few optimizations would not happen.  For example, this code:

foo (int *a, int *b, unsigned k, unsigned n)
{
  while (++k != n)
a[k] = b[k]  + 1;
}

For this code, if "l > n", overflow may happen.  if "l < n" at 
begining,

it could be optimized (e.g. vectorization).

We can split the loop into two loops:

   while (++k > n)
     a[k] = b[k] + 1;
   while (k++ < n)
     a[k] = b[k] + 1;

then for the second loop, it could be optimized.

This patch is splitting this kind of small loop to achieve better 
performance.


Bootstrap and regtest pass on ppc64le.  Is this ok for trunk?


Do you have any statistics on how often this splits a loop during
bootstrap (use --with-build-config=bootstrap-O3)?  Or alternatively
on SPEC?


In SPEC2017, ~240 loops are split, and I saw some performance
improvement on xz.

I tried bootstrap-O3 (and encountered an ICE).



Actual comments on the patch inline.


Thanks!

Jiufu Guo.

gcc/ChangeLog:

2021-04-29  Jiufu Guo  

* params.opt (max-insns-ne-cond-split): New.
* tree-ssa-loop-split.c (connect_loop_phis): Add new param.
(get_ne_cond_branch): New function.
(split_ne_loop): New function.
(split_loop_on_ne_cond): New function.
(tree_ssa_split_loops): Use split_loop_on_ne_cond.

gcc/testsuite/ChangeLog:
2021-04-29  Jiufu Guo  

* gcc.dg/loop-split1.c: New test.

---
 gcc/params.opt |   4 +
 gcc/testsuite/gcc.dg/loop-split1.c |  28 
 gcc/tree-ssa-loop-split.c  | 219 
-

 3 files changed, 247 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/loop-split1.c

diff --git a/gcc/params.opt b/gcc/params.opt
index 2e4cbdd7a71..900b59b5136 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -766,6 +766,10 @@ Min. ratio of insns to prefetches to enable 
prefetching for a loop with an unkno
 Common Joined UInteger Var(param_min_loop_cond_split_prob) Init(30) 
IntegerRange(0, 100) Param Optimization
 The minimum threshold for probability of semi-invariant condition 
statement to trigger loop split.


+-param=max-insns-ne-cond-split=
+Common Joined UInteger Var(param_max_insn_ne_cond_split) Init(64) 
Param Optimization
+The maximum threshold for insnstructions number of a loop with ne 
condition to split.

+
 -param=min-nondebug-insn-uid=
 Common Joined UInteger Var(param_min_nondebug_insn_uid) Param
 The minimum UID to be used for a nondebug insn.
diff --git a/gcc/testsuite/gcc.dg/loop-split1.c 
b/gcc/testsuite/gcc.dg/loop-split1.c

new file mode 100644
index 000..4c466aa9f54
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split1.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
+
+void
+foo (int *a, int *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+a[l] = b[l]  + 1;
+}
+
+void
+foo1 (int *a, int *b, unsigned l, unsigned n)
+{
+  while (l++ != n)
+a[l] = b[l]  + 1;
+}
+
+unsigned
+foo2 (char *a, char *b, unsigned l, unsigned n)
+{
+  while (++l != n)
+if (a[l] != b[l])
+  break;
+
+  return l;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop split" 3 "lsplit" } } */
diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index b80b6a75e62..a6d28078e5e 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfghooks.h"
 #include "gimple-fold.h"
 #include "gimplify-me.h"
+#include "tree-ssa-loop-ivopts.h"

 /* This file implements two kinds of loop splitting.

@@ -233,7 +234,8 @@ easy_exit_values (class loop *loop)
this.  The loops need to fulfill easy_exit_values().  */

 static void
-connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e)
+connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e,
+  bool use_prev = false)
 {
   basic_block rest = loop_preheader_edge (loop2)->src;
   gcc_assert (new_e->dest == rest);
@@ -248,13 +250,14 @@ connect_loop_phis (class loop *loop1, class loop 
*loop2, edge new_e)

!gsi_end_p (psi_first);
gsi_next (&psi_first), gsi_next (&psi_second))
 {
-  tree init, next, new_init;
+  tree init, next, new_init, prev;
   use_operand_p op;
   gphi *phi_first = psi_first.phi ();
   gphi *phi_second = psi_second.phi ();

   init = PHI_ARG_DEF_FROM_EDGE (phi_first, firste);
   next = PHI_ARG_DEF_FROM_EDGE (phi_first, firstn);
+  prev = PHI_RESULT (phi_first);
   op = PHI_ARG_DEF_PTR_FROM_EDGE (phi_second, seconde);
   gcc_assert (operand_equal_for_phi_arg_p (init, USE_FROM_PTR 
(op)));


@@ -279,7 +282,7 @@ connect_loop_phis (class loop *loop1, class loop 
*loop2, edge new_e)


   gphi * newphi = create_phi_node (new_init, rest);
   add_phi_arg (newphi, init, skip_first, 

Re: [PATCH] split loop for NE condition.

2021-05-05 Thread guojiufu via Gcc-patches

On 2021-05-01 05:37, Segher Boessenkool wrote:

Hi!

On Thu, Apr 29, 2021 at 05:50:48PM +0800, Jiufu Guo wrote:
When there is the possibility that overflow may happen on the loop
index, a few optimizations would not happen.  For example, this code:

foo (int *a, int *b, unsigned k, unsigned n)
{
  while (++k != n)
a[k] = b[k]  + 1;
}

For this code, if "l > n", overflow may happen.  if "l < n" at 
begining,

it could be optimized (e.g. vectorization).


FWIW, this isn't called "overflow" in C: all overflow is undefined
behaviour.

"A computation involving unsigned operands can never overflow, because 
a
result that cannot be represented by the resulting unsigned integer 
type

is reduced modulo the number that is one greater than the largest value
that can be represented by the resulting type."


Thanks for pointing this out; yes, it may be better to call it 'wrap' :)




+-param=max-insns-ne-cond-split=
+Common Joined UInteger Var(param_max_insn_ne_cond_split) Init(64) 
Param Optimization
+The maximum threshold for insnstructions number of a loop with ne 
condition to split.


"number of instructions".

Perhaps you should mark up "ne" as a codeword somehow, but because it
is in a help text it is probably better to just write out "not equal"
or similar?


Will update it accordingly.  Thanks for your suggestion!



@@ -248,13 +250,14 @@ connect_loop_phis (class loop *loop1, class loop 
*loop2, edge new_e)

!gsi_end_p (psi_first);
gsi_next (&psi_first), gsi_next (&psi_second))
 {
-  tree init, next, new_init;
+  tree init, next, new_init, prev;
   use_operand_p op;
   gphi *phi_first = psi_first.phi ();
   gphi *phi_second = psi_second.phi ();

   init = PHI_ARG_DEF_FROM_EDGE (phi_first, firste);
   next = PHI_ARG_DEF_FROM_EDGE (phi_first, firstn);
+  prev = PHI_RESULT (phi_first);
   op = PHI_ARG_DEF_PTR_FROM_EDGE (phi_second, seconde);
   gcc_assert (operand_equal_for_phi_arg_p (init, USE_FROM_PTR 
(op)));




I would just declare it at the first use...  Less mental load for the
reader.  (And a smaller patch ;-) )

Yeap, thanks!




+/* Check if the LOOP exit branch likes "if (idx != bound)".
+   if INV is not NULL and the branch is "if (bound != idx)", set *INV 
to true.


"If INV", sentences start with a capital.


Thanks :)



+  /* Make sure idx and bound.  */
+  tree idx = gimple_cond_lhs (cond);
+  tree bnd = gimple_cond_rhs (cond);
+  if (expr_invariant_in_loop_p (loop, idx))
+   {
+ std::swap (idx, bnd);
+ if (inv)
+   *inv = true;
+   }
+  else if (!expr_invariant_in_loop_p (loop, bnd))
+   continue;


Make sure idx and bound what?  What about them?


+  /* Make sure idx is iv.  */
+  class loop *useloop = loop_containing_stmt (cond);
+  affine_iv iv;
+  if (!simple_iv (loop, useloop, idx, &iv, false))
+   continue;


"Make sure idx is a simple_iv"?

Thanks, the comment should be clearer; the intention is to
make sure the "lhs/rhs" pair is an "index/bound" pair.




+
+  /* No need to split loop, if base is know value.
+Or check range info.  */


"if base is a known value".  Not sure what you mean with range info?
A possible future improvement?
The intention is "If there is no wrap/overflow happen", no need to split 
loop".
If the base is a known value, the index may not wrap/overflow and may be 
able

optimized by other passes.
Using range-info to check wrap/overflow could be a future improvement.




+  /* There is type conversion on idx(or rhs of idx's def).
+And there is converting shorter to longer type. */
+  tree type = TREE_TYPE (idx);
+  if (!INTEGRAL_TYPE_P (type) || TREE_CODE (idx) != SSA_NAME
+ || !TYPE_UNSIGNED (type)
+ || TYPE_PRECISION (type) == TYPE_PRECISION (sizetype))
+   continue;


"IDX is an unsigned type that is widened to SIZETYPE" etc.

This is better wording :)



This code assumes SIZETYPE is bigger than any other integer type.  Is
that true?  Even if so, the second comment could be improved.

(Not reviewing further, my Gimple isn't near good enough, sorry.  But
at least to my untrained eye it looks pretty good :-) )


Thanks so much for your very helpful comments!

Jiufu Guo.




Segher


Re: [PATCH] split loop for NE condition.

2021-05-05 Thread guojiufu via Gcc-patches

On 2021-05-01 00:27, Jeff Law wrote:

On 4/29/2021 3:50 AM, Jiufu Guo via Gcc-patches wrote:
When there is the possibility that overflow may happen on the loop
index, a few optimizations would not happen.  For example, this code:

foo (int *a, int *b, unsigned k, unsigned n)
{
   while (++k != n)
 a[k] = b[k]  + 1;
}

For this code, if "l > n", overflow may happen.  if "l < n" at 
begining,

it could be optimized (e.g. vectorization).

We can split the loop into two loops:

   while (++k > n)
     a[k] = b[k] + 1;
   while (k++ < n)
     a[k] = b[k] + 1;

then for the second loop, it could be optimized.

This patch is splitting this kind of small loop to achieve better 
performance.


Bootstrap and regtest pass on ppc64le.  Is this ok for trunk?

Thanks!

Jiufu Guo.

gcc/ChangeLog:

2021-04-29  Jiufu Guo  

* params.opt (max-insns-ne-cond-split): New.
* tree-ssa-loop-split.c (connect_loop_phis): Add new param.
(get_ne_cond_branch): New function.
(split_ne_loop): New function.
(split_loop_on_ne_cond): New function.
(tree_ssa_split_loops): Use split_loop_on_ne_cond.


I haven't looked at the patch in any detail, but I wonder if the same
concept could be used to fix pr59371, which is a long standing
regression.  Yea, it's reported against MIPS, but the concepts are
fairly generic.


Yes, thanks for pointing this out!  This patch handles "!=", which is a
little different from pr59371.  But as you point out, the concept can
be used for pr59371: split the loop for possible wrap/overflow on
index/bound.

We could enhance this patch to handle the case in pr59371!

Thanks!
Jiufu Guo.



Jeff


[PATCH] Clean up loop-closed PHIs at loopdone pass

2020-11-05 Thread guojiufu via Gcc-patches
In PR87473, there are discussions about loop-closed PHIs, which
are generated for loop optimization passes.  It would be helpful
to clean them up after loop optimization is done; this may
simplify the job of subsequent passes.
This patch introduces a cheap way to propagate them out in
pass_tree_loop_done.
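
For reference, a loop-closed PHI is a single-argument PHI node in the
loop's exit block, kept to maintain loop-closed SSA form.  A minimal
illustration (hypothetical SSA names):

  /* exit block after loop optimizations:
       # j_10 = PHI <j_5(4)>      <- loop-closed PHI, single argument
       _2 = j_10 * w_3;
     propagating j_5 into the use leaves the PHI trivially dead:
       _2 = j_5 * w_3;  */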

This patch passes bootstrap and regtest on ppc64le.  Is this ok for trunk?

gcc/ChangeLog
2020-10-05  Jiufu Guo   

* tree-ssa-loop.h (clean_up_loop_closed_phi): New declaration.
* tree-ssa-loop.c (tree_ssa_loop_done): Call clean_up_loop_closed_phi.
* tree-ssa-propagate.c (propagate_rhs_into_lhs): New function.

gcc/testsuite/ChangeLog
2020-10-05  Jiufu Guo   

* gcc.dg/tree-ssa/loopclosedphi.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/loopclosedphi.c |  21 +++
 gcc/tree-ssa-loop.c   |   1 +
 gcc/tree-ssa-loop.h   |   1 +
 gcc/tree-ssa-propagate.c  | 120 ++
 4 files changed, 143 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/loopclosedphi.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loopclosedphi.c 
b/gcc/testsuite/gcc.dg/tree-ssa/loopclosedphi.c
new file mode 100644
index 000..d71b757fbca
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loopclosedphi.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fno-tree-ch -w -fdump-tree-loopdone-details" } */
+
+void
+t6 (int qz, int wh)
+{
+  int jl = wh;
+
+  while (1.0 * qz / wh < 1)
+{
+  qz = wh * (wh + 2);
+
+  while (wh < 1)
+jl = 0;
+}
+
+  while (qz < 1)
+qz = jl * wh;
+}
+
+/* { dg-final { scan-tree-dump-times "Replacing" 2 "loopdone"} } */
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 5e8365d4e83..7d680b2f5d2 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -530,6 +530,7 @@ tree_ssa_loop_done (void)
   free_numbers_of_iterations_estimates (cfun);
   scev_finalize ();
   loop_optimizer_finalize ();
+  clean_up_loop_closed_phi (cfun);
   return 0;
 }
 
diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
index 9e35125e6e8..baa940b9d1e 100644
--- a/gcc/tree-ssa-loop.h
+++ b/gcc/tree-ssa-loop.h
@@ -67,6 +67,7 @@ public:
 extern bool for_each_index (tree *, bool (*) (tree, tree *, void *), void *);
 extern char *get_lsm_tmp_name (tree ref, unsigned n, const char *suffix = 
NULL);
 extern unsigned tree_num_loop_insns (class loop *, struct eni_weights *);
+extern unsigned clean_up_loop_closed_phi (function *);
 
 /* Returns the loop of the statement STMT.  */
 
diff --git a/gcc/tree-ssa-propagate.c b/gcc/tree-ssa-propagate.c
index 87dbf55fab9..813143852b9 100644
--- a/gcc/tree-ssa-propagate.c
+++ b/gcc/tree-ssa-propagate.c
@@ -1549,4 +1549,123 @@ propagate_tree_value_into_stmt (gimple_stmt_iterator 
*gsi, tree val)
   else
 gcc_unreachable ();
 }
+
+/* Propagate RHS into all uses of LHS (when possible).
+
+   RHS and LHS are derived from STMT, which is passed in solely so
+   that we can remove it if propagation is successful.  */
+
+static bool
+propagate_rhs_into_lhs (gphi *stmt, tree lhs, tree rhs)
+{
+  use_operand_p use_p;
+  imm_use_iterator iter;
+  gimple_stmt_iterator gsi;
+  gimple *use_stmt;
+  bool changed = false;
+  bool all = true;
+
+  /* Dump details.  */
+  if (dump_file && (dump_flags & TDF_DETAILS))
+{
+  fprintf (dump_file, "  Replacing '");
+  print_generic_expr (dump_file, lhs, dump_flags);
+  fprintf (dump_file, "' with '");
+  print_generic_expr (dump_file, rhs, dump_flags);
+  fprintf (dump_file, "'\n");
+}
+
+  /* Walk over every use of LHS and try to replace the use with RHS. */
+  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
+{
+      /* It is not safe to propagate into the statements below.  */
+  if (gimple_debug_bind_p (use_stmt)
+ || (gimple_code (use_stmt) == GIMPLE_ASM
+ && !may_propagate_copy_into_asm (lhs))
+ || (TREE_CODE (rhs) == SSA_NAME
+ && SSA_NAME_DEF_STMT (rhs) == use_stmt))
+   {
+ all = false;
+ continue;
+   }
+
+  /* Dump details.  */
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   {
+ fprintf (dump_file, "Original statement:");
+ print_gimple_stmt (dump_file, use_stmt, 0, dump_flags);
+   }
+
+  /* Propagate the RHS into this use of the LHS.  */
+  FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+   propagate_value (use_p, rhs);
+
+  /* Propagation may expose new operands to the renamer.  */
+  update_stmt (use_stmt);
+
+      /* If a variable index is replaced with a constant, then
+         update the invariant flag for ADDR_EXPRs.  */
+  if (gimple_assign_single_p (use_stmt)
+ && TREE_CODE (gimple_assign_rhs1 (use_stmt)) == ADDR_EXPR)
+   recompute_tree_invariant_for_addr_expr (gimple_assign_rhs1 (use_stmt));
+
+  /* Dump details.  */
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   {
+ fprintf 

[PATCH] fold x << (n % C) to x << (n & C-1) if C meets power2

2020-10-15 Thread guojiufu via Gcc-patches
Hi,

I just checked the patch below for PR66552.
https://gcc.gnu.org/pipermail/gcc-patches/2020-February/540930.html
It seems this patch works fine now.  This patch fixes PR66552, which
requests optimizing (x shift (n mod C)) to
(x shift (n bit_and (C - 1))) when C is a constant power of two.
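
The identity being exploited (a standalone illustration, not part of the
patch): for a non-negative count and power-of-two C, n % C keeps exactly
the low log2(C) bits, which is n & (C - 1); a negative n % C would be a
negative shift count, which is already UB:

  unsigned shift_mod (unsigned x, int n) { return x << (n % 32); }
  unsigned shift_and (unsigned x, int n) { return x << (n & 31); }
  /* e.g. n = 37: 37 % 32 == 5 and 37 & 31 == 5, so both shift by 5.  */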

As for testing, bootstrap and regtest pass on ppc64le.  Is it ok for trunk?

Jiufu Guo

gcc/ChangeLog
2020-10-14  Li Jia He  

PR tree-optimization/66552
* match.pd (x << (n % C) -> x << (n & C-1)): New simplification.

testsuite/ChangeLog
2020-10-14  Li Jia He  

PR tree-optimization/66552
* testsuite/gcc.dg/pr66552.c: New testcase.


---
 gcc/match.pd   | 11 ++-
 gcc/testsuite/gcc.dg/pr66552.c | 14 ++
 2 files changed, 24 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr66552.c

diff --git a/gcc/match.pd b/gcc/match.pd
index c3b88168ac4..9070812fe7b 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -607,12 +607,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 /* Optimize TRUNC_MOD_EXPR by a power of two into a BIT_AND_EXPR,
i.e. "X % C" into "X & (C - 1)", if X and C are positive.
Also optimize A % (C << N)  where C is a power of 2,
-   to A & ((C << N) - 1).  */
+   to A & ((C << N) - 1).  Also optimize "A shift (B % C)", where C
+   is a power of 2 and shift is "<<" or ">>", to "A shift (B & (C - 1))";
+   (B % C) is assumed to be non-negative, as shifting by a negative
+   value would be UB.  */
 (match (power_of_two_cand @1)
  INTEGER_CST@1)
 (match (power_of_two_cand @1)
  (lshift INTEGER_CST@1 @2))
 (for mod (trunc_mod floor_mod)
+ (for shift (lshift rshift)
+  (simplify
+   (shift @0 (mod @1 (power_of_two_cand@2 @3)))
+   (if (integer_pow2p (@3) && tree_int_cst_sgn (@3) > 0)
+(shift @0 (bit_and @1 (minus @2 { build_int_cst (TREE_TYPE (@2), 1); }))
  (simplify
   (mod @0 (convert?@3 (power_of_two_cand@1 @2)))
   (if ((TYPE_UNSIGNED (type)
diff --git a/gcc/testsuite/gcc.dg/pr66552.c b/gcc/testsuite/gcc.dg/pr66552.c
new file mode 100644
index 000..7583c9ad25a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr66552.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lower" } */
+
+unsigned a(unsigned x, int n)
+{
+  return x >> (n % 32);
+}
+
+unsigned b(unsigned x, int n)
+{
+  return x << (n % 32);
+}
+
+/* { dg-final { scan-tree-dump-not " % " "lower" } } */
-- 
2.25.1



[PATCH 1/2] correct BB frequencies after loop changed

2020-10-09 Thread guojiufu via Gcc-patches
When investigating the issue from 
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549786.html
I find the BB COUNTs of a loop seem to be inaccurate in some cases.
For example:

In the figure below:


       COUNT:268435456   pre-header
            |
            |   .________.
            |   |        |
            v   v        |
       COUNT:805306369   |   (header)
           / \           |
       33%/   \          |
         /     \         |
        v       v        |
COUNT:268435456   COUNT:536870911
   exit-edge        latch|
                      .__.

Those COUNTs satisfy the equations below:
COUNT of exit-edge:268435456 = COUNT of pre-header:268435456
COUNT of exit-edge:268435456 = COUNT of header:805306369 * 33%
COUNT of header:805306369 = COUNT of pre-header:268435456 + COUNT of latch:536870911


While after pcom:

   COUNT:268435456  pre-header
|
|  ..
|  ||
V  v|
   COUNT:268435456|
   / \  |
   50%/   \ |
 / \|
v   v   |
COUNT:134217728  COUNT:134217728  | 
exit-edge |   latch |
  ._.

COUNT of header:268435456 != COUNT of pre-header:268435456 + COUNT of latch:134217728
COUNT of exit-edge:134217728 != COUNT of pre-header:268435456

In some cases the probability of the exit edge is easy to estimate,
and the COUNTs of the other BBs in the loop can then be recalculated.
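
As a worked example for the pcom figure above (a sketch of the
arithmetic, not output of the pass): if the exit probability is
re-estimated as 50%, then since the exit COUNT must equal the
pre-header COUNT,

COUNT of header = COUNT of pre-header / 50% = 268435456 * 2 = 536870912
COUNT of latch  = COUNT of header - COUNT of pre-header = 268435456

which restores COUNT of header = COUNT of pre-header + COUNT of latch.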

Bootstrap and regtest pass on ppc64le. Is this ok for trunk?

Jiufu

gcc/ChangeLog:
2020-10-09  Jiufu Guo   

* cfgloopmanip.h (recompute_loop_frequencies): New function.
* cfgloopmanip.c (recompute_loop_frequencies): New implementation.
* tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Call
recompute_loop_frequencies.

---
 gcc/cfgloopmanip.c| 53 +++
 gcc/cfgloopmanip.h|  2 +-
 gcc/tree-ssa-loop-manip.c | 28 +++--
 3 files changed, 57 insertions(+), 26 deletions(-)

diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c
index 73134a20e33..b0ca82a67fd 100644
--- a/gcc/cfgloopmanip.c
+++ b/gcc/cfgloopmanip.c
@@ -31,6 +31,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimplify-me.h"
 #include "tree-ssa-loop-manip.h"
 #include "dumpfile.h"
+#include "cfgrtl.h"
 
 static void copy_loops_to (class loop **, int,
   class loop *);
@@ -1773,3 +1774,55 @@ loop_version (class loop *loop,
 
   return nloop;
 }
+
+/* Recalculate the COUNTs of the BBs in LOOP, given that the probability
+   of the exit edge is NEW_PROB.  */
+
+bool
+recompute_loop_frequencies (class loop *loop, profile_probability new_prob)
+{
+  edge exit = single_exit (loop);
+  if (!exit)
+    return false;
+
+  edge e;
+  edge_iterator ei;
+  edge non_exit;
+  basic_block *bbs;
+  profile_count exit_count = loop_preheader_edge (loop)->count ();
+  profile_probability exit_p = exit_count.probability_in (loop->header->count);
+  profile_count base_count = loop->header->count;
+  profile_count after_num = base_count.apply_probability (exit_p);
+  profile_count after_den = base_count.apply_probability (new_prob);
+
+  /* Update BB counts in the loop body.  Since
+       COUNT of exit edge = COUNT of pre-header, and
+       COUNT of exit edge = COUNT of header * exit_probability,
+     the new COUNT of header = COUNT of header * old_exit_p / new_prob.  */
+  bbs = get_loop_body (loop);
+  scale_bbs_frequencies_profile_count (bbs, loop->num_nodes, after_num,
+				       after_den);
+  free (bbs);
+
+  /* Update probability and count of the non-exit successor of the exit
+     block (usually the latch).  */
+  FOR_EACH_EDGE (e, ei, exit->src->succs)
+    if (e != exit)
+      break;
+  non_exit = e;
+
+  non_exit->probability = new_prob.invert ();
+  non_exit->dest->count = profile_count::zero ();
+  FOR_EACH_EDGE (e, ei, non_exit->dest->preds)
+    non_exit->dest->count += e->src->count.apply_probability (e->probability);
+
+  /* Update probability and count of the exit destination.  */
+  exit->probability = new_prob;
+  exit->dest->count = profile_count::zero ();
+  FOR_EACH_EDGE (e, ei, exit->dest->preds)
+    exit->dest->count += e->src->count.apply_probability (e->probability);
+
+  if (current_ir_type () != IR_GIMPLE)
+    update_br_prob_note (exit->src);
+
+  return true;
+}
diff --git a/gcc/cfgloopmanip.h b/gcc/cfgloopmanip.h
index 7331e574e2f..d55bab17f65 100644
--- a/gcc/cfgloopmanip.h
+++ b/gcc/cfgloopmanip.h
@@ -62,5 +62,5 @@ class loop * loop_version (class loop *, void *,
basic_block *,
profile_probability, profile_probability,
 

[PATCH 2/2] reset edge probibility and BB-count for peeled/unrolled loop

2020-10-09 Thread guojiufu via Gcc-patches
Hi,
PR68212 reports that the COUNT of an unrolled loop is not correct, and
the PR's comments also note that the loop becomes 'cold'.

This patch fixes the wrong COUNT/PROB of the unrolled loop.  It also
handles the case where unrolling with an unreliable count can cause a
loop to no longer look hot and therefore not get aligned: the patch
scales by profile_probability::likely () if the unrolled count gets
unrealistically small.
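
A numeric sketch of the new peeling math below (made-up numbers, not
taken from the PR): if count_in is 100 and the entry edge count
e->count () is 40, peeling 2 copies whose scale_step is 80% each
leaves a final preheader count of 40 * 0.8 * 0.8 = 25.6; the peeled
copies therefore account for 40 - 25.6 = 14.4 exits, so
new_loop_count = 100 - 14.4 = 85.6 and scale_main = 85.6 / 100.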

Bootstrap/regtest on powerpc64le with no new regressions. Ok for trunk?

Jiufu Guo

gcc/ChangeLog:
2020-10-09  Jiufu Guo   
Pat Haugen  

PR rtl-optimization/68212
* cfgloopmanip.c (duplicate_loop_to_header_edge): Reset probablity
of unrolled/peeled loop.

testsuite/ChangeLog:
2020-10-09  Jiufu Guo   
Pat Haugen  
PR rtl-optimization/68212
* gcc.dg/pr68212.c: New test.


---
 gcc/cfgloopmanip.c | 31 +--
 gcc/testsuite/gcc.dg/pr68212.c | 13 +
 2 files changed, 42 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr68212.c

diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c
index b0ca82a67fd..d3c95498402 100644
--- a/gcc/cfgloopmanip.c
+++ b/gcc/cfgloopmanip.c
@@ -1260,14 +1260,30 @@ duplicate_loop_to_header_edge (class loop *loop, edge e,
  /* If original loop is executed COUNT_IN times, the unrolled
 loop will account SCALE_MAIN_DEN times.  */
  scale_main = count_in.probability_in (scale_main_den);
+
+ /* If we are guessing at the number of iterations and count_in
+becomes unrealistically small, reset probability.  */
+ if (!(count_in.reliable_p () || loop->any_estimate))
+   {
+ profile_count new_count_in = count_in.apply_probability (scale_main);
+ profile_count preheader_count = loop_preheader_edge (loop)->count ();
+ if (new_count_in.apply_scale (1, 10) < preheader_count)
+   scale_main = profile_probability::likely ();
+   }
+
  scale_act = scale_main * prob_pass_main;
}
   else
{
+ profile_count new_loop_count;
  profile_count preheader_count = e->count ();
- for (i = 0; i < ndupl; i++)
-   scale_main = scale_main * scale_step[i];
  scale_act = preheader_count.probability_in (count_in);
+ /* Compute final preheader count after peeling NDUPL copies.  */
+ for (i = 0; i < ndupl; i++)
+   preheader_count = preheader_count.apply_probability (scale_step[i]);
+ /* Subtract out exit(s) from peeled copies.  */
+ new_loop_count = count_in - (e->count () - preheader_count);
+ scale_main = new_loop_count.probability_in (count_in);
}
 }
 
@@ -1383,6 +1399,17 @@ duplicate_loop_to_header_edge (class loop *loop, edge e,
  scale_bbs_frequencies (new_bbs, n, scale_act);
  scale_act = scale_act * scale_step[j];
}
+
+  /* Need to update PROB of exit edge and corresponding COUNT.  */
+  if (orig && is_latch && (!bitmap_bit_p (wont_exit, j + 1))
+ && bbs_to_scale)
+   {
+ edge new_exit = new_spec_edges[SE_ORIG];
+ profile_count new_count = new_exit->src->count;
+ profile_count exit_count = loop_preheader_edge (loop)->count ();
+ profile_probability prob = exit_count.probability_in (new_count);
+ recompute_loop_frequencies (loop, prob);
+   }
 }
   free (new_bbs);
   free (orig_loops);
diff --git a/gcc/testsuite/gcc.dg/pr68212.c b/gcc/testsuite/gcc.dg/pr68212.c
new file mode 100644
index 000..e0cf71d5202
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr68212.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-vectorize -funroll-loops --param max-unroll-times=4 -fdump-rtl-alignments" } */
+
+void foo(long int *a, long int *b, long int n)
+{
+  long int i;
+
+  for (i = 0; i < n; i++)
+a[i] = *b;
+}
+
+/* { dg-final { scan-rtl-dump-times "internal loop alignment added" 1 "alignments"} } */
+
-- 
2.25.1



[RFC] update COUNTs of BB in loop.

2020-09-21 Thread guojiufu via Gcc-patches
Hi,

When investigating the issue from 
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549786.html
I find that the BB COUNTs of a loop are not accurate in some cases.
For example:

In below figure:


   COUNT:268435456  pre-header
|
|  ..
|  ||
V  v|
   COUNT:805306369|
   / \  |
   33%/   \ |
 / \|
v   v   |
COUNT:268435456  COUNT:536870911  | 
exit-edge |   latch |
  ._.

Those COUNTs have below equations:
COUNT of exit-edge:268435456 = COUNT of pre-header:268435456
COUNT of exit-edge:268435456 = COUNT of header:805306369 * 33%
COUNT of header:805306369 = COUNT of pre-header:268435456 + COUNT of latch:536870911


While after pcom:

   COUNT:268435456  pre-header
|
|  ..
|  ||
V  v|
   COUNT:268435456|
   / \  |
   50%/   \ |
 / \|
v   v   |
COUNT:134217728  COUNT:134217728  | 
exit-edge |   latch |
  ._.

COUNT of header:268435456 != COUNT of pre-header:268435456 + COUNT of latch:134217728

In some cases the probability of the exit edge is easy to estimate,
and the COUNTs of the other BBs in the loop can then be recalculated.

Below is a patch to reset the COUNTs as described above.

Thanks for comments!!

Jiufu

---
 gcc/cfgloopmanip.c| 43 +++
 gcc/cfgloopmanip.h|  2 +-
 gcc/tree-ssa-loop-manip.c | 28 +++--
 3 files changed, 47 insertions(+), 26 deletions(-)

diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c
index 73134a20e33..180916b0974 100644
--- a/gcc/cfgloopmanip.c
+++ b/gcc/cfgloopmanip.c
@@ -1773,3 +1773,46 @@ loop_version (class loop *loop,
 
   return nloop;
 }
+
+/* Recalculate the COUNTs of the BBs in LOOP, given that the probability
+   of the exit edge is NEW_EXIT_P.  */
+
+bool
+recompute_loop_frequencies (class loop *loop, profile_probability new_exit_p)
+{
+  edge exit = single_exit (loop);
+  if (!exit)
+    return false;
+
+  basic_block *bbs;
+  profile_count exit_count = loop_preheader_edge (loop)->count ();
+  profile_probability exit_p = exit_count.probability_in (loop->header->count);
+  profile_count base_count = loop->header->count;
+  profile_count after_num = base_count.apply_probability (exit_p);
+  profile_count after_den = base_count.apply_probability (new_exit_p);
+
+  /* Update BB counts in the loop body.  Since
+       COUNT of exit edge = COUNT of pre-header, and
+       COUNT of exit edge = COUNT of header * exit_probability,
+     the new COUNT of header = COUNT of header * old_exit_p / new_exit_p.  */
+  bbs = get_loop_body (loop);
+  scale_bbs_frequencies_profile_count (bbs, loop->num_nodes, after_num,
+				       after_den);
+  free (bbs);
+
+  /* Update probability and count of the latch.  */
+  edge new_nonexit = single_pred_edge (loop->latch);
+  new_nonexit->probability = new_exit_p.invert ();
+  loop->latch->count
+    = loop->header->count.apply_probability (new_nonexit->probability);
+
+  /* Update probability and count of the exit destination.  */
+  edge e;
+  edge_iterator ei;
+  exit->probability = new_exit_p;
+  exit->dest->count = profile_count::zero ();
+  FOR_EACH_EDGE (e, ei, exit->dest->preds)
+    exit->dest->count += e->src->count.apply_probability (e->probability);
+
+  return true;
+}
diff --git a/gcc/cfgloopmanip.h b/gcc/cfgloopmanip.h
index 7331e574e2f..d55bab17f65 100644
--- a/gcc/cfgloopmanip.h
+++ b/gcc/cfgloopmanip.h
@@ -62,5 +62,5 @@ class loop * loop_version (class loop *, void *,
basic_block *,
profile_probability, profile_probability,
profile_probability, profile_probability, bool);
-
+extern bool recompute_loop_frequencies (class loop *, profile_probability);
 #endif /* GCC_CFGLOOPMANIP_H */
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index a2717a411a3..4060d601cf8 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -1251,7 +1251,6 @@ tree_transform_and_unroll_loop (class loop *loop, unsigned factor,
   bool ok;
   unsigned i;
   profile_probability prob, prob_entry, scale_unrolled;
-  profile_count freq_e, freq_h;
   gcov_type new_est_niter = niter_for_unrolled_loop (loop, factor);
   unsigned irr = loop_preheader_edge (loop)->flags & EDGE_IRREDUCIBLE_LOOP;
  auto_vec<edge> to_remove;
@@ -1393,33 +1392,12 @@ tree_transform_and_unroll_loop (class loop *loop, 

[PATCH] Check calls before loop unrolling

2020-08-19 Thread guojiufu via Gcc-patches
Hi,

When unrolling loops, calls inside the loop can make unrolling
unprofitable.  This patch adds a param, param_max_unrolled_calls: if
the number of calls inside the loop is bigger than this param, the
loop is prevented from unrolling.

The patch checks the _average_ number of calls, i.e. the sum of the
call counts weighted by the probability that each call is executed.
This _average_ number can be a fraction; to keep precision, the param
holds the threshold in a scaled form.
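
As a worked example (illustrative numbers only): a loop body with one
call in a block executed every iteration and another call in a block
executed 30% of iterations has an average of 1 + 0.3 = 1.3 calls per
iteration, which is then compared against the scaled threshold.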

Bootstrap and regtest pass on powerpc64le.  Is this ok for trunk?

gcc/ChangeLog
2020-08-19  Jiufu Guo   

* params.opt (param_max_unrolled_average_calls_x1): New param.
* cfgloop.h (average_num_loop_calls): New declare.
* cfgloopanal.c (average_num_loop_calls): New function.
* loop-unroll.c (decide_unroll_constant_iterations,
decide_unroll_runtime_iterations,
decide_unroll_stupid): Check average_num_loop_calls and
param_max_unrolled_average_calls_x1.
---
 gcc/cfgloop.h |  2 ++
 gcc/cfgloopanal.c | 25 +
 gcc/loop-unroll.c | 10 ++
 gcc/params.opt|  4 
 4 files changed, 41 insertions(+)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 18b404e292f..dab933da150 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_CFGLOOP_H
 
 #include "cfgloopmanip.h"
+#include "sreal.h"
 
 /* Structure to hold decision about unrolling/peeling.  */
 enum lpt_dec
extern vec<edge> get_loop_exit_edges (const class loop *, basic_block * = NULL);
extern edge single_exit (const class loop *);
extern edge single_likely_exit (class loop *loop, vec<edge> exits);
 extern unsigned num_loop_branches (const class loop *);
+extern sreal average_num_loop_calls (const class loop *);
 
 extern edge loop_preheader_edge (const class loop *);
 extern edge loop_latch_edge (const class loop *);
diff --git a/gcc/cfgloopanal.c b/gcc/cfgloopanal.c
index 0b33e8272a7..a314db4e0c0 100644
--- a/gcc/cfgloopanal.c
+++ b/gcc/cfgloopanal.c
@@ -233,6 +233,31 @@ average_num_loop_insns (const class loop *loop)
   return ret;
 }
 
+/* Compute the average number of call insns executed per iteration of LOOP.  */
+sreal
+average_num_loop_calls (const class loop *loop)
+{
+  basic_block *bbs;
+  rtx_insn *insn;
+  unsigned int i, bncalls;
+  sreal ncalls = 0;
+
+  bbs = get_loop_body (loop);
+  for (i = 0; i < loop->num_nodes; i++)
+{
+  bncalls = 0;
+  FOR_BB_INSNS (bbs[i], insn)
+   if (CALL_P (insn))
+ bncalls++;
+
+  ncalls += (sreal) bncalls
+   * bbs[i]->count.to_sreal_scale (loop->header->count);
+}
+  free (bbs);
+
+  return ncalls;
+}
+
 /* Returns expected number of iterations of LOOP, according to
measured or guessed profile.
 
diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c
index 693c7768868..56b8fb37d2a 100644
--- a/gcc/loop-unroll.c
+++ b/gcc/loop-unroll.c
@@ -370,6 +370,10 @@ decide_unroll_constant_iterations (class loop *loop, int 
flags)
 nunroll = nunroll_by_av;
   if (nunroll > (unsigned) param_max_unroll_times)
 nunroll = param_max_unroll_times;
+  if (!loop->unroll
+      && (average_num_loop_calls (loop) * (sreal) 1).to_int ()
+	 > (unsigned) param_max_unrolled_average_calls_x1)
+    nunroll = 0;
 
   if (targetm.loop_unroll_adjust)
 nunroll = targetm.loop_unroll_adjust (nunroll, loop);
@@ -689,6 +693,9 @@ decide_unroll_runtime_iterations (class loop *loop, int 
flags)
 nunroll = nunroll_by_av;
   if (nunroll > (unsigned) param_max_unroll_times)
 nunroll = param_max_unroll_times;
+  if ((average_num_loop_calls (loop) * (sreal) 1).to_int ()
+      > (unsigned) param_max_unrolled_average_calls_x1)
+    nunroll = 0;
 
   if (targetm.loop_unroll_adjust)
 nunroll = targetm.loop_unroll_adjust (nunroll, loop);
@@ -1173,6 +1180,9 @@ decide_unroll_stupid (class loop *loop, int flags)
 nunroll = nunroll_by_av;
   if (nunroll > (unsigned) param_max_unroll_times)
 nunroll = param_max_unroll_times;
+  if ((average_num_loop_calls (loop) * (sreal) 1).to_int ()
+      > (unsigned) param_max_unrolled_average_calls_x1)
+    nunroll = 0;
 
   if (targetm.loop_unroll_adjust)
 nunroll = targetm.loop_unroll_adjust (nunroll, loop);
diff --git a/gcc/params.opt b/gcc/params.opt
index f39e5d1a012..80605861223 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -634,6 +634,10 @@ The maximum number of unrollings of a single loop.
Common Joined UInteger Var(param_max_unrolled_insns) Init(200) Param Optimization
 The maximum number of instructions to consider to unroll in a loop.
 
+-param=max-unrolled-average-calls-x1=
+Common Joined UInteger Var(param_max_unrolled_average_calls_x1) Init(0) Param Optimization
+The maximum average number of calls to allow when unrolling a loop, stored in scaled form.
+
 -param=max-unswitch-insns=
 Common Joined UInteger 

[PATCH rs6000]: Refine RTL unroll hook for small loops

2020-07-13 Thread guojiufu via Gcc-patches
Hi,

For very small loops (up to 6 insns), it is fine to unroll 4 times to
run fast, with less latency and better cache usage.  For example:
 while (i) a[--i] = NULL;   while (p < e)  *d++ = *p++;
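
Conceptually, a 4x unroll of the copy loop looks like the following
hand expansion (illustrative only; the RTL unroller emits its own
remainder handling):

```c
/* 4x unrolled form of: while (p < e) *d++ = *p++;  */
void copy (char *d, const char *p, const char *e)
{
  while (e - p >= 4)
    {
      d[0] = p[0]; d[1] = p[1]; d[2] = p[2]; d[3] = p[3];
      d += 4; p += 4;
    }
  while (p < e)   /* remainder iterations */
    *d++ = *p++;
}
```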

With this enhancement, we see some performance improvement on some
workloads (e.g. SPEC2017).

Bootstrap and regtest pass on powerpc64le. Ok for trunk?

BR,
Jiufu Guo

2020-07-13  Jiufu Guo   

* config/rs6000/rs6000.c (rs6000_loop_unroll_adjust): Refine hook.

---
 gcc/config/rs6000/rs6000.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 58f5d780603..06844fdba57 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5135,16 +5135,15 @@ rs6000_destroy_cost_data (void *data)
 static unsigned
 rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
 {
-   if (unroll_only_small_loops)
+  if (unroll_only_small_loops)
 {
-  /* TODO: This is hardcoded to 10 right now.  It can be refined, for
-example we may want to unroll very small loops more times (4 perhaps).
-We also should use a PARAM for this.  */
+  /* TODO: These thresholds are hardcoded; tunable PARAM(s) may refine them.  */
+  if (loop->ninsns <= 6)
+   return MIN (4, nunroll);
   if (loop->ninsns <= 10)
return MIN (2, nunroll);
-  else
-   return 0;
+
+  return 0;
 }
 
   return nunroll;
-- 
2.25.1



[PATCH] rs6000: Refine RTL unroll adjust hook

2020-07-06 Thread guojiufu via Gcc-patches
For very small loops (up to 6 insns), it is fine to unroll 4 times to
use cache lines better.  For example:
 `while (i) a[--i] = NULL;   while (p < e)  *d++ = *p++;`

Very complex loops, on the other hand, can suffer negative impacts from
unrolling, such as branch misses or cache misses.  For example, the
loop below contains calls, early exits and branches:
```
  for (int i = 0; i < n; i++) {
      int e = a[i];

      if (function_call(e))  break;
  }
```

This patch enhances RTL unrolling for small loops and prevents
unrolling of complex loops.
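
As a worked example of the heuristic in the patch: a 12-insn loop with
one call and two branches gives (1 + 2) * 5 = 15 > 12, so if it also
has more than one exit it is treated as complex and not unrolled.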

gcc/ChangeLog
2020-07-03  Jiufu Guo  

* config/rs6000/rs6000.c (rs6000_loop_unroll_adjust): Refine hook.
(rs6000_complex_loop_p): New function.
(num_loop_calls): New function.
---
 gcc/config/rs6000/rs6000.c | 46 +-
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 58f5d780603..a4874fa0efc 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5130,22 +5130,56 @@ rs6000_destroy_cost_data (void *data)
   free (data);
 }
 
+/* Count the number of call insns in LOOP.  */
+static unsigned int
+num_loop_calls (struct loop *loop)
+{
+  basic_block *bbs;
+  rtx_insn *insn;
+  unsigned int i;
+  unsigned int call_ins_num = 0;
+
+  bbs = get_loop_body (loop);
+  for (i = 0; i < loop->num_nodes; i++)
+FOR_BB_INSNS (bbs[i], insn)
+  if (CALL_P (insn))
+   call_ins_num++;
+
+  free (bbs);
+
+  return call_ins_num;
+}
+
+/* Return true if LOOP is too complex to be unrolled.  */
+static bool
+rs6000_complex_loop_p (struct loop *loop)
+{
+  unsigned call_num;
+
+  return (loop->ninsns > 10
+	  && (call_num = num_loop_calls (loop)) > 0
+	  && (call_num + num_loop_branches (loop)) * 5 > loop->ninsns
+	  && !single_exit (loop));
+}
+
 /* Implement targetm.loop_unroll_adjust.  */
 
 static unsigned
 rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
 {
-   if (unroll_only_small_loops)
+  if (unroll_only_small_loops)
 {
-  /* TODO: This is hardcoded to 10 right now.  It can be refined, for
-example we may want to unroll very small loops more times (4 perhaps).
-We also should use a PARAM for this.  */
+  if (loop->ninsns <= 6)
+   return MIN (4, nunroll);
   if (loop->ninsns <= 10)
return MIN (2, nunroll);
-  else
-   return 0;
+
+  return 0;
 }
 
+  if (rs6000_complex_loop_p (loop))
+return 0;
+
   return nunroll;
 }
 
-- 
2.25.1



[PATCH 1/2] Introduce flag_cunroll_grow_size for cunroll

2020-05-28 Thread guojiufu via Gcc-patches
From: Jiufu Guo 

Currently, the GIMPLE complete unroller (cunroll) checks
flag_unroll_loops and flag_peel_loops to decide whether size growth is
allowed.  Besides affecting cunroll, flag_unroll_loops also controls
the RTL unroller.  To have more freedom to control cunroll and the RTL
unroller, this patch introduces flag_cunroll_grow_size.  With this
patch, we can control cunroll and the RTL unroller independently.
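
For example (a behavior sketch based on the process_options hunk
below): plain -funroll-loops or -fpeel-loops still enables size growth,
since flag_cunroll_grow_size defaults from those flags when not set
explicitly, but a target or the user can now override the new flag
without touching the RTL unroller.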

Bootstrap/regtest pass on powerpc64le.  OK for trunk?  And backport to
GCC 10 after a week?

gcc/ChangeLog
2020-02-28  Jiufu Guo  

* common.opt (flag_cunroll_grow_size): New flag.
* toplev.c (process_options): Set flag_cunroll_grow_size.
* tree-ssa-loop-ivcanon.c (pass_complete_unroll::execute):
Use flag_cunroll_grow_size.
---
 gcc/common.opt  | 4 
 gcc/toplev.c| 4 
 gcc/tree-ssa-loop-ivcanon.c | 3 +--
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index 4464049fc1f..1d0fa7b1749 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2856,6 +2856,10 @@ funroll-all-loops
 Common Report Var(flag_unroll_all_loops) Optimization
 Perform loop unrolling for all loops.
 
+funroll-completely-grow-size
+Var(flag_cunroll_grow_size) Init(2)
+; Control cunroll to allow size growth during complete unrolling
+
 ; Nonzero means that loop optimizer may assume that the induction variables
 ; that control loops do not overflow and that the loops with nontrivial
 ; exit condition are not infinite
diff --git a/gcc/toplev.c b/gcc/toplev.c
index 96316fbd23b..8d52358efdd 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -1474,6 +1474,10 @@ process_options (void)
   if (flag_unroll_all_loops)
 flag_unroll_loops = 1;
 
+  /* Allow cunroll to grow size accordingly.  */
+  if (flag_cunroll_grow_size == AUTODETECT_VALUE)
+flag_cunroll_grow_size = flag_unroll_loops || flag_peel_loops;
+
   /* web and rename-registers help when run after loop unrolling.  */
   if (flag_web == AUTODETECT_VALUE)
 flag_web = flag_unroll_loops;
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index 8ab6ab3330c..d6a4617a6a1 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -1603,8 +1603,7 @@ pass_complete_unroll::execute (function *fun)
  re-peeling the same loop multiple times.  */
   if (flag_peel_loops)
 peeled_loops = BITMAP_ALLOC (NULL);
-  unsigned int val = tree_unroll_loops_completely (flag_unroll_loops
-  || flag_peel_loops
+  unsigned int val = tree_unroll_loops_completely (flag_cunroll_grow_size
   || optimize >= 3, true);
   if (peeled_loops)
 {
-- 
2.17.1



[PATCH 2/2] rs6000: allow cunroll to grow size according to -funroll-loop or -fpeel-loops

2020-05-28 Thread guojiufu via Gcc-patches
From: Jiufu Guo 

Previously, flag_unroll_loops was turned on at -O2 implicitly.  That
also turned on cunroll with size growth allowed, so cunroll would
unroll/peel a loop even when the loop is complex, like the code in
PR95018.  With this patch, size growth for cunroll is allowed only if
-funroll-loops or -fpeel-loops is specified explicitly.

Bootstrap/regtest pass on powerpc64le. OK for trunk? And backport to
GCC10?

BR,
Jiufu

gcc/ChangeLog
2020-02-28  Jiufu Guo  

PR target/95018
* config/rs6000/rs6000.c (rs6000_option_override_internal):
Override flag_cunroll_grow_size.

---
 gcc/config/rs6000/rs6000.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 8435bc15d72..df6e03146cb 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4567,7 +4567,12 @@ rs6000_option_override_internal (bool global_init_p)
unroll_only_small_loops = 0;
  if (!global_options_set.x_flag_rename_registers)
flag_rename_registers = 1;
+ if (!global_options_set.x_flag_cunroll_grow_size)
+   flag_cunroll_grow_size = 1;
}
+  else
+   if (!global_options_set.x_flag_cunroll_grow_size)
+ flag_cunroll_grow_size = flag_peel_loops;
 
   /* If using typedef char *va_list, signal that
 __builtin_va_start (, 0) can be optimized to
-- 
2.17.1



[PATCH 2/2] rs6000: Turn on -frtl-unroll-loops instead of -funroll-loops at -O2

2020-05-25 Thread guojiufu via Gcc-patches
Previously, -funroll-loops was turned on at -O2, which also fully
turned on GIMPLE cunroll, and cunroll unrolls some complex loops.

This patch turns on only -frtl-unroll-loops at -O2 and continues to
use the previously tuned rs6000 heuristics for small loops, while no
longer turning on GIMPLE cunroll.  We may tune cunroll for -O2 in the
near future.

With this patch, checking/setting -fweb, -frename-registers and
-munroll-only-small-loops becomes simpler.  Together with
-frtl-unroll-loops, -fweb is useful, so it is turned on; and
-frename-registers no longer needs to be checked, because it is keyed
off -frtl-unroll-loops.

Bootstrap and regtest pass on powerpc64le, is this OK for trunk?
And backport to GCC 10 together with the patch "Separate
-funroll-loops for GIMPLE unroller and RTL unroller"?

Jiufu


gcc/ChangeLog
2020-05-25  Jiufu Guo   

PR target/95018
* common/config/rs6000/rs6000-common.c
(rs6000_option_optimization_table)
[OPT_LEVELS_2_PLUS_SPEED_ONLY]: Replace -funroll-loops
with -frtl-unroll-loops.  Remove -munroll-only-small-loops
and add -fweb.
[OPT_LEVELS_ALL]: Remove turn off -frename-registers.
* config/rs6000/rs6000.c (rs6000_option_override_internal):
-funroll-loops overrides -munroll-only-small-loops and
-frtl-unroll-loops.
---
 gcc/common/config/rs6000/rs6000-common.c | 11 +++
 gcc/config/rs6000/rs6000.c   | 21 ++---
 2 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/gcc/common/config/rs6000/rs6000-common.c 
b/gcc/common/config/rs6000/rs6000-common.c
index ee37b9dc90b..c7388edb867 100644
--- a/gcc/common/config/rs6000/rs6000-common.c
+++ b/gcc/common/config/rs6000/rs6000-common.c
@@ -34,14 +34,9 @@ static const struct default_options 
rs6000_option_optimization_table[] =
 { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 },
 /* Enable -fsched-pressure for first pass instruction scheduling.  */
 { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
-/* Enable -munroll-only-small-loops with -funroll-loops to unroll small
-   loops at -O2 and above by default.  */
-{ OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_funroll_loops, NULL, 1 },
-{ OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
-
-/* -frename-registers leads to non-optimal codegen and performance
-   on rs6000, turn it off by default.  */
-{ OPT_LEVELS_ALL, OPT_frename_registers, NULL, 0 },
+/* Enable -frtl-unroll-loops and -fweb at -O2 and above by default.  */
+{ OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_frtl_unroll_loops, NULL, 1 },
+{ OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_fweb, NULL, 1 },
 
 /* Double growth factor to counter reduced min jump length.  */
 { OPT_LEVELS_ALL, OPT__param_max_grow_copy_bb_insns_, NULL, 16 },
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 8435bc15d72..96620651a59 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4557,17 +4557,16 @@ rs6000_option_override_internal (bool global_init_p)
   param_sched_pressure_algorithm,
   SCHED_PRESSURE_MODEL);
 
-  /* Explicit -funroll-loops turns -munroll-only-small-loops off, and
-turns -frename-registers on.  */
-  if ((global_options_set.x_flag_unroll_loops && flag_unroll_loops)
-  || (global_options_set.x_flag_unroll_all_loops
-  && flag_unroll_all_loops))
-   {
- if (!global_options_set.x_unroll_only_small_loops)
-   unroll_only_small_loops = 0;
- if (!global_options_set.x_flag_rename_registers)
-   flag_rename_registers = 1;
-   }
+  /* If -f[no-]unroll-loops is specified explicitly, turn
+	 -frtl-unroll-loops on/off accordingly.  */
+  if (global_options_set.x_flag_unroll_loops
+	 && !global_options_set.x_flag_rtl_unroll_loops)
+	flag_rtl_unroll_loops = flag_unroll_loops;
+
+  /* If flag_unroll_loops is in effect, not _only_ small loops but
+	 also large loops are unrolled when possible.  */
+  if (!global_options_set.x_unroll_only_small_loops)
+	unroll_only_small_loops = flag_unroll_loops ? 0 : 1;
 
   /* If using typedef char *va_list, signal that
 __builtin_va_start (, 0) can be optimized to
-- 
2.17.1



[PATCH 1/2] Separate -funroll-loops for GIMPLE unroller and RTL unroller

2020-05-25 Thread guojiufu via Gcc-patches
Currently, option -funroll-loops controls both the GIMPLE unroller
and the RTL unroller, and it is not able to control them
independently.  This patch introduces different flags to control them
separately, which also provides more freedom to tune one of them
without affecting the other.

This patch introduces two undocumented flags: -fcomplete-unroll-loops
for GIMPLE cunroll, and -frtl-unroll-loops for the RTL unroller.  Both
are enabled by the original -funroll-loops.
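
For example (usage sketch under this split): plain -funroll-loops
still implies both new flags via process_options, while passing
-frtl-unroll-loops or -fcomplete-unroll-loops alone exercises only
the RTL unroller or only GIMPLE cunroll, respectively.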

Bootstrap and regtest pass on powerpc64le, is this ok for trunk?

Jiufu

ChangeLog:
2020-05-25  Jiufu Guo   

* common.opt: Add -frtl-unroll-loops and -fcomplete-unroll-loops.
* opts.c (enable_fdo_optimizations): Replace flag_unroll_loops
with flag_complete_unroll_loops.
* toplev.c (process_options): set flag_rtl_unroll_loops and
flag_complete_unroll_loops if not explicitly set by user.
* tree-ssa-loop-ivcanon.c (pass_complete_unroll::execute): Replace
flag_unroll_loops with flag_complete_unroll_loops.
* loop-init.c (pass_loop2::gate): Replace flag_unroll_loops with
flag_rtl_unroll_loops.
(pass_rtl_unroll_loops::gate): Replace flag_unroll_loops with
flag_rtl_unroll_loops.
---
 gcc/common.opt  | 8 
 gcc/loop-init.c | 6 +++---
 gcc/opts.c  | 2 +-
 gcc/toplev.c| 6 ++
 gcc/tree-ssa-loop-ivcanon.c | 2 +-
 5 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index 4464049fc1f..3b5ab52bb9d 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2856,6 +2856,14 @@ funroll-all-loops
 Common Report Var(flag_unroll_all_loops) Optimization
 Perform loop unrolling for all loops.
 
+frtl-unroll-loops
+Common Undocumented Var(flag_rtl_unroll_loops) Init(2) Optimization
+; Perform rtl loop unrolling when iteration count is known.
+
+fcomplete-unroll-loops
+Common Undocumented Var(flag_complete_unroll_loops) Init(2) Optimization
+; Perform GIMPLE loop complete unrolling.
+
 ; Nonzero means that loop optimizer may assume that the induction variables
 ; that control loops do not overflow and that the loops with nontrivial
 ; exit condition are not infinite
diff --git a/gcc/loop-init.c b/gcc/loop-init.c
index 401e5282907..e955068f36c 100644
--- a/gcc/loop-init.c
+++ b/gcc/loop-init.c
@@ -360,7 +360,7 @@ pass_loop2::gate (function *fun)
   if (optimize > 0
   && (flag_move_loop_invariants
  || flag_unswitch_loops
- || flag_unroll_loops
+ || flag_rtl_unroll_loops
  || (flag_branch_on_count_reg && targetm.have_doloop_end ())
  || cfun->has_unroll))
 return true;
@@ -560,7 +560,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *)
 {
-  return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
+  return (flag_rtl_unroll_loops || flag_unroll_all_loops || 
cfun->has_unroll);
 }
 
   virtual unsigned int execute (function *);
@@ -576,7 +576,7 @@ pass_rtl_unroll_loops::execute (function *fun)
   if (dump_file)
df_dump (dump_file);
 
-  if (flag_unroll_loops)
+  if (flag_rtl_unroll_loops)
flags |= UAP_UNROLL;
   if (flag_unroll_all_loops)
flags |= UAP_UNROLL_ALL;
diff --git a/gcc/opts.c b/gcc/opts.c
index ec3ca0720f9..64c35d8d7fc 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -1702,7 +1702,7 @@ enable_fdo_optimizations (struct gcc_options *opts,
 {
   SET_OPTION_IF_UNSET (opts, opts_set, flag_branch_probabilities, value);
   SET_OPTION_IF_UNSET (opts, opts_set, flag_profile_values, value);
-  SET_OPTION_IF_UNSET (opts, opts_set, flag_unroll_loops, value);
+  SET_OPTION_IF_UNSET (opts, opts_set, flag_complete_unroll_loops, value);
   SET_OPTION_IF_UNSET (opts, opts_set, flag_peel_loops, value);
   SET_OPTION_IF_UNSET (opts, opts_set, flag_tracer, value);
   SET_OPTION_IF_UNSET (opts, opts_set, flag_value_profile_transformations,
diff --git a/gcc/toplev.c b/gcc/toplev.c
index 96316fbd23b..c2b94d33464 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -1474,6 +1474,12 @@ process_options (void)
   if (flag_unroll_all_loops)
 flag_unroll_loops = 1;
 
+  if (flag_rtl_unroll_loops == AUTODETECT_VALUE)
+flag_rtl_unroll_loops = flag_unroll_loops;
+
+  if (flag_complete_unroll_loops == AUTODETECT_VALUE)
+flag_complete_unroll_loops = flag_unroll_loops;
+
   /* web and rename-registers help when run after loop unrolling.  */
   if (flag_web == AUTODETECT_VALUE)
 flag_web = flag_unroll_loops;
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index 8ab6ab3330c..cd5df353df5 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -1603,7 +1603,7 @@ pass_complete_unroll::execute (function *fun)
  re-peeling the same loop multiple times.  */
   if (flag_peel_loops)
 peeled_loops = BITMAP_ALLOC (NULL);
-  unsigned int val = tree_unroll_loops_completely (flag_unroll_loops
+  unsigned int val = tree_unroll_loops_completely (flag_complete_unroll_loops

[PATCH,rs6000] enable -fweb for small loops unrolling

2020-04-20 Thread guojiufu via Gcc-patches
Hi,

Previously -fweb was disabled when only unrolling small loops.  Since
then we have found cases where it helps to rename pseudos and avoid
some anti-dependences which may occur after unrolling.
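
A made-up illustration of the anti-dependence: after 2x unrolling with
a single pseudo t, the load into t in the second copy must wait for the
store that reads t in the first; -fweb renames the second use to a
fresh pseudo:

```c
/* one pseudo reused across both copies (anti-dependence on t): */
t = a[i];     d[i]   = t;
t = a[i+1];   d[i+1] = t;

/* after -fweb renaming (copies now independent): */
t0 = a[i];    d[i]   = t0;
t1 = a[i+1];  d[i+1] = t1;
```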

Below is a patch to re-enable -fweb during loop unrolling.

Bootstrap and regtest pass on powerpc, and a spec2017 run shows no
performance degradation.

Is this ok for trunk?

BR,
Jiufu

gcc/
2020-04-20  Jiufu Guo   

* common/config/rs6000/rs6000-common.c
(rs6000_option_optimization_table)[OPT_LEVELS_ALL]: Remove turn off
-fweb.
* config/rs6000/rs6000.c (rs6000_option_override_internal): Do not
set flag_web.


---
 gcc/common/config/rs6000/rs6000-common.c | 5 ++---
 gcc/config/rs6000/rs6000.c   | 4 +---
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/gcc/common/config/rs6000/rs6000-common.c 
b/gcc/common/config/rs6000/rs6000-common.c
index 4f38d566844..87f7fda940a 100644
--- a/gcc/common/config/rs6000/rs6000-common.c
+++ b/gcc/common/config/rs6000/rs6000-common.c
@@ -38,9 +38,8 @@ static const struct default_options 
rs6000_option_optimization_table[] =
loops at -O2 and above by default.  */
 { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_funroll_loops, NULL, 1 },
 { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
-/* -fweb and -frename-registers are useless in general for rs6000,
-   turn them off.  */
-{ OPT_LEVELS_ALL, OPT_fweb, NULL, 0 },
+
+/* -frename-registers is not very helpful for rs6000, turn it off.  */
 { OPT_LEVELS_ALL, OPT_frename_registers, NULL, 0 },
 
 /* Double growth factor to counter reduced min jump length.  */
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index a2992e682c8..6a9e701bd60 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4543,7 +4543,7 @@ rs6000_option_override_internal (bool global_init_p)
   SCHED_PRESSURE_MODEL);
 
   /* Explicit -funroll-loops turns -munroll-only-small-loops off, and
-turns -fweb and -frename-registers on.  */
+turns -frename-registers on.  */
   if ((global_options_set.x_flag_unroll_loops && flag_unroll_loops)
   || (global_options_set.x_flag_unroll_all_loops
   && flag_unroll_all_loops))
@@ -4552,8 +4552,6 @@ rs6000_option_override_internal (bool global_init_p)
unroll_only_small_loops = 0;
  if (!global_options_set.x_flag_rename_registers)
flag_rename_registers = 1;
- if (!global_options_set.x_flag_web)
-   flag_web = 1;
}
 
   /* If using typedef char *va_list, signal that
-- 
2.17.1