Re: [PATCH] avoid transform at run until wrap comparison

2021-09-02 Thread Bin.Cheng via Gcc-patches
On Thu, Sep 2, 2021 at 6:18 PM Richard Biener  wrote:
>
> On Thu, 2 Sep 2021, Jiufu Guo wrote:
>
> > When transforming
> >   {iv0.base, iv0.step} LT/LE {iv1.base, iv1.step}
> > to
> >   {iv0.base, iv0.step - iv1.step} LT/LE {iv1.base, 0}
> >
> > the result would be wrong if 'iv0.step - iv1.step' is negative,
> > because that means the loop runs until wrap/overflow.
> >
> > For example:
> >{1, +, 1} <= {4, +, 3} => {1, +, -2} <= {4, +, 0}
> >
> > This patch avoids this kind of transform.
> >
> > Bootstrap and regtest pass on ppc64le.
> > Is this ok for trunk?
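
A standalone illustration of the mismatch, using an 8-bit analog of the
example above (this program is just a sketch, not part of the patch):

#include <stdio.h>

int main (void)
{
  unsigned n1 = 0, n2 = 0;
  /* Original condition {1, +, 1} <= {4, +, 3} in 8-bit wrapping
     arithmetic: iv1 wraps to 0 on step 84, so the loop runs 84 times.  */
  for (unsigned char i = 1, j = 4; i <= j; i += 1, j += 3)
    n1++;
  /* Folded condition {1, +, -2} <= {4, +, 0}: the step -2 is 254
     unsigned, so the second value 255 already fails the compare.  */
  for (unsigned char i = 1; i <= 4; i -= 2)
    n2++;
  printf ("%u %u\n", n1, n2);   /* prints 84 1 */
  return 0;
}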
>
> This looks like PR100740, see the discussion starting at
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571570.html
>
> We seem to be at a dead end figuring what's exactly required
> to make the transform valid and I have my doubts that arguing
> with overflow we're not running in circles.
>
> My last attempt was
>
> +  if (code != NE_EXPR)
> +   {
> + if (TREE_CODE (step) != INTEGER_CST)
> +   return false;
> + if (!iv0->no_overflow || !iv1->no_overflow)
> +   return false;
> + /* The new IV does not overflow if the step has the same
> +sign and is less in magnitude.  */
> + if (tree_int_cst_sign_bit (iv0->step) != tree_int_cst_sign_bit (step)
> + || wi::geu_p (wi::abs (wi::to_wide (step)),
> +   wi::abs (wi::to_wide (iv0->step))))
> +   return false;
> +   }
>
> where your patch at least misses { 1, +, -1 } < { -3, +, -3 }
> transforming to { 1, +, +2 } < -3, thus a positive combined step but
> we're still iterating until wrap?  That is, the sign of the
> combined step cannot really be the issue at hand.
>
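That counterexample can be checked with an 8-bit analog, where 253 plays
the role of -3 (a standalone sketch, not from the thread):

#include <stdio.h>

int main (void)
{
  unsigned n1 = 0, n2 = 0;
  /* { 1, +, -1 } < { 253, +, -3 }: iv0 wraps from 0 to 255 on the
     third step, so the loop runs only twice.  */
  for (unsigned char i = 1, j = 253; i < j; i -= 1, j -= 3)
    n1++;
  /* Folded { 1, +, +2 } < 253: a positive combined step, yet the loop
     now runs until i walks all the way up to 253.  */
  for (unsigned char i = 1; i < 253; i += 2)
    n2++;
  printf ("%u %u\n", n1, n2);   /* prints 2 126 */
  return 0;
}
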
> Bin argued my condition is too strict and I agreed, see
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574334.html
> which is the last mail in the thread so far (still without a fix :/)
>
> Somewhere it was said that we eventually should simply put
> preconditions into ->assumptions rather than trying to
> check ->no_overflow and friends, not sure if that's really
I did some experiments that fix the PRs, but they caused new
failures in graphite, possibly because of missing cases.  However, I
didn't have time to look into the graphite tests back then.

Thanks,
bin
> applicable here.
>
> Richard.
>
> > BR.
> > Jiufu
> >
> > gcc/ChangeLog:
> >
> > 2021-09-02  Jiufu Guo  
> >
> >   PR tree-optimization/102131
> >   * tree-ssa-loop-niter.c (number_of_iterations_cond):
> >   Avoid the transform for until-wrap conditions.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 2021-09-02  Jiufu Guo  
> >
> >   PR tree-optimization/102131
> >   * gcc.dg/pr102131.c: New test.
> >
> > ---
> >  gcc/tree-ssa-loop-niter.c   | 18 +
> >  gcc/testsuite/gcc.dg/pr102131.c | 69 +
> >  2 files changed, 87 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.dg/pr102131.c
> >
> > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > index 7af92d1c893..52ce517af89 100644
> > --- a/gcc/tree-ssa-loop-niter.c
> > +++ b/gcc/tree-ssa-loop-niter.c
> > @@ -1866,6 +1866,24 @@ number_of_iterations_cond (class loop *loop,
> > || !iv0->no_overflow || !iv1->no_overflow))
> >   return false;
> >
> > +  /* GT/GE has been transformed to LT/LE already.
> > + cmp_code could be LT, LE or NE.
> > +
> > + For LE/LT, transforming
> > + {iv0.base, iv0.step} LT/LE {iv1.base, iv1.step}
> > + to
> > + {iv0.base, iv0.step - iv1.step} LT/LE {iv1.base, 0}
> > + is not accurate when iv0.step - iv1.step is negative, which
> > + means the loop decreases until wrap.
> > +
> > + For example:
> > + {1, +, 1} <= {4, +, 3}
> > + is not the same as
> > + {1, +, -2} <= {4, +, 0}.  */
> > +  if ((code == LE_EXPR || code == LT_EXPR) && tree_int_cst_sign_bit (step))
> > + return false;
> > +
> >iv0->step = step;
> >if (!POINTER_TYPE_P (type))
> >   iv0->no_overflow = false;
> > diff --git a/gcc/testsuite/gcc.dg/pr102131.c b/gcc/testsuite/gcc.dg/pr102131.c
> > new file mode 100644
> > index 000..0fcfaa132da
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/pr102131.c
> > @@ -0,0 +1,69 @@
> > +/* { dg-do run } */
> > +
> > +unsigned int a;
> > +int
> > +fun1 ()
> > +{
> > +  unsigned b = 0;
> > +  int c = 1;
> > +  for (; b < 3; b++)
> > +{
> > +  while (c < b)
> > + __builtin_abort ();
> > +  for (a = 0; a < 3; a++)
> > + c++;
> > +}
> > +  return 0;
> > +}
> > +
> > +unsigned b;
> > +int
> > +fun2 ()
> > +{
> > +  unsigned c = 0;
> > +  for (a = 0; a < 2; a++)
> > +for (b = 0; b < 2; b++)
> > +  if (++c < a)
> > + __builtin_abort ();
> > +  return 0;
> > +}
> > +
> > +int
> > +fun3 (void)
> > +{
> > +  unsigned a, b, c = 0;
> > +  for (a = 0; a < 10; a++)
> > +{
> > +  for (b = 0; b < 2; b++)
> > + {
> > +   c++;
> > +   if (c < a)
> > + {
> > +   __builtin_abort 

Re: Ping: [PATCH v2] Analyze niter for until-wrap condition [PR101145]

2021-08-24 Thread Bin.Cheng via Gcc-patches
On Wed, Aug 25, 2021 at 11:26 AM guojiufu  wrote:
>
> On 2021-08-16 09:33, Bin.Cheng wrote:
> > On Wed, Aug 4, 2021 at 10:42 AM guojiufu wrote:
> >>
> ...
> >> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.inc
> >> >> b/gcc/testsuite/gcc.dg/vect/pr101145.inc
> >> >> new file mode 100644
> >> >> index 000..6eed3fa8aca
> >> >> --- /dev/null
> >> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145.inc
> >> >> @@ -0,0 +1,63 @@
> >> >> +TYPE __attribute__ ((noinline))
> >> >> +foo_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
> >> >> +{
> >> >> +  for (l = L_BASE; n < l; l += C)
> >> >> +*a++ = *b++ + 1;
> >> >> +  return l;
> >> >> +}
> >> >> +
> >> >> +TYPE __attribute__ ((noinline))
> >> >> +bar_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
> >> >> +{
> >> >> +  for (l = L_BASE_DOWN; l < n; l -= C)
> > I noticed that both L_BASE and L_BASE_DOWN are defined as l, which
> > makes this test a bit confusing.  Could you clean up the use of l, for
> > example, by using an auto var for the loop index variable?
> > Otherwise the patch looks good to me.  Thanks very much for the work.
>
> Hi,
>
> Sorry for bothering you here.
> I feel this would be an approval (with the comment) already :)
>
> With the code changed to make it a little clearer, as:
>TYPE i;
>for (i = l; n < i; i += C)
>
> it may be ok to commit the patch to the trunk, right?
Yes please.  Thanks again for working on this.

Thanks,
bin
>
> BR,
> Jiufu
>
> >
> > Thanks,
> > bin
> >> >> +*a++ = *b++ + 1;
> >> >> +  return l;
> >> >> +}
> >> >> +
> >> >> +int __attribute__ ((noinline)) neq (int a, int b) { return a != b; }
> >> >> +
> >> >> +int a[1000], b[1000];
> >> >> +int fail;
> >> >> +
> >> >> +int
> ...
> >> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >> >> b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >> >> new file mode 100644
> >> >> index 000..94f6b99b893
> >> >> --- /dev/null
> >> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >> >> @@ -0,0 +1,15 @@
> >> >> +/* { dg-require-effective-target vect_int } */
> >> >> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
> >> >> +#define TYPE signed char
> >> >> +#define MIN -128
> >> >> +#define MAX 127
> >> >> +#define N_BASE (MAX - 32)
> >> >> +#define N_BASE_DOWN (MIN + 32)
> >> >> +
> >> >> +#define C 3
> >> >> +#define L_BASE l
> >> >> +#define L_BASE_DOWN l
> >> >> +


Re: Ping: [PATCH v2] Analyze niter for until-wrap condition [PR101145]

2021-08-15 Thread Bin.Cheng via Gcc-patches
On Wed, Aug 4, 2021 at 10:42 AM guojiufu  wrote:
>
> Hi,
>
> I would like to have a ping on this.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574596.html
Sorry for being late in replying.

>
> BR,
> Jiufu
>
> On 2021-07-15 08:17, guojiufu via Gcc-patches wrote:
> > Hi,
> >
> > I would like to have an early ping on this with more mail addresses.
> >
> > BR,
> > Jiufu.
> >
> > On 2021-07-07 20:47, Jiufu Guo wrote:
> >> Changes since v1:
> >> * Update assumptions for niter, add more test case checks
> >> * Use widest_int/wide_int instead of mpz to do +-/
> >> * Move some early check for quick return
> >>
> >> For code like:
> >> unsigned foo(unsigned val, unsigned start)
> >> {
> >>   unsigned cnt = 0;
> >>   for (unsigned i = start; i > val; ++i)
> >> cnt++;
> >>   return cnt;
> >> }
> >>
> >> The number of iterations should be about UINT_MAX - start.
> >>
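
For a concrete feel of that count, here is an 8-bit analog of foo (a
standalone sketch, not part of the patch):

#include <stdio.h>

static unsigned
foo8 (unsigned char val, unsigned char start)
{
  unsigned cnt = 0;
  for (unsigned char i = start; i > val; ++i)
    cnt++;
  return cnt;
}

int main (void)
{
  /* With start = 250 and val = 10, i runs 250..255 and then wraps to 0,
     which ends the loop, so the count is UCHAR_MAX - start + 1 = 6.  */
  printf ("%u\n", foo8 (10, 250));   /* prints 6 */
  return 0;
}
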
> >> There is the function adjust_cond_for_loop_until_wrap, which
> >> handles similar work for constant bases.
> >> Like adjust_cond_for_loop_until_wrap, this patch enhances the
> >> functions number_of_iterations_cond/number_of_iterations_lt
> >> to analyze the number of iterations for this kind of loop.
> >>
> >> Bootstrap and regtest pass on powerpc64le, x86_64 and aarch64.
> >> Is this ok for trunk?
> >>
> >> gcc/ChangeLog:
> >>
> >> 2021-07-07  Jiufu Guo  
> >>
> >>  PR tree-optimization/101145
> >>  * tree-ssa-loop-niter.c (number_of_iterations_until_wrap):
> >>  New function.
> >>  (number_of_iterations_lt): Invoke above function.
> >>  (adjust_cond_for_loop_until_wrap):
> >>  Merge to number_of_iterations_until_wrap.
> >>  (number_of_iterations_cond): Update invokes for
> >>  adjust_cond_for_loop_until_wrap and number_of_iterations_lt.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >> 2021-07-07  Jiufu Guo  
> >>
> >>  PR tree-optimization/101145
> >>  * gcc.dg/vect/pr101145.c: New test.
> >>  * gcc.dg/vect/pr101145.inc: New test.
> >>  * gcc.dg/vect/pr101145_1.c: New test.
> >>  * gcc.dg/vect/pr101145_2.c: New test.
> >>  * gcc.dg/vect/pr101145_3.c: New test.
> >>  * gcc.dg/vect/pr101145inf.c: New test.
> >>  * gcc.dg/vect/pr101145inf.inc: New test.
> >>  * gcc.dg/vect/pr101145inf_1.c: New test.
> >> ---
> >>  gcc/testsuite/gcc.dg/vect/pr101145.c  | 187 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145.inc|  63 
> >>  gcc/testsuite/gcc.dg/vect/pr101145_1.c|  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145_2.c|  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145_3.c|  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145inf.c   |  25 +++
> >>  gcc/testsuite/gcc.dg/vect/pr101145inf.inc |  28 
> >>  gcc/testsuite/gcc.dg/vect/pr101145inf_1.c |  23 +++
> >>  gcc/tree-ssa-loop-niter.c | 157 ++
> >>  9 files changed, 463 insertions(+), 65 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf.inc
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf_1.c
> >>
> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> b/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> new file mode 100644
> >> index 000..74031b031cf
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> @@ -0,0 +1,187 @@
> >> +/* { dg-require-effective-target vect_int } */
> >> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
> >> +#include 
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
> >> +{
> >> +  while (n < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned)
> >> +{
> >> +  while (UINT_MAX - 64 < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
> >> +{
> >> +  l = UINT_MAX - 32;
> >> +  while (n < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_3 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
> >> +{
> >> +  while (n <= ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_4 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
> >> +{  // infinite
> >> +  while (0 <= ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_5 (int 

[PATCH AArch64]Fix expanding of %w for *extend... pattern

2021-08-09 Thread bin.cheng via Gcc-patches
Hi,
When playing with std::experimental::simd, I found a bug newly introduced in
AArch64 backend.  As the commit message describes:
  Pattern "*extend2_aarch64" is duplicated
  from the corresponding zero_extend pattern, however % needs
  to be expanded according to its mode iterator because the smov
  instruction is different from umov.

Without the fix, the operand is always expanded to the w register no matter
what the mode is.
This simple patch fixes the issue which should be obvious.

Bootstrap and test on AArch64.  Any comments?

Thanks,
bin

pattern-signed-extend-20210809.patch
Description: Binary data
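
For context, a hypothetical illustration of the smov/umov asymmetry (these
functions are illustrative, not taken from the patch): sign-extending a
16-bit lane to 64 bits needs smov with an x-register destination, while the
unsigned case can use umov with a w-register destination because writing a
w register implicitly zeroes the upper 32 bits, so the register-name operand
modifier has to follow the destination mode.

#include <arm_neon.h>

/* Sign-extend lane 0 to 64 bits: wants "smov x0, v0.h[0]".  */
long long
sext_lane (int16x4_t v)
{
  return vget_lane_s16 (v, 0);
}

/* Zero-extend lane 0 to 64 bits: "umov w0, v0.h[0]" suffices.  */
unsigned long long
zext_lane (uint16x4_t v)
{
  return vget_lane_u16 (v, 0);
}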


Re: 0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-27 Thread Bin.Cheng via Gcc-patches
On Mon, Jul 26, 2021 at 11:07 PM Jeff Law  wrote:
>
>
>
> On 7/25/2021 7:47 PM, Bin.Cheng wrote:
> > On Sat, Jul 24, 2021 at 12:30 AM Jeff Law via Gcc-patches
> >  wrote:
> >>
> >>
> >> On 7/14/2021 3:14 AM, bin.cheng via Gcc-patches wrote:
> >>> Hi,
> >>> I ran into a wrong code bug in code with deep template instantiation when 
> >>> working on stdx::simd.
> >>> The root cause as described in commit summary is we skip prologue insns 
> >>> in init_alias_analysis.
> >>> This simple patch fixes the issue, however, it's hard to reduce a case 
> >>> because of heavy use of
> >>> templates.
> >>> Bootstrap and test on x86_64, is it OK?
> >> It's a clear correctness improvement, but what's unclear to me is why
> >> we'd want to skip them in the epilogue either.
> > I can only guess: there is nothing to initialize for the epilogue
> > because no code follows.
> Yea, but couldn't the lack of analysis of the epilogue lead to a pass
> mis-optimizing code within the epilogue itself?  It's not terribly
> likely, but it just seems wrong to skip the epilogue like this.
> Remember, the aliasing bits are just an analysis phase to find the
> aliasing relationships that exist and we don't necessarily know how that
> data is going to be used.  It may in fact be safe now, but may not be
> safe in the future if someone added a late RTL pass that used the
> aliasing info in a new way.
>
> The more I think about it, the more I think we should remove this
> hunk of code completely.  There is some chance for fallout, but I think
> it's unlikely.
Hi Jeff,
Thanks for the suggestion, here is the simple patch removing all of it.
diff --git a/gcc/alias.c b/gcc/alias.c
index 69e1eb89ac6..099acabca6b 100644
--- a/gcc/alias.c
+++ b/gcc/alias.c
@@ -3406,14 +3406,6 @@ init_alias_analysis (void)
   rpo = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
   rpo_cnt = pre_and_rev_post_order_compute (NULL, rpo, false);

-  /* The prologue/epilogue insns are not threaded onto the
- insn chain until after reload has completed.  Thus,
- there is no sense wasting time checking if INSN is in
- the prologue/epilogue until after reload has completed.  */
-  bool could_be_prologue_epilogue = ((targetm.have_prologue ()
- || targetm.have_epilogue ())
-&& reload_completed);
-
   pass = 0;
   do
 {
@@ -3459,10 +3451,6 @@ init_alias_analysis (void)
{
  rtx note, set;

- if (could_be_prologue_epilogue
- && prologue_epilogue_contains (insn))
-   continue;
-
  /* If this insn has a noalias note, process it,  Otherwise,
 scan for sets.  A simple set will have no side effects
 which could change the base value of any other
register.  */

No fallouts in bootstrap/test on x86_64.  Is it OK?

Thanks,
bin
>
> Jeff
>


Re: [PATCH][RFC] tree-optimization/100499 - multiple_of_p bad behavior wrt niter analysis

2021-07-25 Thread Bin.Cheng via Gcc-patches
On Thu, Jul 22, 2021 at 6:41 PM Richard Biener  wrote:
>
> This avoids using multiple_of_p in niter analysis when its behavior
Hmm, but this patch actually introduces one more call to
multiple_of_p; also it doesn't touch the use below:
if (niter->control.no_overflow && multiple_of_p (type, c, s))
  {
niter->niter = fold_build2 (FLOOR_DIV_EXPR, niter_type, c, s);
return true;
  }

> to assume the tested expression does not invoke integer overflow
> produces wrong niter analysis results.  For the cases multiple_of_p
> handles power-of-two values of bottom look safe which should also be
> the majority of cases we care about in addition to the constant case
> now handled explicitely.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> I'm unsure how important a "perfect" solution is (rewriting
> multiple_of_p), and wonder whether this solution is preferable
I am still for this approach now; it only needs to be conservative
rather than perfect, especially if there are not many breakages with a
conservative version of multiple_of_p.

> for now (and especially for branches).  I've not yet tried
> to sanitize multiple_of_p plus use range info to prove
> no-overflow where TYPE_OVERFLOW_UNDEFINED doesn't tell us
> immediately.
>
> 2021-07-22  Richard Biener  
>
> PR tree-optimization/100499
> * tree-ssa-loop-niter.c (number_of_iterations_ne): Restrict
> multiple_of_p queries to power-of-two bottoms, handle
> the all constant case inline.
>
> * gcc.dg/torture/pr100499-1.c: New testcase.
> * gcc.dg/torture/pr100499-2.c: Likewise.
> ---
>  gcc/testsuite/gcc.dg/torture/pr100499-1.c | 28 +++
>  gcc/testsuite/gcc.dg/torture/pr100499-2.c | 16 +
>  gcc/tree-ssa-loop-niter.c |  8 ++-
>  3 files changed, 51 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr100499-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr100499-2.c
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr100499-1.c 
> b/gcc/testsuite/gcc.dg/torture/pr100499-1.c
> new file mode 100644
> index 000..97ab6051554
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr100499-1.c
> @@ -0,0 +1,28 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target int32plus } */
> +
> +typedef __UINT16_TYPE__ uint16_t;
> +static uint16_t g_2823 = 0xEC75L;
> +static uint16_t g_116 = 0xBC07L;
> +
> +static uint16_t
> +safe_mul_func_uint16_t_u_u(uint16_t ui1, uint16_t ui2)
> +{
> +  return ((unsigned int)ui1) * ((unsigned int)ui2);
> +}
> +
> +int main (int argc, char* argv[])
> +{
> +  uint16_t l_2815 = 65535UL;
> > +  uint16_t *l_2821 = &g_116;
> > +  uint16_t *l_2822 = &g_2823;
> +
> +lbl_2826:
> +  l_2815 &= 0x9DEF1EAEL;
> > +  if (+(safe_mul_func_uint16_t_u_u(((*l_2821) = l_2815), (--(*l_2822)))))
> +goto lbl_2826;
> +
> +  if (g_2823 != 32768)
> +__builtin_abort ();
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.dg/torture/pr100499-2.c 
> b/gcc/testsuite/gcc.dg/torture/pr100499-2.c
> new file mode 100644
> index 000..999f931806a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr100499-2.c
> @@ -0,0 +1,16 @@
> +/* { dg-do run } */
> +
> +unsigned char ag = 55;
> +unsigned i;
> +int main()
> +{
> +  unsigned char c;
> +  unsigned char a = ag;
> +d:
> +  c = a-- * 52;
> +  if (c)
> +goto d;
> +  if (a != 255)
> +__builtin_abort ();
> +  return 0;
> +}
> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index 1b5605c26b8..c6b953c5316 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -1050,7 +1050,13 @@ number_of_iterations_ne (class loop *loop, tree type, affine_iv *iv,
>   Note, for NE_EXPR, base equals to FINAL is a special case, in
>   which the loop exits immediately, and the iv does not overflow.  */
>if (!niter->control.no_overflow
> -  && (integer_onep (s) || multiple_of_p (type, c, s)))
> +  && (integer_onep (s)
> + || (poly_int_tree_p (c)
> + && multiple_p (wi::to_poly_widest (c), wi::to_poly_widest (s)))
> + /* ???  multiple_of_p assumes the expression 'c' does not overflow
> +    but that cannot be guaranteed, so we restrict 's' to power of
> +    two values where that should not be an issue.  See PR100499.  */
> + || (integer_pow2p (s) && multiple_of_p (type, c, s))))
>  {
>tree t, cond, new_c, relaxed_cond = boolean_false_node;
I (to be blamed) added this part of code to specially handle cases like
pr34114; now I feel it's in the wrong direction.  Ideally this part of
code is unnecessary, and the conditions will be (they are now) incorporated
into niter->assumptions, which should be simplified to 1/0
correspondingly.  The only problem is that the assumptions are not
appropriately simplified.
Is it possible to incorporate a more powerful solver (like Z3) in GCC
for such cases, e.g., assumption simplification, multiple_of_p, etc.?
Oh, we don't do SCEV analysis 

Re: 0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-25 Thread Bin.Cheng via Gcc-patches
On Sat, Jul 24, 2021 at 12:30 AM Jeff Law via Gcc-patches wrote:
>
>
>
> On 7/14/2021 3:14 AM, bin.cheng via Gcc-patches wrote:
> > Hi,
> > I ran into a wrong code bug in code with deep template instantiation when 
> > working on stdx::simd.
> > The root cause as described in commit summary is we skip prologue insns in 
> > init_alias_analysis.
> > This simple patch fixes the issue, however, it's hard to reduce a case 
> > because of heavy use of
> > templates.
> > Bootstrap and test on x86_64, is it OK?
> It's a clear correctness improvement, but what's unclear to me is why
> we'd want to skip them in the epilogue either.
I can only guess: there is nothing to initialize for the epilogue
because no code follows.

Thanks,
bin
>
> Jeff


Re: 0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-23 Thread Bin.Cheng via Gcc-patches
On Fri, Jul 23, 2021 at 7:53 AM Segher Boessenkool wrote:
>
> On Wed, Jul 14, 2021 at 05:14:16PM +0800, bin.cheng via Gcc-patches wrote:
> > Hi,
> > I ran into a wrong code bug in code with deep template instantiation when 
> > working on stdx::simd.
> > The root cause as described in commit summary is we skip prologue insns in 
> > init_alias_analysis.
> > This simple patch fixes the issue, however, it's hard to reduce a case 
> > because of heavy use of
> > templates.
>
> > Subject: [PATCH 1/2] Don't skip prologue instructions as it could affect 
> > alias
> >  analysis
> >
> > In init_alias_analysis, we skip prologue instructions but this is
> > wrong as it could affect base value analysis for registers as well
> > as following analysis/transformations like cselib and dse:
> >   prologue:
> > x12 = 0x1810
> > sp = sp - x12
> >   ...
> > ...
> > x12 = SYMBOL_REF(.LC89)
> > Here reg_base_value[12] is set to ".LC89", however this is only true
> > after the second instruction setting x12.  The patch fixes the issue
> > by just handling prologue instructions as normal.  Though ideally it
> > can be improved in context-sensitive way.
>
> In what pass do you get the bad behaviour?  dse2?  postreload?  Or what
> else?
Hi Segher,
It is dse2 deleting non dead stores based on wrong cse info again
based on wrong alias info.
The code flow is quite tricky, given insn list like:
 x12 = 0x1810
 sp = sp - x12
 ...
 [sp + offset] = x
 y = [sp + offset]
 ...
 [sp + offset] = z
 ...
 x12 = SYMBOL_REF(.LC89)

1. alias analysis computes a wrong reg_base_value[12] = symbol_ref(.LC89)
2. when dse2 processes "y = [sp + offset]", the calling sequence is:
  scan_insn
-> check_mem_read_rtx
  -> check_mem_read_rtx
-> canon_true_dependence(group_id == -1)
  -> true_dependence_1 which has below code:
--
  base = find_base_term (x_addr);
  if (base && (GET_CODE (base) == LABEL_REF
       || (GET_CODE (base) == SYMBOL_REF
           && CONSTANT_POOL_ADDRESS_P (base))))
    return 0;
x_addr is "sp - x12", sp has no base term, however x12 has
symbol_ref(.LC89) base term, and the code returns 0.  As a result,
dse2 considers storing x as dead when processing "[sp + offset] = z".

Sorry I can't reproduce a case out of it.

Thanks,
bin

>
> Your patch looks correct, but I'd like to know why it has seemed to work
> for so long :-)
>
>
> Segher


Re: 0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-22 Thread Bin.Cheng via Gcc-patches
Gentle ping.  Any suggestions would be appreciated.

Thanks,
bin

On Wed, Jul 14, 2021 at 5:15 PM bin.cheng via Gcc-patches wrote:
>
> Hi,
> I ran into a wrong code bug in code with deep template instantiation when 
> working on stdx::simd.
> The root cause as described in commit summary is we skip prologue insns in 
> init_alias_analysis.
> This simple patch fixes the issue, however, it's hard to reduce a case 
> because of heavy use of
> templates.
> Bootstrap and test on x86_64, is it OK?
>
> Thanks,
> bin


Re: [committed] add test for PR 86650

2021-07-20 Thread Bin.Cheng via Gcc-patches
On Wed, Jul 7, 2021 at 5:39 AM Martin Sebor via Gcc-patches wrote:
>
> The recent patch series to improve warning suppression for inlined
> functions [PR98512] also implicitly includes the inlining context
> in all warning messages for inlined code.  In r12-2091 I have
> committed the attached test to verify that -Warray-bounds too
> includes this context (its absence is the subject of PR 86650).
Hi,
It seems this patch exposes/causes an uninitialized warning in arm_neon.h
like the following one:

__extension__ extern __inline void
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vst2q_s32 (int32_t * __a, int32x4x2_t __val)
{
  __builtin_aarch64_simd_oi __o;
  __o = __builtin_aarch64_set_qregoiv4si (__o, (int32x4_t) __val.val[0], 0);
  __o = __builtin_aarch64_set_qregoiv4si (__o, (int32x4_t) __val.val[1], 1);
  __builtin_aarch64_st2v4si ((__builtin_aarch64_simd_si *) __a, __o);
}

Thanks,
bin


[PATCH AArch64]Use stable sort in generating ldp/stp

2021-07-14 Thread bin.cheng via Gcc-patches
Hi,
Like the previous patch, this was found when I was playing with stdx::simd.
It's an obvious change as described in the commit summary.  Also, the dead
store in the code should be optimized away, but I guess there is no
guarantee, so here is a simple patch fixing it.


Is it OK?

Thanks,
bin

0002-AArch64-use-stable-sorting-in-generating-ldp-stp.patch
Description: Binary data


0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-14 Thread bin.cheng via Gcc-patches
Hi,
I ran into a wrong code bug in code with deep template instantiation when 
working on stdx::simd.
The root cause as described in commit summary is we skip prologue insns in 
init_alias_analysis.
This simple patch fixes the issue, however, it's hard to reduce a case because 
of heavy use of
templates.
Bootstrap and test on x86_64, is it OK?

Thanks,
bin

0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch
Description: Binary data


Re: [PATCH] New hook adjust_iv_update_pos

2021-07-06 Thread Bin.Cheng via Gcc-patches
On Mon, Jun 28, 2021 at 4:26 PM Richard Biener
 wrote:
>
> On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo  wrote:
> >
> >
> >
> > On 2021/6/25 18:02, Richard Biener wrote:
> > > On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo  wrote:
> > >>
> > >>
> > >>
> > >> On 2021/6/25 16:54, Richard Biener wrote:
> > >>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> > >>>  wrote:
> > 
> >  From: Xiong Hu Luo 
> > 
> >  adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
> >  on Power.  For example, it generates mismatched address offset after
> >  adjust iv update statement position:
> > 
> >   [local count: 70988443]:
> >  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >  ivtmp.30_415 = ivtmp.30_414 + 1;
> >  _34 = ref_180 + 18446744073709551615;
> >  _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >  if (_84 == _86)
> >  goto ; [94.50%]
> >  else
> >  goto ; [5.50%]
> > 
> >  Disable it will produce:
> > 
> >   [local count: 70988443]:
> >  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >  _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
> >  ivtmp.30_415 = ivtmp.30_414 + 1;
> >  if (_84 == _86)
> >  goto ; [94.50%]
> >  else
> >  goto ; [5.50%]
> > 
> >  Then later pass loop unroll could benefit from same address offset
> >  with different base address and reduces register dependency.
> >  This patch could improve performance by 10% for typical case on Power,
> >  no performance change observed for X86 or Aarch64 due to small loops
> >  not unrolled on these platforms.  Any comments?
> > >>>
> > >>> The case you quote is special in that if we hoisted the IV update before
> > >>> the other MEM _also_ used in the condition it would be fine again.
> > >>
> > >> Thanks.  I tried to hoist the IV update statement before the first MEM 
> > >> (Fix 2), it
> > >> shows even worse performance due to not unroll(two more "base-1" is 
> > >> generated in gimple,
> > >> then loop->ninsns is 11 so small loops is not unrolled), change the 
> > >> threshold from
> > >> 10 to 12 in rs6000_loop_unroll_adjust would make it also unroll 2 times, 
> > >> the
> > >> performance is SAME to the one that IV update statement in the *MIDDLE* 
> > >> (trunk).
> > >>  From the ASM, we can see the index register %r4 is used in two 
> > >> iterations which
> > >> maybe a bottle neck for hiding instruction latency?
> > >>
> > >> Then it seems reasonable the performance would be better if keep the IV 
> > >> update
> > >> statement at *LAST* (Fix 1).
> > >>
> > >> (Fix 2):
> > >> [local count: 70988443]:
> > >>ivtmp.30_415 = ivtmp.30_414 + 1;
> > >>_34 = ip_229 + 18446744073709551615;
> > >>_84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> > >>_33 = ref_180 + 18446744073709551615;
> > >>_86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
> > >>if (_84 == _86)
> > >>  goto ; [94.50%]
> > >>else
> > >>  goto ; [5.50%]
> > >>
> > >>
> > >> .L67:
> > >>  lbzx %r12,%r24,%r4
> > >>  lbzx %r25,%r7,%r4
> > >>  cmpw %cr0,%r12,%r25
> > >>  bne %cr0,.L11
> > >>  mr %r26,%r4
> > >>  addi %r4,%r4,1
> > >>  lbzx %r12,%r24,%r4
> > >>  lbzx %r25,%r7,%r4
> > >>  mr %r6,%r26
> > >>  cmpw %cr0,%r12,%r25
> > >>  bne %cr0,.L11
> > >>  mr %r26,%r4
> > >> .L12:
> > >>  cmpdi %cr0,%r10,1
> > >>  addi %r4,%r26,1
> > >>  mr %r6,%r26
> > >>  addi %r10,%r10,-1
> > >>  bne %cr0,.L67
> > >>
> > >>>
> > >>> Now, adjust_iv_update_pos doesn't seem to check that the
> > >>> condition actually uses the IV use stmt def, so it likely applies to
> > >>> too many cases.
> > >>>
> > >>> Unfortunately the introducing rev didn't come with a testcase,
> > >>> but still I think fixing up adjust_iv_update_pos is better than
> > >>> introducing a way to short-cut it per target decision.
> > >>>
> > >>> One "fix" might be to add a check that either the condition
> > >>> lhs or rhs is the def of the IV use and the other operand
> > >>> is invariant.  Or if it's of similar structure hoist across the
> > >>> other iv-use as well.  Not that I understand the argument
> > >>> about the overlapping life-range.
> > >>>
> > >>> You also don't provide a complete testcase ...
> > >>>
> > >>
> > >> Attached the test code, will also add it it patch in future version.
> > >> The issue comes from a very small hot loop:
> > >>
> > >>  do {
> > >>len++;
> > >>  } while(len < maxlen && ip[len] == ref[len]);
> > >
> > > unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> > > {
> > >unsigned int len = 2;
> > >do {
> > >len++;
> > >}while(len < maxlen && ip[len] == ref[len]);
> > >return len;
> > > }
> > >
> > > I can see the effect on this loop on x86_64 as well, we end 

Re: [PATCH PR100740]Fix overflow check in simplifying exit cond comparing two IVs.

2021-07-01 Thread Bin.Cheng via Gcc-patches
On Thu, Jul 1, 2021 at 8:19 PM Richard Biener wrote:
>
> On Mon, Jun 7, 2021 at 4:35 PM Richard Biener
>  wrote:
> >
> > On Sun, Jun 6, 2021 at 12:01 PM Bin.Cheng  wrote:
> > >
> > > On Wed, Jun 2, 2021 at 3:28 PM Richard Biener via Gcc-patches
> > >  wrote:
> > > >
> > > > On Tue, Jun 1, 2021 at 4:00 PM bin.cheng via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > Hi,
> > > > > As described in patch summary, this fixes the wrong code issue by 
> > > > > adding overflow-ness
> > > > > check for iv1.step - iv2.step.
> > > > >
> > > > > Bootstrap and test on x86_64.  Any comments?
> > > >
> > > > + bool wrap_p = TYPE_OVERFLOW_WRAPS (step_type);
> > > > + if (wrap_p)
> > > > +   {
> > > > + tree t = fold_binary_to_constant (GE_EXPR, step_type,
> > > > +   iv0->step, iv1->step);
> > > > + wrap_p = integer_zerop (t);
> > > > +   }
> > > >
> > > > I think we can't use TYPE_OVERFLOW_WRAPS/TYPE_OVERFLOW_UNDEFINED since
> > > > that's only relevant for expressions written by the user - we're
> > > > computing iv0.step - iv1.step
> > > > which can even overflow when TYPE_OVERFLOW_UNDEFINED (in fact we may not
> > > > even generate this expression then!).  So I think we have to do sth like
> > > >
> > > >/* If the iv0->step - iv1->step wraps, fail.  */
> > > >if (!operand_equal_p (iv0->step, iv1->step)
> > > >&& (TREE_CODE (iv0->step) != INTEGER_CST || TREE_CODE (iv1->step) != INTEGER_CST)
> > > >&& !wi::gt (wi::to_widest (iv0->step), wi::to_widest (iv1->step))
> > > >  return false;
> > > >
> > > > which only handles equality and all integer constant steps. You could
> > > Thanks for the suggestion.  I realized that we have LE/LT/NE
> > > conditions here, and for LE/LT what we need to check is that iv0/iv1
> > > converge to each other, rather than diverge.  Also, steps here can only
> > > be constants, so there is no need to use range information.
> >
> > Ah, that simplifies things.
> >
> > + if (tree_int_cst_lt (iv0->step, iv1->step))
> > +   return false;
> >
> > so it looks to me that iv?->step can be negative which means we should
> > verify that abs(iv0->step - iv1->step) <= abs (iv0->step), correct?
> >
> >tree step = fold_binary_to_constant (MINUS_EXPR, step_type,
> >iv0->step, iv1->step);
> > ...
> > + if (TREE_CODE (step) != INTEGER_CST)
> > +   return false;
> >
> > note fold_binary_to_constant will return NULL if the result is not
> > TREE_CONSTANT (which would also include symbolic constants
> > like &a - &b).  It wasn't checked before, of course, but since we're
> > touching the code we might very well be checking for NULL step ...
> > (or assert it is not for documentation purposes).
> >
> > That said, if iv0->step and iv1->step are known INTEGER_CSTs
> > (I think they indeed are given the constraints we impose on
> > simple_iv_with_niters).
> >
> > That said, with just a quick look again it looks to me the
> > IV1 {<=,<} IV2 transform to IV1 - IV2step {<=,<} IV2base
> > is OK whenever the effective step magnitude on the IV1'
> > decreases, thus abs(IV1.step - IV2.step) <= abs(IV1.step)
> > since then IV1 is still guaranteed to not overflow.  But
> > for example {0, +, 1} and {10, -, 1} also converge if the
> > number of iterations is less than 10 but they would not pass
> > this criteria.  So I'm not sure "convergence" is a good wording
> > here - but maybe I'm missing some critical piece of understanding
> > here.
> >
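A quick check of that example (a standalone sketch, not from the thread):

#include <stdio.h>

int main (void)
{
  /* {0, +, 1} < {10, -, 1} terminates after 5 iterations without any
     wrap, yet |step0 - step1| = 2 > |step0| = 1, so the magnitude
     criterion above would reject the transform.  */
  unsigned n = 0;
  for (int i = 0, j = 10; i < j; i++, j--)
    n++;
  printf ("%u\n", n);   /* prints 5 */
  return 0;
}
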
> > But in any case it looks like we're on the right track ;)
>
> Just to pick up things where we left them (and seeing the patch to
> unti-wrap which reminded me), I've digged in myself a bit and
> came to the following conclusion.
>
> The b0 + s0 < b1 + s1 -> b0 + s0 - s1 < b1 transform is only
> valid if b0 + s0 - s1 does not wrap which we can only ensure
> by ensuring that b0 + s0 and b1 + s1 do not wrap (which we
> already check) but also - what we're missing - that (s0 - s1)
> makes b0 still evolve in the same direction

Re: [PATCH] Analyze niter for until-wrap condition [PR101145]

2021-07-01 Thread Bin.Cheng via Gcc-patches
On Thu, Jul 1, 2021 at 10:15 PM guojiufu via Gcc-patches wrote:
>
> On 2021-07-01 20:35, Richard Biener wrote:
> > On Thu, 1 Jul 2021, Jiufu Guo wrote:
> >
> >> For code like:
> >> unsigned foo(unsigned val, unsigned start)
> >> {
> >>   unsigned cnt = 0;
> >>   for (unsigned i = start; i > val; ++i)
> >> cnt++;
> >>   return cnt;
> >> }
> >>
> >> The number of iterations should be about UINT_MAX - start.
> >
> > For
> >
> > unsigned foo(unsigned val, unsigned start)
> > {
> >   unsigned cnt = 0;
> >   for (unsigned i = start; i >= val; ++i)
> > cnt++;
> >   return cnt;
> > }
> >
> > and val == 0 the loop never terminates.  I don't see anywhere
> > in the patch that you disregard GE_EXPR and I remember
> > the code handles GE as well as GT?  From a quick look this is
> > also not covered by a testcase you add - not exactly sure
> > how it would materialize in a miscompilation.
>
> In number_of_iterations_cond, there is code:
> if (code == GE_EXPR || code == GT_EXPR
> || (code == NE_EXPR && integer_zerop (iv0->step)))
>{
>  std::swap (iv0, iv1);
>  code = swap_tree_comparison (code);
>}
> It converts "GT/GE" (i >= val) to "LT/LE" (val <= i),
> and LE (val <= i) is converted to LT (val - 1 < i).
> So, the code is added to number_of_iterations_lt.
>
> But, this patch leads to miscompilation for unsigned "i >= val" with the
> above transforms: converting LE (val <= i) to LT (val - 1 < i)
> is not appropriate in some cases (e.g. where val=0).
I don't know where the exact code is, but IIRC, number_of_iteration
handles boundary conditions when transforming <= into <.  You may
check it out.

> Thanks for pointing out this!!!
>
> I would investigate a way to handle this correctly.
> A possible way maybe just to return false for this kind of LE.
IIRC, it checks the boundary conditions, either returns false or
simply introduces more assumptions.
>
> Any suggestions?
>
> >
> >> There is the function adjust_cond_for_loop_until_wrap, which
> >> handles similar work for constant bases.
> >> Like adjust_cond_for_loop_until_wrap, this patch enhances the
> >> functions number_of_iterations_cond/number_of_iterations_lt
> >> to analyze the number of iterations for this kind of loop.
> >>
> >> Bootstrap and regtest pass on powerpc64le, is this ok for trunk?
> >>
> >> gcc/ChangeLog:
> >>
> >>  PR tree-optimization/101145
> >>  * tree-ssa-loop-niter.c
> >>  (number_of_iterations_until_wrap): New function.
> >>  (number_of_iterations_lt): Invoke above function.
> >>  (adjust_cond_for_loop_until_wrap):
> >>  Merge to number_of_iterations_until_wrap.
> >>  (number_of_iterations_cond): Update invokes for
> >>  adjust_cond_for_loop_until_wrap and number_of_iterations_lt.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>  PR tree-optimization/101145
> >>  * gcc.dg/vect/pr101145.c: New test.
> >>  * gcc.dg/vect/pr101145.inc: New test.
> >>  * gcc.dg/vect/pr101145_1.c: New test.
> >>  * gcc.dg/vect/pr101145_2.c: New test.
> >>  * gcc.dg/vect/pr101145_3.c: New test.
> >> ---
> >>  gcc/testsuite/gcc.dg/vect/pr101145.c   | 187 +
> >>  gcc/testsuite/gcc.dg/vect/pr101145.inc |  63 +
> >>  gcc/testsuite/gcc.dg/vect/pr101145_1.c |  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145_2.c |  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145_3.c |  15 ++
> >>  gcc/tree-ssa-loop-niter.c  | 150 +++-
> >>  6 files changed, 380 insertions(+), 65 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c
> >>
> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> b/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> new file mode 100644
> >> index 000..74031b031cf
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> @@ -0,0 +1,187 @@
> >> +/* { dg-require-effective-target vect_int } */
> >> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
> >> +#include 
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
> >> +{
> >> +  while (n < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_1 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned)
> >> +{
> >> +  while (UINT_MAX - 64 < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned n)
> >> +{
> >> +  l = UINT_MAX - 32;
> >> +  while (n < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_3 (int *__restrict__ a, int *__restrict__ b, 

Re: [PATCH] Analyze niter for until-wrap condition [PR101145]

2021-07-01 Thread Bin.Cheng via Gcc-patches
On Thu, Jul 1, 2021 at 10:06 AM Jiufu Guo via Gcc-patches wrote:
>
> For code like:
> unsigned foo(unsigned val, unsigned start)
> {
>   unsigned cnt = 0;
>   for (unsigned i = start; i > val; ++i)
> cnt++;
>   return cnt;
> }
>
> The number of iterations should be about UINT_MAX - start.
>
> There is the function adjust_cond_for_loop_until_wrap, which
> handles similar work for constant bases.
> Like adjust_cond_for_loop_until_wrap, this patch enhances the
> functions number_of_iterations_cond/number_of_iterations_lt
> to analyze the number of iterations for this kind of loop.
>
> Bootstrap and regtest pass on powerpc64le, is this ok for trunk?
>
> gcc/ChangeLog:
>
> PR tree-optimization/101145
> * tree-ssa-loop-niter.c
> (number_of_iterations_until_wrap): New function.
> (number_of_iterations_lt): Invoke above function.
> (adjust_cond_for_loop_until_wrap):
> Merge to number_of_iterations_until_wrap.
> (number_of_iterations_cond): Update invokes for
> adjust_cond_for_loop_until_wrap and number_of_iterations_lt.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/101145
> * gcc.dg/vect/pr101145.c: New test.
> * gcc.dg/vect/pr101145.inc: New test.
> * gcc.dg/vect/pr101145_1.c: New test.
> * gcc.dg/vect/pr101145_2.c: New test.
> * gcc.dg/vect/pr101145_3.c: New test.
> ---
>  gcc/testsuite/gcc.dg/vect/pr101145.c   | 187 +
>  gcc/testsuite/gcc.dg/vect/pr101145.inc |  63 +
>  gcc/testsuite/gcc.dg/vect/pr101145_1.c |  15 ++
>  gcc/testsuite/gcc.dg/vect/pr101145_2.c |  15 ++
>  gcc/testsuite/gcc.dg/vect/pr101145_3.c |  15 ++
>  gcc/tree-ssa-loop-niter.c  | 150 +++-
>  6 files changed, 380 insertions(+), 65 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c
>

> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index b5add827018..06db6a36ef8 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -1473,6 +1473,86 @@ assert_loop_rolls_lt (tree type, affine_iv *iv0, 
> affine_iv *iv1,
>  }
>  }
>
> +/* Determines the number of iterations of a loop whose ending condition
> +   is IV0 < IV1, which looks like:  {base, -C} < n,  or n < {base, C}.
> +   The number of iterations is stored to NITER.  */
> +
> +static bool
> +number_of_iterations_until_wrap (class loop *, tree type, affine_iv *iv0,
> +affine_iv *iv1, class tree_niter_desc *niter)
> +{
> +  tree niter_type = unsigned_type_for (type);
> +  tree max, min;
> +
> +  if (POINTER_TYPE_P (type))
> +{
> +  max = fold_convert (type, TYPE_MAX_VALUE (niter_type));
> +  min = fold_convert (type, TYPE_MIN_VALUE (niter_type));
> +}
> +  else
> +{
> +  max = TYPE_MAX_VALUE (type);
> +  min = TYPE_MIN_VALUE (type);
> +}
> +
> +  tree high = max, low = min, one = build_int_cst (niter_type, 1);
> +  tree step;
> +
> +  /* n < {base, C}. */
> +  if (integer_zerop (iv0->step) && TREE_CODE (iv1->step) == INTEGER_CST
> +  && !tree_int_cst_sign_bit (iv1->step))
> +{
> +  step = iv1->step;
> +  niter->niter = fold_build2 (MINUS_EXPR, niter_type, max, iv1->base);
max/iv1->base could be of pointer type, not sure if this is canonical though.

> +  if (TREE_CODE (iv1->base) == INTEGER_CST)
> +   low = fold_build2 (MINUS_EXPR, type, iv1->base, one);
> +  else if (TREE_CODE (iv0->base) == INTEGER_CST)
> +   low = iv0->base;
> +}
> +  /* {base, -C} < n. */
> +  else if (TREE_CODE (iv0->step) == INTEGER_CST
> +  && tree_int_cst_sign_bit (iv0->step) && integer_zerop (iv1->step))
> +{
> +  step = fold_build1 (NEGATE_EXPR, TREE_TYPE (iv0->step), iv0->step);
> +  niter->niter = fold_build2 (MINUS_EXPR, niter_type, iv0->base, min);
> +  if (TREE_CODE (iv0->base) == INTEGER_CST)
> +   high = fold_build2 (PLUS_EXPR, type, iv0->base, one);
> +  else if (TREE_CODE (iv1->base) == INTEGER_CST)
> +   high = iv1->base;
> +}
> +  else
> +return false;
> +
> +  /* (delta + step - 1) / step */
> +  step = fold_convert (niter_type, step);
> +  niter->niter = fold_convert (niter_type, niter->niter);
> +  niter->niter = fold_build2 (PLUS_EXPR, niter_type, niter->niter, step);
> +  niter->niter = fold_build2 (FLOOR_DIV_EXPR, niter_type, niter->niter, step);
> +
> +  tree m = fold_build2 (MINUS_EXPR, niter_type, high, low);
> +  m = fold_convert (niter_type, m);
> +  mpz_t mstep, tmp, mmax;
> +  mpz_init (mstep);
> +  mpz_init (tmp);
> +  mpz_init (mmax);
> +  wi::to_mpz (wi::to_wide (step), mstep, UNSIGNED);
> +  wi::to_mpz (wi::to_wide (m), mmax, UNSIGNED);
> +  mpz_add (tmp, mmax, mstep);
> +  mpz_sub_ui (tmp, 

Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-06-07 Thread Bin.Cheng via Gcc-patches
On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches wrote:
>
> Hi,
>
> This RFC is motivated by the IV sharing RFC in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the
> need to have the IVOPTS pass be able to clean up IV's shared between
> multiple loops. When creating a similar problem with C code I noticed
> IVOPTs treated IV's with uses outside the loop differently; this didn't
> even require multiple loops. Take for instance the following example
> using SVE intrinsics:
>
> #include 
> #include 
> extern void use (char *);
> void bar (char  * __restrict__ a, char * __restrict__ b, char *
> __restrict__ c, unsigned n)
> {
>  svbool_t all_true = svptrue_b8 ();
>unsigned i = 0;
>if (n < (UINT_MAX - svcntb() - 1))
>  {
>  for (; i < n; i += svcntb())
>  {
>  svuint8_t va = svld1 (all_true, (uint8_t*)a);
>  svuint8_t vb = svld1 (all_true, (uint8_t*)b);
>  svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
>  a += svcntb();
>  b += svcntb();
>  c += svcntb();
>  }
>  }
>use (a);
> }
>
> IVOPTs tends to generate a shared IV for SVE memory accesses, as we
> don't have a post-increment for SVE load/stores. If we had not included
> 'use (a);' in this example, IVOPTs would have replaced the IV's for a, b
> and c with a single one (also used for the loop-control). See:
>
> [local count: 955630225]:
># ivtmp.7_8 = PHI 
>va_14 = MEM  [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
>vb_15 = MEM  [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
>MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
>ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
>i_23 = (unsigned int) ivtmp.7_25;
>if (n_9(D) > i_23)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
>   However, due to the 'use (a);' it will create two IVs: one for
> loop-control, b and c, and one for a. See:
>
>[local count: 955630225]:
># a_28 = PHI 
># ivtmp.7_25 = PHI 
>va_15 = MEM  [(unsigned char *)a_28];
>vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>a_18 = a_28 + POLY_INT_CST [16, 16];
>ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>i_8 = (unsigned int) ivtmp.7_24;
>if (n_10(D) > i_8)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> With the first patch attached in this RFC 'no_cost.patch', I tell IVOPTs
> to not cost uses outside of the loop. This makes IVOPTs generate a
> single IV, but unfortunately it decides to create the variable for the
> use inside the loop and it also seems to use the pre-increment value of
> the shared-IV and add the [16,16] to it. See:
>
> [local count: 955630225]:
># ivtmp.7_25 = PHI 
>va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>_8 = (unsigned long) a_11(D);
>_7 = _8 + ivtmp.7_25;
>_6 = _7 + POLY_INT_CST [16, 16];
>a_18 = (char * restrict) _6;
>ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>i_5 = (unsigned int) ivtmp.7_24;
>if (n_10(D) > i_5)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> With the patch 'var_after.patch' I make get_computation_aff_1 use
> 'cand->var_after' for outside uses thus using the post-increment var of
> the candidate IV. This means I have to insert it in a different place
> and make sure to delete the old use->stmt. I'm sure there is a better
> way to do this using IVOPTs current framework, but I didn't find one
> yet. See the result:
>
>[local count: 955630225]:
># ivtmp.7_25 = PHI 
>va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>_8 = (unsigned long) a_11(D);
>_7 = _8 + ivtmp.7_24;
>a_18 = (char * restrict) _7;
>i_6 = (unsigned int) ivtmp.7_24;
>if (n_10(D) > i_6)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
>
> This is still not optimal as we are still doing the update inside the
> loop and there is absolutely no need for that. I found that running sink
> would solve it and it seems someone has added a second sink pass, so
> that saves me a third patch :) see after sink2:
>
> [local count: 955630225]:
># ivtmp.7_25 = PHI 
>va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, 

Re: [PATCH PR100740]Fix overflow check in simplifying exit cond comparing two IVs.

2021-06-06 Thread Bin.Cheng via Gcc-patches
On Wed, Jun 2, 2021 at 3:28 PM Richard Biener via Gcc-patches wrote:
>
> On Tue, Jun 1, 2021 at 4:00 PM bin.cheng via Gcc-patches
>  wrote:
> >
> > Hi,
> > As described in patch summary, this fixes the wrong code issue by adding 
> > overflow-ness
> > check for iv1.step - iv2.step.
> >
> > Bootstrap and test on x86_64.  Any comments?
>
> + bool wrap_p = TYPE_OVERFLOW_WRAPS (step_type);
> + if (wrap_p)
> +   {
> + tree t = fold_binary_to_constant (GE_EXPR, step_type,
> +   iv0->step, iv1->step);
> + wrap_p = integer_zerop (t);
> +   }
>
> I think we can't use TYPE_OVERFLOW_WRAPS/TYPE_OVERFLOW_UNDEFINED since
> that's only relevant for expressions written by the user - we're
> computing iv0.step - iv1.step
> which can even overflow when TYPE_OVERFLOW_UNDEFINED (in fact we may not
> even generate this expression then!).  So I think we have to do sth like
>
>/* If the iv0->step - iv1->step wraps, fail.  */
>if (!operand_equal_p (iv0->step, iv1->step)
>&& (TREE_CODE (iv0->step) != INTEGER_CST || TREE_CODE (iv1->step) != INTEGER_CST)
>&& !wi::gt (wi::to_widest (iv0->step), wi::to_widest (iv1->step))
>  return false;
>
> which only handles equality and all integer constant steps. You could
Thanks for the suggestion.  I realized that we have LE/LT/NE
conditions here, and for LE/LT what we need to check is that iv0/iv1
converge to each other, rather than diverge.  Also, steps here can only
be constants, so there is no need to use range information.

> also use ranges
> like
>
>  wide_int min0, max0, min1, max1;
>   if (!operand_equal_p (iv0->step, iv1->step)
>   && (determine_value_range (iv0->step, &min0, &max0) != VR_RANGE
>  || determine_value_range (iv1->step, &min1, &max1) != VR_RANGE
>  || !wi::ge (min0, max1)))
>return false;
>
> Note I'm not sure why
>
>iv0->step = step;
>if (!POINTER_TYPE_P (type))
> iv0->no_overflow = false;
I don't exactly remember, this was added sometime when no_overflow was
introduced.  Note we only do various checks for non NE_EXPR so the
step isn't always less in absolute value?  I will check if we should
reset it in all cases.

Patch updated.  test ongoing.

Thanks,
bin
>
> here the no_overflow reset does not happen for pointer types?  Or
> rather why does
> it happen at all?  Don't we strictly make the step less in absolute value?
>
> > Thanks,
> > bin
From 395dd6595cabebb7fd3e71a5fbfe84544d598608 Mon Sep 17 00:00:00 2001
Author: Bin Cheng 
Date: Fri, 28 May 2021 16:49:54 +0800
Subject: [PATCH] Simplify exit cond comparing two IVs only if they converge.

We should also check that iv1.step >= iv2.step so that the two IVs
converge to each other under comparison condition LE_EXPR/LT_EXPR.

gcc:
PR tree-optimization/100740
* tree-ssa-loop-niter.c (number_of_iterations_cond): Check
  whether the IVs converge or not.

gcc/testsuite:
* gcc.c-torture/execute/pr100740.c: New test.
---
 .../gcc.c-torture/execute/pr100740.c  | 11 ++
 gcc/tree-ssa-loop-niter.c | 20 +--
 2 files changed, 25 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr100740.c

diff --git a/gcc/testsuite/gcc.c-torture/execute/pr100740.c b/gcc/testsuite/gcc.c-torture/execute/pr100740.c
new file mode 100644
index 000..8fcdaffef3b
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr100740.c
@@ -0,0 +1,11 @@
+/* PR tree-optimization/100740 */
+
+unsigned a, b;
+int main() {
+  unsigned c = 0;
+  for (a = 0; a < 2; a++)
+for (b = 0; b < 2; b++)
+  if (++c < a)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 325bd978609..6240084782a 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1782,7 +1782,9 @@ number_of_iterations_cond (class loop *loop,
  provided that either below condition is satisfied:
 
a) the test is NE_EXPR;
-   b) iv0.step - iv1.step is integer and iv0/iv1 don't overflow.
+   b) iv0.step and iv1.step are integers and:
+ - iv0 and iv1 don't overflow.
+ - iv0 and iv1 converge to each other, under cond LE_EXPR/LT_EXPR.
 
  This rarely occurs in practice, but it is simple enough to manage.  */
   if (!integer_zerop (iv0->step) && !integer_zerop (iv1->step))
@@ -1790,14 +1792,20 @@ number_of_iterations_cond (class loop *loop,
   tree step_type = POINTER_TYPE_P (type) ? sizetype : type;
   tree step = fold_binary_to_constant (MINUS_EXPR, step_type,
   iv0->step, iv1->step);

[PATCH PR100740]Fix overflow check in simplifying exit cond comparing two IVs.

2021-06-01 Thread bin.cheng via Gcc-patches
Hi,
As described in patch summary, this fixes the wrong code issue by adding 
overflow-ness
check for iv1.step - iv2.step.

Bootstrap and test on x86_64.  Any comments?

Thanks,
bin

pr100740-20210525.txt
Description: Binary data


[PATCH PR100499]Fix wrong niter info caused by overflow behavior

2021-05-26 Thread bin.cheng via Gcc-patches
Hi,
As described in PR100499, loop niters analysis for "!=" now relies on 
multiple_of_p which
so far is mostly implemented for no-overflow scenarios.  This patch fixes the 
issue by:
1. add new parameter to multiple_of_p indicating no-wrapping behavior in top 
expression.
2. pass new argument to help multiple_of_p, for example, the case in 
stor-layout.c.
3. adjust calls in number_of_iterations_ne by checking multiple_of_p without 
dependence
on wrapping/no-wrapping behavior.  It also corrects one issue in PR72817.

Noticed there are some improvements in ranger to help analysis of 
multiple_of_p.  For
TOP expression like "left op right", we would like to check if this specific op 
wraps or not.
For example:
1. left (unsigned, multiple of 3) + 0xfffd: the result is multiple of 3 if 
"+" wraps even if
right operand is not multiple of 3.
2. left (unsigned, multiple of 3) + 6: the result is not multiple of 3 if "+" 
wraps even if the
right operand is multiple of 3.
So in the end, we might need ranger to tell if such an operator wraps or not, i.e.,
must_overflow,
must_not_overflow, may_overflow.  Of course, we need to be conservative in the 
last case.
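
The two wrap cases above can be verified with 16-bit arithmetic (a
standalone sketch, not part of the patch):

#include <stdint.h>
#include <stdio.h>

int main (void)
{
  /* Case 1: left is a multiple of 3; adding 0xfffd wraps to left - 3,
     so the result stays a multiple of 3 although 0xfffd is not one.  */
  uint16_t left1 = 60000;           /* 3 * 20000 */
  uint16_t r1 = left1 + 0xfffd;     /* wraps to 59997 */
  /* Case 2: both operands are multiples of 3, but the wrap subtracts
     65536 (== 1 mod 3), so the result is no longer a multiple of 3.  */
  uint16_t left2 = 65532;           /* 3 * 21844 */
  uint16_t r2 = left2 + 6;          /* wraps to 2 */
  printf ("%d %d\n", r1 % 3, r2 % 3);   /* prints 0 2 */
  return 0;
}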

Bootstrap and test on x86_64.  Any comments?

Thanks,
bin

pr100499-20210520.txt
Description: Binary data


Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)

2021-05-07 Thread Bin.Cheng via Gcc-patches
On Fri, Apr 30, 2021 at 1:20 PM Feng Xue OS via Gcc-patches wrote:
>
> >> This patch implements a new loop optimization according to the proposal
> >> in RFC given at
> >> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html.
> >> So do not repeat the idea in this mail. Hope your comments on it.
> >
> > With the caveat that I'm not an optimization expert (but no one else
> > seems to have replied), here are some thoughts.
> >
> > [...snip...]
> >
> >> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
> >>  [PR98598]
> >
> > BTW, did you mean to also post patches 2 and 3?
> >
>
> Not yet, but they are ready. Since this is a kind of special optimization that
> uses the heap as temporary storage, not a common means in gcc, we did not know
> the basic attitude of the community towards it. So only the first patch was sent
> out for initial comments, in that it implements a generic MGO framework, and
> out for initial comments, in that it implements a generic MGO framework, and
> is complete and self-contained. Other 2 patches just composed some
> enhancements for specific code pattern and dynamic alias check. If possible,
> this proposal would be accepted principally, we will submit other 2 for 
> review.
>
> >
> >> In nested loops, if scattered memory accesses inside inner loop remain
> >> unchanged in outer loop, we can sequentialize these loads by caching
> >> their values into a temporary memory region at the first time, and
> >> reuse the cached data in following iterations. This can improve the
> >> efficiency of the cpu cache subsystem by reducing its unpredictable activities.
> >
> > I don't think you've cited any performance numbers so far.  Does the
> > optimization show a measurable gain on some benchmark(s)?  e.g. is this
> > ready to run SPEC yet, and how does it do?
>
> Yes, we have done that. Minor improvements of several percentage points
> can be gained for some real applications. To be specific, we also see a major
> improvement of more than 30% for a certain benchmark in SPEC2017.
Hi Feng Xue,
Could you help point out which bench it is?  I failed to observe
improvement in spec2017 on local x86 machine.  I am running with O3
level.

Thanks,
bin
>
> >
> >> To illustrate what the optimization will do, two pieces of pseudo code,
> >> before and after transformation, are given. Suppose all loads and
> >> "iter_count" are invariant in outer loop.
> >>
> >> From:
> >>
> >>   outer-loop ()
> >> {
> >>   inner-loop (iter, iter_count)
> >> {
> >>   Type1 v1 = LOAD (iter);
> >>   Type2 v2 = LOAD (v1);
> >>   Type3 v3 = LOAD (v2);
> >>   ...
> >>   iter = NEXT (iter);
> >> }
> >> }
> >>
> >> To:
> >>
> >>   typedef struct cache_elem
> >> {
> >>   bool   init;
> >>   Type1  c_v1;
> >>   Type2  c_v2;
> >>   Type3  c_v3;
> >
> > Putting the "bool init;" at the front made me think "what about
> > packing?" but presumably the idea is that every element is accessed in
> > order, so it presumably benefits speed to have "init" at the top of the
> > element, right?
>
> Yes, the layout of the struct could be optimized in terms of size by
> some means, such as:
>   o. packing "init" into a padding hole after certain field
>   o. if certain field is a pointer type, the field can take the role of "init"
>   (Non-NULL implies "initialized")
> Now this simple scheme is straightforward, and would be enhanced
> in various aspects later.
>
> >> } cache_elem;
> >>
> >>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
>
> > What if the allocation fails at runtime?  Do we keep an unoptimized
> > copy of the nested loops around as a fallback and have an unlikely
> > branch to that copy?
>
> Yes, we should, but in a different way: a flag is added into the original
> nested loop to control a runtime switch between optimized and
> unoptimized execution. This definitely incurs runtime cost, but
> avoids possible code size bloat. A better handling, as a TODO, is
> to apply the dynamic switch for large loops, and loop cloning for small ones.
>
> > I notice that you're using calloc, presumably to clear all of the
> > "init" flags (and the whole buffer).
> >
> > FWIW, this feels like a case where it would be nice to have a thread-
> > local heap allocation, perhaps something like an obstack implemented in
> > the standard library - but that's obviously scope creep for this.
>
> Yes, that's good, specially for many-thread application.
>
> > Could it make sense to use alloca for small allocations?  (or is that
> > scope creep?)
>
> We did consider using alloca as you said.  But if we cannot determine an
> upper limit for a non-constant size, we have to place alloca inside a loop that
> encloses the nested loop. Without a corresponding free operation, this
> kind of alloca-in-loop might cause stack overflow. So it becomes another
> TODO.
>
> >>   outer-loop ()
> >> {
> >>   size_t cache_idx = 0;
> >>
> >>   inner-loop (iter, iter_count)
> >> {
> >>   if 
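
For readers, a hedged completion of the "To:" sketch above, in the same
pseudo-code style (illustrative only; it also folds in the calloc-failure
fallback flag described in the reply):

  cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
  bool use_cache = (cache_arr != NULL);   /* runtime switch on alloc failure */

  outer-loop ()
    {
      size_t cache_idx = 0;

      inner-loop (iter, iter_count)
        {
          Type1 v1; Type2 v2; Type3 v3;
          if (use_cache && cache_arr[cache_idx].init)
            {                             /* reuse the sequentialized values */
              v1 = cache_arr[cache_idx].c_v1;
              v2 = cache_arr[cache_idx].c_v2;
              v3 = cache_arr[cache_idx].c_v3;
            }
          else
            {
              v1 = LOAD (iter);           /* scattered loads, first visit */
              v2 = LOAD (v1);
              v3 = LOAD (v2);
              if (use_cache)
                {                         /* fill the cache entry */
                  cache_arr[cache_idx].c_v1 = v1;
                  cache_arr[cache_idx].c_v2 = v2;
                  cache_arr[cache_idx].c_v3 = v3;
                  cache_arr[cache_idx].init = true;
                }
            }
          ...
          cache_idx++;
          iter = NEXT (iter);
        }
    }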

Re: [PATCH RFC][PR98736]Compute and use programming order preserved RPO in loop distribution

2021-03-23 Thread bin.cheng via Gcc-patches
> > In the patch, I just duplicated and created a new function
> > loop_first_rev_post_order_compute_fn.
> > I am not sure if I should change the original function
> > pre_and_rev_post_order_compute_fn
> > (maybe not at this stage). I am not sure about the name either, though I
> > haven't got a better one.
> > Any comment is appreciated.
> 
> So your new function seems to do what rev_post_order_and_mark_dfs_back_seme 
> does
> when you specify for_iteration = true.  Specifically that function then does
> 
>If FOR_ITERATION is true then compute an RPO where SCCs form a
>contiguous region in the RPO array.

Right, this is exactly what I need.  Attached is the updated patch.
Bootstrap and test on x86_64.
> 
> it in particular should handle the situation well where there are multiple 
> exits
> from a loop to different outer loops (consider a switch stmt exiting to
> the immediate outer or to the next outer loop) - something your patch
> likely still handles "randomly" (though we of course generally do not handle
> such exits well).
> 
> The idea of the rev_post_order_and_mark_dfs_back_seme algorithm is to
> treat SCCs as single nodes for the "outermost" RPO walk and then
> continuously expand SCCs from outer to inner.
> 
> So for the testcase I'm quite sure using the this function for computing
> the RPO would work but extra thoughts are appreciated.
Maybe another more straightforward or easier to understand function name?

Thanks,
bin
> 
> Thanks,
> Richard.
>

pr98736-20210323.txt


[PATCH RFC][PR98736]Compute and use programming order preserved RPO in loop distribution

2021-03-22 Thread bin.cheng via Gcc-patches
Hi,

This is the patch for PR98736.  The root cause is as follows:

Use programing order preserved RPO in loop distribution.

Tree loop distribution uses RPO to build the reduced dependence graph;
it's important that RPO preserves the original programming order, and
usually it does.  However, when distributing a loop nest, the exit bb
could be placed before some loop basic blocks, though after the loop header.

This patch fixes the issue by preferring loop exit edge in DFS when
computing RPO.
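
As a hedged illustration of why the edge order matters (generic C, not GCC's
cfganal.c code): in a DFS, the successor visited first finishes first and
therefore lands last in the reverse post-order, so walking the loop exit edge
first places the exit bb after the whole loop body:

  #include <stdio.h>

  #define N 5
  /* Tiny CFG: 0 = preheader, 1 = loop header, 2 = latch, 3 = exit, 4 = end.
     Successor lists, -1-terminated; 1 -> 3 is the loop exit edge, listed
     first on purpose.  */
  static int succ[N][3] = { {1, -1, -1}, {3, 2, -1}, {1, -1, -1},
                            {4, -1, -1}, {-1, -1, -1} };
  static int visited[N], post[N], t;

  static void
  dfs (int bb)
  {
    visited[bb] = 1;
    for (int k = 0; succ[bb][k] >= 0; k++)
      if (!visited[succ[bb][k]])
        dfs (succ[bb][k]);
    post[t++] = bb;   /* record in post-order on the way back */
  }

  int
  main (void)
  {
    dfs (0);
    for (int i = t - 1; i >= 0; i--)   /* reverse of the post-order */
      printf ("%d ", post[i]);
    printf ("\n");   /* prints "0 1 2 3 4": exit (3) after latch (2).
                        With {2, 3, -1} in the header instead, the RPO
                        would be "0 1 3 4 2", exit before latch.  */
    return 0;
  }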

In the patch, I just duplicated and created a new function
loop_first_rev_post_order_compute_fn.
I am not sure if I should change the original function
pre_and_rev_post_order_compute_fn
(maybe not at this stage). I am not sure about the name either, though I haven't
got a better one.
Any comment is appreciated.

Bootstrap and test on x86_64.

Thanks,
bin

gcc/ChangeLog:

PR tree-optimization/98736
* cfganal.c (loop_first_rev_post_order_compute_fn): New function.
* cfganal.h (loop_first_rev_post_order_compute_fn): New decl.
* tree-loop-distribution.c (loop_distribution::bb_top_order_init):
Compute RPO with programming order preserved by calling the above.

gcc/testsuite/ChangeLog:

PR tree-optimization/98736
* gcc.c-torture/execute/pr98736.c: New test.

pr98736-20210322.txt


Re: [PATCH PR97627]Avoid computing niters info for fake edges

2021-01-29 Thread Bin.Cheng via Gcc-patches
On Fri, Jan 29, 2021 at 3:55 PM Richard Biener
 wrote:
>
> On Thu, Jan 28, 2021 at 11:31 AM Bin.Cheng  wrote:
> >
> > On Thu, Jan 28, 2021 at 5:08 PM Richard Biener via Gcc-patches
> >  wrote:
> > >
> > > On Thu, Jan 28, 2021 at 3:49 AM bin.cheng via Gcc-patches
> > >  wrote:
> > > >
> > > > Hi,
> > > > As described in commit message, we need to avoid computing niters info 
> > > > for fake
> > > > edges.  This simple patch does this by two changes.
> > > >
> > > > Bootstrap and test on X86_64, is it ok?
> > >
> > > Hmm, so I think the patch is a bit complicated and avoiding niter compute
> > > for fake edges would be easier by just returning false for
> > > fake edges in number_of_iterations_exit_assumptions?
> > I just grepped calls to get_loop_exit_edges, and thought there might
> > be cases other than niters analysis that would also like to skip fake
> > edges.  But I didn't check the calls one by one.
>
> My hunch is that the usual APIs always want to ignore them, but let's
> do a minimal fix that we can backport easily.
Yeah, please apply the trivial patch.

Thanks,
bin
>
> > >
> > > Which pass was the problematic one that had infinite loops connected to
> > > exit?
> > >
> > > I guess the cfgloop code should simply ignore fake exits - they mostly
> > > exist to make reverse CFG walks easy.  Specifically single_exit
> > > and single_likely_exit but also exit edge recording should ignore them.
> > >
> > > That said, the testcase seems to be fixed with just
> > >
> > > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > > index 7d61ef080eb..7775bc7275c 100644
> > > --- a/gcc/tree-ssa-loop-niter.c
> > > +++ b/gcc/tree-ssa-loop-niter.c
> > > @@ -2407,6 +2407,11 @@ number_of_iterations_exit_assumptions (class
> > > loop *loop, edge exit,
> > >affine_iv iv0, iv1;
> > >bool safe;
> > >
> > > +  /* The condition at a fake exit (if it exists) does not control its
> > > + execution.  */
> > > +  if (exit->flags & EDGE_FAKE)
> > > +return false;
> > > +
> > >/* Nothing to analyze if the loop is known to be infinite.  */
> > >if (loop_constraint_set_p (loop, LOOP_C_INFINITE))
> > >  return false;
> > >
> > > Your dfs_find_deadend change likely "breaks" post-dominance DFS order
> > > (this is a very fragile area).
> > >
> > > So any objection to just simplify the patch to the above hunk?
> > Considering we are in late stage3, no objection to this change.  But I
> > do think dfs_find_deadend needs to be improved, if not as this patch
> > does.  For a loop nest with the outermost loop as the infinite one,
> > the function adds fake (exit) edges for inner loops, which is
> > counter-intuitive.
>
> Sure, but then this is independent of the PR.  As said, the fake exits
> only exist to make reverse CFG walkers easier - yes, for natural
> infinite loops we'd like to have "intuitive" post-dom behavior but for
> example for irreducible regions there's not much to do.
>
> Richard.
>
> > Thanks,
> > bin
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks,
> > > > bin


Re: [PATCH PR97627]Avoid computing niters info for fake edges

2021-01-28 Thread Bin.Cheng via Gcc-patches
On Thu, Jan 28, 2021 at 5:08 PM Richard Biener via Gcc-patches
 wrote:
>
> On Thu, Jan 28, 2021 at 3:49 AM bin.cheng via Gcc-patches
>  wrote:
> >
> > Hi,
> > As described in commit message, we need to avoid computing niters info for 
> > fake
> > edges.  This simple patch does this by two changes.
> >
> > Bootstrap and test on X86_64, is it ok?
>
> Hmm, so I think the patch is a bit complicated and avoiding niter compute
> for fake edges would be easier by just returning false for
> fake edges in number_of_iterations_exit_assumptions?
I just grepped calls to get_loop_exit_edges, and thought there might
be cases other than niters analysis that would also like to skip fake
edges.  But I didn't check the calls one by one.
>
> Which pass was the problematic one that had infinite loops connected to exit?
>
> I guess the cfgloop code should simply ignore fake exits - they mostly
> exist to make reverse CFG walks easy.  Specifically single_exit
> and single_likely_exit but also exit edge recording should ignore them.
>
> That said, the testcase seems to be fixed with just
>
> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index 7d61ef080eb..7775bc7275c 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -2407,6 +2407,11 @@ number_of_iterations_exit_assumptions (class
> loop *loop, edge exit,
>affine_iv iv0, iv1;
>bool safe;
>
> +  /* The condition at a fake exit (if it exists) does not control its
> + execution.  */
> +  if (exit->flags & EDGE_FAKE)
> +return false;
> +
>/* Nothing to analyze if the loop is known to be infinite.  */
>if (loop_constraint_set_p (loop, LOOP_C_INFINITE))
>  return false;
>
> Your dfs_find_deadend change likely "breaks" post-dominance DFS order
> (this is a very fragile area).
>
> So any objection to just simplify the patch to the above hunk?
Considering we are in late stage3, no objection to this change.  But I
do think dfs_find_deadend needs to be improved, if not as this patch
does.  For a loop nest with the outermost loop as the infinite one,
the function adds fake (exit) edges for inner loops, which is
counter-intuitive.

Thanks,
bin
>
> Thanks,
> Richard.
>
> > Thanks,
> > bin


[PATCH PR97627]Avoid computing niters info for fake edges

2021-01-27 Thread bin.cheng via Gcc-patches
Hi,
As described in commit message, we need to avoid computing niters info for fake
edges.  This simple patch does this by two changes.  

Bootstrap and test on X86_64, is it ok?

Thanks,
bin

pr97627-20210128.patch


Re: Do we need to do a loop invariant motion after loop interchange?

2020-09-22 Thread Bin.Cheng via Gcc-patches
On Tue, Sep 22, 2020 at 10:30 AM HAO CHEN GUI  wrote:
>
> Bin,
>
> I just tested your patch on current trunk.  Here is my summary.
>
> 1. About some IVs that aren't moved out of the inner loop (Lijia mentioned in his
> last email)
>
>[local count: 955630226]:
># l_32 = PHI <1(12), l_54(21)>
># ivtmp_165 = PHI <_446(12), ivtmp_155(21)>
>_26 = (integer(kind=8)) l_32;
>_27 = _25 + _26;
>y__I_lsm.119_136 = (*y_135(D))[_27];
>y__I_lsm.119_90 = m_55 != 1 ? y__I_lsm.119_136 : 0.0;
>_37 = _36 * stride.88_111;
>_38 = _35 + _37;
>_39 = _26 + _38;
>_40 = (*a_137(D))[_39];
>
> The offset _39 is not loop independent as it relies on _26. But _38 and
> _37 should be loop independent. So Lijia thought they should be moved
> out of the loop.
>
> I checked the following passes and found that these statements are
> eliminated after vectorizing and dce.
>
> In vect dump,
>
> simple.F:27:23: note:  -->vectorizing statement: _37 = _36 *
> stride.88_111;
> simple.F:27:23: note:  -->vectorizing statement: _38 = _35 + _37;
> simple.F:27:23: note:  -->vectorizing statement: _39 = _26 + _38;
> simple.F:27:23: note:  -->vectorizing statement: _40 = (*a_137(D))[_39];
> simple.F:27:23: note:  transform statement.
> simple.F:27:23: note:  transform load. ncopies = 1
> simple.F:27:23: note:  create vector_type-pointer variable to type:
> vector(2) real(kind=8)  vectorizing an array ref: (*a_137(D))
> simple.F:27:23: note:  created vectp_a.131_383
> simple.F:27:23: note:  add new stmt: vect__40.132_374 = MEM  real(kind=8)> [(real(kind=8) *)vectp_a.130_376];
>
> In dce dump,
>
> Deleting : _39 = _26 + _38;
>
> Deleting : _38 = _35 + _37;
>
> Deleting : _37 = _36 * stride.88_111;
>
> So it's reasonable to only consider data references after loop
> interchange. Other statements may be eliminated or be moved out of the loop
> in the last lim pass if they're really expensive.
>
> 2. I tested the SPEC on powerpc64le-linux-gnu. 503.bwaves_r got 6.77%
> performance improvement with this patch. It has no impact on other
> benchmarks.
>
> 3. The patch passed bootstrapped and regression test on
> powerpc64le-linux-gnu.
>
> I think the patch works fine. Could you please add it into trunk? Thanks
> a lot.
Hmm, IIRC the patch was intended to show what the missing transform
is, and I think it has latent bugs which I haven't got time to refine.
As Richard mentioned, could you please explore this with the existing
LIM facility, rather than introducing new code implementing existing
transforms?

Thanks,
bin
>
>
> On 8/9/2020 下午 6:18, Bin.Cheng wrote:
> > On Mon, Sep 7, 2020 at 5:42 PM HAO CHEN GUI  wrote:
> >> Hi,
> >>
> >> I want to follow Lijia's work as I gained the performance benefit on
> >> some SPEC workloads by adding a lim pass after loop interchange.  Could
> >> you send me the latest patches? I could do further testing. Thanks a lot.
> > Hi,
> > Hmm, not sure if this refers to me?  I only provided an example patch
> > (which isn't complete) before Lijia's.  Unfortunately I don't have any
> > latest patch about this either.
> > As Richard suggested, maybe you (if you work on this) can simplify the
> > implementation.  Anyway, we only need to hoist memory references here.
> >
> > Thanks,
> > bin
> >> https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html
> >>


[PATCH PR93334][RFC]Skip output dep if values stored are byte wise the same

2020-09-15 Thread bin.cheng via Gcc-patches
Hi,
As suggested by the PR93334 comments, this patch adds an interface identifying
output dependences which can be skipped in terms of reordering, and skips them in
loop distribution.  It also adds a new test case.  Any comment?
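
For illustration, a hedged sketch of the kind of loop this targets (hypothetical,
not the attached testcase): the store to a[i + 1] in iteration i and the store
to a[i] in iteration i + 1 hit the same location, forming an output dependence,
but both write the same byte pattern, so reordering them is safe:

  void
  f (char *a, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[i] = 0;       /* same bytes stored ...  */
        a[i + 1] = 0;   /* ... so the loop-carried output dep can be skipped,
                           e.g. letting loop distribution form memsets  */
      }
  }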

Thanks,
bin

0001-Skip-output-dependence-if-values-stored-are-bytewise.patch


Re: Do we need to do a loop invariant motion after loop interchange?

2020-09-08 Thread Bin.Cheng via Gcc-patches
On Mon, Sep 7, 2020 at 5:42 PM HAO CHEN GUI  wrote:
>
> Hi,
>
> I want to follow Lijia's work as I gained the performance benefit on
> some SPEC workloads by adding a lim pass after loop interchange.  Could
> you send me the latest patches? I could do further testing. Thanks a lot.
Hi,
Hmm, not sure if this refers to me?  I only provided an example patch
(which isn't complete) before Lijia's.  Unfortunately I don't have any
latest patch about this either.
As Richard suggested, maybe you (if you work on this) can simplify the
implementation.  Anyway, we only need to hoist memory references here.

Thanks,
bin
>
> https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html
>


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-04 Thread Bin.Cheng via Gcc-patches
On Fri, Sep 4, 2020 at 6:37 AM Segher Boessenkool
 wrote:
>
> On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote:
> > on 2020/9/2 6:25 PM, Segher Boessenkool wrote:
> > > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> > >> on 2020/9/1 3:41 AM, Segher Boessenkool wrote:
> > >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> >  1) Currently address_cost hook on rs6000 always return zero, but at 
> >  least
> >  from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> >  have to take the address update into account (scalar normal operation).
> > >>>
> > >>> From Power4 on already (not sure about Power6, but does anyone care?)
> > >>
> > >> Thanks for the information, it looks this issue exists for a long time.
> > >
> > > Well, *is* it an issue?  The addressing doesn't get more expensive...
> > > For example, an
> > >   ldu 3,16(4)
> > > is cracked to an
> > >   ld 3,16(4)
> > > and an
> > >   addi 4,4,16
> > > (the addi is not on the critical path of the load).  So it seems to me
> > > this shouldn't increase the addressing cost at all?  (The instruction of
> > > course is really two insns in one.)
> >
> > Good question!  I agree that they can execute in parallel, but it depends
> > on how we interpret the addressing cost.  If it's for the required execution
> > resources, I think it's off, since compared with ld, the ldu has two iops
> > and an extra ALU requirement.
>
> OTOH, if you do not use an ldu you need to use a real addi insn, which
> gives you all the same cost (plus it takes more code space and decode etc.
> resources).
>
> > I'm not sure its usage elsewhere, but in the
> > context of IVOPTs on Power, for one normal candidate, its step cost is 4,
> > the cost for group (1) is zero, total cost is 4 for this combination.
> > for the scenario like:
> > ldx rx, iv // (1)
> > ...
> > iv = iv + step // (2)
> >
> > While for ainc_use candidate (like ldu), its step cost is 4, but the cost
> > for group (1) is -4 (minus the step cost), so the total cost is 0.  That
> > seems to say the step update is free.
>
> That seems wrong, but the address_cost is used in more places, that is
> not where to fix this?
>
> > We can also see (1) and (2) can also execute in parallel (same iteration).
> > If we consider the next iteration, it will have the dependency, but it's
> > the same for ldu.  So basically they are similar, I think it's unfair to
> > have this difference in the cost modeling.  The cracked addi should have
> > its cost here.  Does it make sense?
>
> It should have cost, certainly, but not address_cost I think.  The total
> cost of an ldu should be a tiny bit less than that of ld + that of addi;
> the address_cost of ldu should be the same as that of ld.
Hi Segher,
In simple cases, yes, and that is also the (rough) idea of modeling the
auto-inc addressing mode in ivopts; however, things are different if the
loop gets complicated.  Consider the case of choosing 10 auto-inc
addressing modes/candidates vs. [base_x + iv_index].  The latter only
needs one add instruction, while the former needs 10 embedded auto-inc
operations.
Another issue is register pressure: choosing auto-inc candidates could
result in more IVs, while choosing iv_index results in one IV (and more
base pointers); however, spilling a base pointer (which is loop
invariant) is usually cheaper than spilling an IV.
Another issue is that auto-inc candidates probably lead to more bloated
setup code in the preheader BB, due to problems in expression
canonicalization, CSE, etc.

So it's not that easy to answer the question for complicated cases.
As for simple cases, the current model works fine with auto-inc
(somehow) preferred.
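
A hedged example of the "complicated" shape meant above (illustrative only, not
from the patch): with several memory streams, a single shared index IV needs one
update per iteration, while auto-inc addressing embeds an update into every
access, here five per iteration:

  void
  sum4 (double *y, const double *a, const double *b,
        const double *c, const double *d, long n)
  {
    for (long i = 0; i < n; i++)   /* [base + index]: one IV, one add */
      y[i] = a[i] + b[i] + c[i] + d[i];
  }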

Thanks,
bin
>
> > Apart from that, one P9 specific point is that the update form load isn't
> > preferred,  the reason is that the instruction can not retire until both
> > parts complete, it can hold up subsequent instructions from retiring.
> > If the addi stalls (starvation), the instruction can not retire and can
> > cause things stuck.  It seems also something we can model here?
>
> This is (almost) no problem on p9, since we no longer have issue groups.
> It can hold up older insns from retiring, sure, but they *will* have
> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> is gone.  The only problem is if things stick around so long that
> resources run out...  but you're talking 100s of insns there.
>
>
> Segher


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-01 Thread Bin.Cheng via Gcc-patches
On Wed, Sep 2, 2020 at 11:50 AM Kewen.Lin  wrote:
>
> Hi Bin,
>
> >> 2) This case makes me think we should exclude ainc candidates in function
> >> mark_reg_offset_candidates.  The justification is that: ainc candidate
> >> handles step update itself and when we calculate the cost for it against
> >> its ainc_use, the cost_step has been reduced. When unrolling happens,
> >> the ainc computations are replicated and it doesn't save step updates
> >> like normal reg_offset_p candidates.
> > Though auto-inc candidate embeds stepping operation into memory
> > access, we might want to avoid it in case of unroll if there are many
> > sequences of memory accesses, and if the unroll factor is big.  The
> > rationale is embedded stepping is a u-arch operation and does have its
> > cost.
> >
>
> Thanks for the comments!  Agree!  Excluding them from reg_offset_p
> candidates here is consistent with this expectation, it makes us
> consider the unroll factor effect when checking the corresponding
> step cost and the embedded stepping cost (in group/candidate cost,
> minus step cost and use the cost from the address_cost hook).
>
> >>
> >> I've updated the patch to punt ainc_use candidates as below:
> >>
> >>> + /* Skip AINC candidate since it contains address update itself,
> >>> +the replicated AINC computations when unrolling still have
> >>> +updates, unlike reg_offset_p candidates can save step updates
> >>> +when unrolling.  */
> >>> + if (cand->ainc_use)
> >>> +   continue;
> >>> +
> >>
> >> For this new attached patch, it's bootstrapped and regress-tested without
> >> explicit unrolling, while the only one failure has been identified as
> >> rs6000 specific address_cost issue in bootstrapping and regression testing
> >> with explicit unrolling.
> >>
> >> By the way, with above simple hack of address_cost, I also did one
> >> bootstrapping and regression testing with explicit unrolling, the above
> >> sms-4.c did pass as I expected but had two failures instead:
> >>
> >>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
> >>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts 
> >> "PHI" 2
> >>
> >> By further investigation, the 2nd one is expected due to the address_cost
> >> hook hack, while the 1st one exposed a -fcompare-debug issue in sms.  The
> >> RTL sequence starts to diverge from sms on, just some NOTE_INSN_DELETED
> >> positions change.
> >> I believe it's just exposed by this combination unluckily/luckily ;-) I 
> >> will
> >> send a patch separately for it once I got time to look into it, but it 
> >> should
> >> be unrelated to this patch series for sure.
> > This is the kind of situation I intended to avoid before.  IMHO, this
> > isn't a neat change (it can't be given we are predicting the future
> > transformation in compilation pipeline), accumulation of such changes
> > could make IVOPTs break in one way or another.  So as long as you make
> > sure it doesn't have functional impact in case of no-rtl_unroll, I am
> > fine.
>
> Yeah, I admit it's not neat, but the proposals in the previous discussions
> without predicting unroll factor can not work well for all scenarios with
> different unroll factors, they could over-blame some kind of candidates.
> For the case of no-rtl_unroll, unroll factor estimation should set
> loop->estimated_unroll to zero, all these changes won't take effect. The
> estimation function follows the same logics as that of RTL unroll factor
> calculation, I did test with explicit unrolling disablement before, it
> worked expectedly.
Thanks for working on this, and sorry for nitpicking.

Thanks,
bin


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-01 Thread Bin.Cheng via Gcc-patches
On Tue, Aug 25, 2020 at 8:47 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> >>
> >> For one particular case like:
> >>
> >> for (i = 0; i < SIZE; i++)
> >>   y[i] = a * x[i] + z[i];
> >>
> >> we will mark reg_offset_p for IV candidates on x as below:
> >>- (unsigned long) (x_18(D) + 8)// only mark this before.
> >>- x_18(D) + 8
> >>- (unsigned long) (x_18(D) + 24)
> >>- (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 
> >> 18446744073709551600)
> >>...
> >>
> >> Do you mind to have a review again?  Thanks in advance!
> > I trust you with the change.
>
> Thanks again!  Sorry for the late since it took some time to investigate
> the exposed issues.
>
> >>
> >> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.
> >>
> >> SPEC2017 P9 performance run has no remarkable degradations/improvements.
> > Is this run with unroll-loops?
>
> Yes, the options what I used were:
>
>-Ofast -mcpu=power9 -fpeel-loops -mrecip -funroll-loops
>
> I also re-tested the newly attached patch, nothing changes for SPEC2017 data.
>
> > Could you exercise the code with unroll-loops enabled when
> > bootstrap/regtest please?  It doesn't matter if cases fail with
> > unroll-loops, just making sure there is no fallout.  Otherwise it's
> > fine with me.
>
> Great idea!  With explicitly specified -funroll-loops, it's bootstrapped
> but the regression testing did show one failure (the only one):
>
>   PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1
>
> It exposes two issues:
>
> 1) Currently address_cost hook on rs6000 always return zero, but at least
> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> have to take the address update into account (scalar normal operation).
> Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer
> ainc candidates.  In this case, the cand/group cost is -4 (minus cost_step),
> with scaling up, the off becomes much.  With one simple hack on for pre_inc/
> pre_dec in rs6000 address_cost, the case passed.  It should be handled in
> one separated issue.
>
> 2) This case makes me think we should exclude ainc candidates in function
> mark_reg_offset_candidates.  The justification is that: ainc candidate
> handles step update itself and when we calculate the cost for it against
> its ainc_use, the cost_step has been reduced. When unrolling happens,
> the ainc computations are replicated and it doesn't save step updates
> like normal reg_offset_p candidates.
Though auto-inc candidate embeds stepping operation into memory
access, we might want to avoid it in case of unroll if there are many
sequences of memory accesses, and if the unroll factor is big.  The
rationale is embedded stepping is a u-arch operation and does have its
cost.

>
> I've updated the patch to punt ainc_use candidates as below:
>
> > + /* Skip AINC candidate since it contains address update itself,
> > +the replicated AINC computations when unrolling still have
> > +updates, unlike reg_offset_p candidates can save step updates
> > +when unrolling.  */
> > + if (cand->ainc_use)
> > +   continue;
> > +
>
> For this new attached patch, it's bootstrapped and regress-tested without
> explicit unrolling, while the only one failure has been identified as
> rs6000 specific address_cost issue in bootstrapping and regression testing
> with explicit unrolling.
>
> By the way, with above simple hack of address_cost, I also did one
> bootstrapping and regression testing with explicit unrolling, the above
> sms-4.c did pass as I expected but had two failures instead:
>
>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2
>
> By further investigation, the 2nd one is expected due to the address_cost
> hook hack, while the 1st one exposed a -fcompare-debug issue in sms.  The RTL
> sequence starts to diverge from sms on, just some NOTE_INSN_DELETED positions change.
> I believe it's just exposed by this combination unluckily/luckily ;-) I will
> send a patch separately for it once I got time to look into it, but it should
> be unrelated to this patch series for sure.
This is the kind of situation I intended to avoid before.  IMHO, this
isn't a neat change (it can't be given we are predicting the future
transformation in compilation pipeline), accumulation of such changes
could make IVOPTs break in one way or another.  So as long as you make
sure it doesn't have functional impact in case of no-rtl_unroll, I am
fine.
>
> In a word, bootstrapping/regress-testing with unroll-loops enabled shows this
> patch looks fine.
>
> BR,
> Kewen
> -
> gcc/ChangeLog:
>
> * tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
> (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
> (mark_reg_offset_candidates): New function.
> (add_candidate_1): Set 

Re: [PATCH 3/4 v2] ivopts: Consider cost_step on different forms during unrolling

2020-08-21 Thread Bin.Cheng via Gcc-patches
On Tue, Aug 18, 2020 at 5:03 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> > I see, it's similar to the auto-increment case where cost should be
> > recorded only once.  So this is okay given that 1) precisely predicting
> > rtl-unroll is likely impossible here; 2) the patch has very limited
> > impact.
> >
> Really appreciate your time and patience!
>
> I extended the previous version to address Richard S.'s comments on
> candidates with the same base/step but different offsets here:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547014.html.
>
> The previous version only allows the candidate derived from the group
> of interest; this updated patch extends it to those which have the
> same base/step and the same or different offsets within the acceptable range
> when considering unrolling.
>
> For one particular case like:
>
> for (i = 0; i < SIZE; i++)
>   y[i] = a * x[i] + z[i];
>
> we will mark reg_offset_p for IV candidates on x as below:
>- (unsigned long) (x_18(D) + 8)// only mark this before.
>- x_18(D) + 8
>- (unsigned long) (x_18(D) + 24)
>- (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 
> 18446744073709551600)
>...
>
> Do you mind to have a review again?  Thanks in advance!
I trust you with the change.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.
>
> SPEC2017 P9 performance run has no remarkable degradations/improvements.
Is this run with unroll-loops?
Could you exercise the code with unroll-loops enabled when
bootstrap/regtest please?  It doesn't matter if cases fail with
unroll-loops, just making sure there is no fallout.  Otherwise it's
fine with me.

Thanks,
bin
>
> BR,
> Kewen
> -
> gcc/ChangeLog:
>
> * tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
> (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
> (mark_reg_offset_candidates): New function.
> (add_candidate_1): Set reg_offset_p to false for new candidate.
> (set_group_iv_cost): Scale up group cost with estimate_unroll_factor 
> if
> consider_reg_offset_for_unroll_p.
> (determine_iv_cost): Increase step cost with estimate_unroll_factor if
> consider_reg_offset_for_unroll_p.
> (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
> consider_reg_offset_for_unroll_p.
>


Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling

2020-08-15 Thread Bin.Cheng via Gcc-patches
On Mon, Aug 10, 2020 at 10:41 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> on 2020/8/10 8:38 PM, Bin.Cheng wrote:
> > On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin  wrote:
> >>
> >> Hi Bin,
> >>
> >> Thanks for the review!!
> >>
> >> on 2020/8/8 4:01 PM, Bin.Cheng wrote:
> >>> Hi Kewen,
> >>> Sorry for the late reply.
> >>> The patch's most important change is below cost computation:
> >>>
>  @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, 
>  struct iv_cand *cand)
>  cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>    cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> 
>  +  /* Consider additional step updates during unrolling.  */
>  +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
>  +cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> >>> This is a bit strange, to me the add instructions are additional
> >>> computation caused by unrolling+addressing_mode, rather than a native
> >>> part in candidate itself.  Specifically, an additional cost is needed
> >>> if candidates (without reg_offset_p) are chosen for the address type
> >>> group/uses.
> >>
> >> Good point, ideally it should be one additional cost for each cand set,
> >> when we select one cand for one group, we need to check this pair need
> >> more (estimated_unroll - 1) step costs, we probably need to care about
> >> this during remove/replace etc.  IIUC the current IVOPTs cost framework
> >> doesn't support this and it could increase the selection complexity and
> >> time.  I hesitated to do it and put it to cand step cost initially instead.
> >>
> >> I was thinking those candidates with reg_offset_p should be only used for
> >> those reg_offset_p groups in most cases (very limited) meanwhile the others
> >> are simply scaled up like before.  But indeed this can cover some similar
> >> cases like one cand is only used for the compare type group which is for
> >> loop closing, then it doesn't need more step costs for unrolling.
> >>
> >> Do you prefer me to improve the current cost framework?
> > No, I don't think it's relevant to the candidate selecting algorithm.
> > I was thinking about adjusting cost somehow in
> > determine_group_iv_cost_address. Given we don't expose selected
> > addressing mode in this function, you may need to do it in
> > get_address_cost, either way.
> >
>
> Thanks for your suggestion!
>
> Sorry, I may miss something, but I still think the additional cost is
> per candidate.  The justification is that we miss to model the iv
> candidate step well in the context of unrolling, the step cost is part
> of candidate cost, which is per candidate.
>
> To initialize it in determine_iv_cost isn't perfect as you pointed out,
> ideally we should check any uses of the candidate requires iv update
> after each replicated iteration, and take extra step costs into account
> if at least one needs, meanwhile scaling up all the computation cost to
> reflect unrolling cost nature.
I see, it's similar to the auto-increment case where cost should be
recorded only once.  So this is okay given that 1) precisely predicting
rtl-unroll is likely impossible here; 2) the patch has very limited
impact.

Thanks,
bin
>
> Besides, the reg_offset desirable pair already takes zero cost for
> cand/group cost, IIRC negative cost isn't preferred in IVOPTs, are you
> suggesting increasing the cost for non reg_offset pairs?  If so and per
> pair, the extra cost looks possible to be computed several times
> unexpectedly.
>
> >>
>  +
>    /* Prefer the original ivs unless we may gain something by replacing 
>  it.
>   The reason is to make debugging simpler; so this is not relevant for
>   artificial ivs created by other optimization passes.  */
> 
> >>>
>  @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>    return;
>  }
> 
>  +  /* Since we priced more on non reg_offset IV cand step cost, we 
>  should scale
>  + up the appropriate IV group costs.  Simply consider USE_COMPARE at 
>  the
>  + loop exit, FIXME if multiple exits supported or no loop exit 
>  comparisons
>  + matter.  */
>  +  if (data->consider_reg_offset_for_unroll_p
>  +  && group->vuses[0]->type != USE_COMPARE)
>  +cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> >>> I don't quite follow here; "pricing more on non-reg_offset IV cand"
> >>> doesn't make much sense to me.  Also, why are generic type uses not
> >>> skipped?  We want to model the cost required for address computation,
> >>> however, for generic type uses there is no way to save the computation
> >>> in "address expression".  Once unrolled, the computation is always
> >>> there?
> >>>
> >>
> >> The main intention is to scale up the group/cand cost for unrolling since
> >> we have scaled up the step costs.  The assumption is that the original
> > If we adjust cost 

Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling

2020-08-10 Thread Bin.Cheng via Gcc-patches
On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> Thanks for the review!!
>
> on 2020/8/8 4:01 PM, Bin.Cheng wrote:
> > Hi Kewen,
> > Sorry for the late reply.
> > The patch's most important change is below cost computation:
> >
> >> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct 
> >> iv_cand *cand)
> >> cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
> >>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> >>
> >> +  /* Consider additional step updates during unrolling.  */
> >> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> >> +cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> > This is a bit strange, to me the add instructions are additional
> > computation caused by unrolling+addressing_mode, rather than a native
> > part in candidate itself.  Specifically, an additional cost is needed
> > if candidates (without reg_offset_p) are chosen for the address type
> > group/uses.
>
> Good point, ideally it should be one additional cost for each cand set,
> when we select one cand for one group, we need to check this pair need
> more (estimated_unroll - 1) step costs, we probably need to care about
> this during remove/replace etc.  IIUC the current IVOPTs cost framework
> doesn't support this and it could increase the selection complexity and
> time.  I hesitated to do it and put it to cand step cost initially instead.
>
> I was thinking those candidates with reg_offset_p should be only used for
> those reg_offset_p groups in most cases (very limited) meanwhile the others
> are simply scaled up like before.  But indeed this can cover some similar
> cases like one cand is only used for the compare type group which is for
> loop closing, then it doesn't need more step costs for unrolling.
>
> Do you prefer me to improve the current cost framework?
No, I don't think it's relevant to the candidate selecting algorithm.
I was thinking about adjusting cost somehow in
determine_group_iv_cost_address. Given we don't expose selected
addressing mode in this function, you may need to do it in
get_address_cost, either way.

>
> >> +
> >>   /* Prefer the original ivs unless we may gain something by replacing it.
> >>  The reason is to make debugging simpler; so this is not relevant for
> >>  artificial ivs created by other optimization passes.  */
> >>
> >
> >> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
> >>   return;
> >> }
> >>
> >> +  /* Since we priced more on non reg_offset IV cand step cost, we should 
> >> scale
> >> + up the appropriate IV group costs.  Simply consider USE_COMPARE at 
> >> the
> >> + loop exit, FIXME if multiple exits supported or no loop exit 
> >> comparisons
> >> + matter.  */
> >> +  if (data->consider_reg_offset_for_unroll_p
> >> +  && group->vuses[0]->type != USE_COMPARE)
> >> +cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> > I don't quite follow here; "pricing more on non-reg_offset IV cand"
> > doesn't make much sense to me.  Also, why are generic type uses not
> > skipped?  We want to model the cost required for address computation,
> > however, for generic type uses there is no way to save the computation
> > in "address expression".  Once unrolled, the computation is always
> > there?
> >
>
> The main intention is to scale up the group/cand cost for unrolling since
> we have scaled up the step costs.  The assumption is that the original
If we adjust cost appropriately in function *group_iv_cost_address,
this would become unnecessary, right?  And naturally.
> costing (without this patch) can be viewed as either for all unrolled
> iterations or just one single iteration.  Since IVOPTs doesn't support
> fractional costing, I interpreted it as a single iteration; to emulate the
> unrolling scenario based on the previous step cost scaling, we need to
> scale up the cost for all computation.
>
> In most cases, the compare type use is for loop closing, there is still
> only one computation even unrolling happens, so I excluded it here.
> As "FIXME", if we find some cases are off, we can further restrict it to
> those USE_COMPARE uses which is exactly for loop closing.
>
> > And what's the impact on targets supporting [base + index + offset]
> > addressing mode?
>
> Good question, I didn't notice it since power doesn't support it.
> I noticed the comments of function addr_offset_valid_p only mentioning
> [base + offset], I guess it excludes [base + index + offset]?
> But I guess the address-based IV can work for this mode?
No, addr_offset_valid_p is only used to split address use groups.  See
get_address_cost and struct mem_address.
I forgot to ask, what about targets that only support the [base + offset]
addressing mode, like RISC-V?  I would expect they're not affected at all.

>
> >
> > Given the patch is not likely to harm because rtl loop unrolling is
> > (or was?) by default disabled, so I am OK once the 

Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling

2020-08-08 Thread Bin.Cheng via Gcc-patches
Hi Kewen,
Sorry for the late reply.
The patch's most important change is below cost computation:

> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct 
> iv_cand *cand)
> cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
>
> +  /* Consider additional step updates during unrolling.  */
> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> +cost += (data->current_loop->estimated_unroll - 1) * cost_step;
This is a bit strange, to me the add instructions are additional
computation caused by unrolling+addressing_mode, rather than a native
part in candidate itself.  Specifically, an additional cost is needed
if candidates (without reg_offset_p) are chosen for the address type
group/uses.
> +
>   /* Prefer the original ivs unless we may gain something by replacing it.
>  The reason is to make debugging simpler; so this is not relevant for
>  artificial ivs created by other optimization passes.  */
>

> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>   return;
> }
>
> +  /* Since we priced more on non reg_offset IV cand step cost, we should 
> scale
> + up the appropriate IV group costs.  Simply consider USE_COMPARE at the
> + loop exit, FIXME if multiple exits supported or no loop exit comparisons
> + matter.  */
> +  if (data->consider_reg_offset_for_unroll_p
> +  && group->vuses[0]->type != USE_COMPARE)
> +cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
I don't quite follow here; "pricing more on non-reg_offset IV cand"
doesn't make much sense to me.  Also, why are generic type uses not
skipped?  We want to model the cost required for address computation,
however, for generic type uses there is no way to save the computation
in "address expression".  Once unrolled, the computation is always
there?

And what's the impact on targets supporting [base + index + offset]
addressing mode?

Given the patch is not likely to harm because rtl loop unrolling is
(or was?) by default disabled, so I am OK once the above comments are
addressed.

I wonder if it's possible to get 10% of the loops (all those which should be
unrolled) unrolled conservatively on gimple and enable that by default at
O3, rather than teaching ivopts to model a future pass which is not
likely to be used outside of benchmarks?

Thanks,
bin

> +
>   if (data->consider_all_candidates)
> {
>   group->cost_map[cand->id].cand = cand;

On Thu, May 28, 2020 at 8:24 PM Kewen.Lin  wrote:
>
>
> gcc/ChangeLog
>
> 2020-MM-DD  Kewen Lin  
>
> * tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p.
> (struct iv_cand): New field reg_offset_p.
> (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
> (dump_groups): Dump group with reg_offset_p.
> (record_group): Initialize reg_offset_p.
> (mark_reg_offset_groups): New function.
> (find_interesting_uses): Call mark_reg_offset_groups.
> (add_candidate_1): Update reg_offset_p if derived from reg_offset_p 
> group.
> (set_group_iv_cost): Scale up group cost with estimate_unroll_factor 
> if
> consider_reg_offset_for_unroll_p.
> (determine_iv_cost): Increase step cost with estimate_unroll_factor if
> consider_reg_offset_for_unroll_p.
> (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
> consider_reg_offset_for_unroll_p.
>
> 


Re: [PATCH v2] genemit.c (main): split insn-emit.c for compiling parallelly

2020-07-21 Thread Bin.Cheng via Gcc-patches
On Tue, Jul 21, 2020 at 11:14 AM Jojo  wrote:
>
> gcc/ChangeLog:
>
> * genemit.c (main): Print 'split line'.
> * Makefile.in (insn-emit.c): Define split count and file
>

Thanks for working on this; the following comments are based on the
assumption that the approach is feasible after your investigation.

It's great to accelerate compilation; do you have any numbers
showing how much this can achieve on a typical machine with
reasonable parallelism?

> ---
>  gcc/Makefile.in | 10 ++
>  gcc/genemit.c   | 86 +++--
>  2 files changed, 58 insertions(+), 38 deletions(-)
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 2ba76656dbf..f805050a119 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1253,6 +1253,13 @@ ANALYZER_OBJS = \
>  # We put the *-match.o and insn-*.o files first so that a parallel make
>  # will build them sooner, because they are large and otherwise tend to be
>  # the last objects to finish building.
> +
> +insn-generated-split-num = 15
The hardcoded number 15 looks strange here; how shall we know it's enough?
Or one step further, can we use some kind of general match "*" for
writing make rules here?
> +
> +insn-emit-split-c = $(foreach o, $(shell for i in 
> {1..$(insn-generated-split-num)}; do echo $$i; done), insn-emit$(o).c)
> +insn-emit-split-obj = $(patsubst %.c,%.o, $(insn-emit-split-c))
> +$(insn-emit-split-c): insn-emit.c
> +
>  OBJS = \
> gimple-match.o \
> generic-match.o \
> @@ -1260,6 +1267,7 @@ OBJS = \
> insn-automata.o \
> insn-dfatab.o \
> insn-emit.o \
> +   $(insn-emit-split-obj) \
> insn-extract.o \
> insn-latencytab.o \
> insn-modes.o \
> @@ -2367,6 +2375,8 @@ $(simple_generated_c:insn-%.c=s-%): s-%: 
> build/gen%$(build_exeext)
> $(RUN_GEN) build/gen$*$(build_exeext) $(md_file) \
>   $(filter insn-conditions.md,$^) > tmp-$*.c
> $(SHELL) $(srcdir)/../move-if-change tmp-$*.c insn-$*.c
> +   -csplit insn-$*.c /i\ am\ split\ line/ -k -s 
> {$(insn-generated-split-num)} -f insn-$* -b "%d.c"
> +   -( [ ! -s insn-$*0.c ] && for i in {1..$(insn-generated-split-num)}; 
> do touch insn-$*$$i.c; done && echo "" > insn-$*.c)
Not sure if this is the first time that csplit/coreutils is used;
shall we mention it at
https://gcc.gnu.org/install/prerequisites.html, or even check for it in
configure?
> $(STAMP) s-$*
>
>  # gencheck doesn't read the machine description, and the file produced
> diff --git a/gcc/genemit.c b/gcc/genemit.c
> index 84d07d388ee..fd60cdeeb96 100644
> --- a/gcc/genemit.c
> +++ b/gcc/genemit.c
> @@ -847,6 +847,45 @@ handle_overloaded_gen (overloaded_name *oname)
>  }
>  }
>
> +#define printf_include() \
> +  printf ("/* Generated automatically by the program `genemit'\n\
> +from the machine description file `md'.  */\n\n"); \
> +  printf ("#define IN_TARGET_CODE 1\n"); \
> +  printf ("#include \"config.h\"\n"); \
> +  printf ("#include \"system.h\"\n"); \
> +  printf ("#include \"coretypes.h\"\n"); \
> +  printf ("#include \"backend.h\"\n"); \
> +  printf ("#include \"predict.h\"\n"); \
> +  printf ("#include \"tree.h\"\n"); \
> +  printf ("#include \"rtl.h\"\n"); \
> +  printf ("#include \"alias.h\"\n"); \
> +  printf ("#include \"varasm.h\"\n"); \
> +  printf ("#include \"stor-layout.h\"\n"); \
> +  printf ("#include \"calls.h\"\n"); \
> +  printf ("#include \"memmodel.h\"\n"); \
> +  printf ("#include \"tm_p.h\"\n"); \
> +  printf ("#include \"flags.h\"\n"); \
> +  printf ("#include \"insn-config.h\"\n"); \
> +  printf ("#include \"expmed.h\"\n"); \
> +  printf ("#include \"dojump.h\"\n"); \
> +  printf ("#include \"explow.h\"\n"); \
> +  printf ("#include \"emit-rtl.h\"\n"); \
> +  printf ("#include \"stmt.h\"\n"); \
> +  printf ("#include \"expr.h\"\n"); \
> +  printf ("#include \"insn-codes.h\"\n"); \
> +  printf ("#include \"optabs.h\"\n"); \
> +  printf ("#include \"dfp.h\"\n"); \
> +  printf ("#include \"output.h\"\n"); \
> +  printf ("#include \"recog.h\"\n"); \
> +  printf ("#include \"df.h\"\n"); \
> +  printf ("#include \"resource.h\"\n"); \
> +  printf ("#include \"reload.h\"\n"); \
> +  printf ("#include \"diagnostic-core.h\"\n"); \
> +  printf ("#include \"regs.h\"\n"); \
> +  printf ("#include \"tm-constrs.h\"\n"); \
> +  printf ("#include \"ggc.h\"\n"); \
> +  printf ("#include \"target.h\"\n\n"); \

Can you use do {} while(0) style for defining this code block macro?
The trailing '\' is also strange here.
> +
>  int
>  main (int argc, const char **argv)
>  {
> @@ -862,49 +901,19 @@ main (int argc, const char **argv)
>/* Assign sequential codes to all entries in the machine description
>   in parallel with the tables in insn-output.c.  */
>
> -  printf ("/* Generated automatically by the program `genemit'\n\
> -from the machine description file `md'.  */\n\n");
> -
> -  printf ("#define IN_TARGET_CODE 1\n");
> -  printf ("#include \"config.h\"\n");
> -  
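
For reference, a quick hedged demonstration of how the csplit invocation in the
Makefile hunk behaves (GNU coreutils csplit; the file names here are made up):

  $ printf 'a\ni am split line\nb\ni am split line\nc\n' > t.c
  $ csplit -s -k -f part -b '%d.c' t.c '/i am split line/' '{1}'
  $ head -n 2 part*.c
  ==> part0.c <==
  a

  ==> part1.c <==
  i am split line
  b

  ==> part2.c <==
  i am split line
  c

The -k flag appears to matter because csplit errors out when the repeat count
exceeds the number of matches, which is why the rule pairs it with the touch
fallback for the missing insn-emit*.c files.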

[PATCH PR95804]Force reduction partition to be scheduled in the last

2020-07-07 Thread bin.cheng via Gcc-patches
Hi,
This is a followup fix for PR95638, which changed the way post order numbers
are maintained for the partition graph.  It missed one case: when the SCC of
the reduction partition is broken by runtime alias checks, we still need to
make sure the reduction partition is scheduled last.  This patch does this by
forcing a negative post order number on it.

Bootstrap and test on x86_64, is it OK?

Thanks,
bin

pr95804.patch


[PATCH PR95638]Record/restore postorder, rather than update it

2020-06-15 Thread bin.cheng via Gcc-patches
Hi,
This simple patch fixes the wrong-code issue as reported.  I tried to update
the postorder information after the second call to graphds_scc with alias
dependence edges skipped.  This wasn't working, and I realized it's hard to
do.  This patch simply records the postorder information before the call and
restores it after.  It also fixes a memory leak.
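
In outline, the record-and-restore idea looks like the following hedged sketch
(generic C with hypothetical names, not the actual tree-loop-distribution.c
code):

  /* Save every vertex's post order number before the SCC recomputation
     that skips alias dependence edges, then put the saved numbers back,
     and free the scratch array (the leak fixed here).  */
  int n = pg->n_vertices;
  int *saved_post = XNEWVEC (int, n);
  for (int i = 0; i < n; i++)
    saved_post[i] = pg->vertices[i].post;   /* record */

  compute_sccs_skipping_alias_edges (pg);   /* hypothetical; clobbers post */

  for (int i = 0; i < n; i++)
    pg->vertices[i].post = saved_post[i];   /* restore */
  free (saved_post);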

Bootstrap and test on x86_64. Comments?

Thanks,
bin

2020-06-15  Bin Cheng  

gcc/
PR tree-optimization/95638
* tree-loop-distribution.c (pg_edge_callback_data): New field.
(loop_distribution::break_alias_scc_partitions): Record and restore
postorder information.  Fix memory leak.

gcc/testsuite/
PR tree-optimization/95638
* g++.dg/tree-ssa/pr95638.C: New test.

pr95638.txt


Re: [PATCH PR94969]Add unit distance vector to DDR in case of invariant access functions

2020-05-13 Thread Bin.Cheng via Gcc-patches
On Thu, May 14, 2020 at 1:46 AM Jakub Jelinek via Gcc-patches
 wrote:
>
> On Wed, May 13, 2020 at 02:00:11PM +0200, Christophe Lyon via Gcc-patches 
> wrote:
> > > > 2020-05-11  Bin Cheng  
> > > >
> > > > PR tree-optimization/94969
> > > > * gcc.dg/tree-ssa/pr94969.c: New test.
> >
> > The new test fails on arm and aarch64 and probably everywhere:
> > gcc.dg/tree-ssa/pr94969.c: dump file does not exist
> > UNRESOLVED: gcc.dg/tree-ssa/pr94969.c scan-tree-dump-not Loop 1
> > distributed: split to 3 loops "ldist"
> >
> > Can you fix this?
>
> Seems a mere swapping of the scan-tree-dump-not args, I've verified the
> test passes after this change and fails with older trunk, committed to trunk
> as obvious.
Oh, sorry for the breakage, and thanks for fixing this.

Thanks,
bin
>
> 2020-05-13  Jakub Jelinek  
>
> PR testsuite/95110
> * gcc.dg/tree-ssa/pr94969.c: Swap scan-tree-dump-not arguments.
>
> --- gcc/testsuite/gcc.dg/tree-ssa/pr94969.c.jj  2020-05-13 09:24:36.959012780 
> +0200
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr94969.c 2020-05-13 19:13:53.664499322 
> +0200
> @@ -25,4 +25,4 @@ int main()
>  __builtin_abort ();
>  }
>
> -/* { dg-final { scan-tree-dump-not "ldist" "Loop 1 distributed: split to 3 
> loops"} } */
> +/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 3 loops" 
> "ldist" } } */
>
> Jakub
>


[PATCH PR94969]Add unit distance vector to DDR in case of invariant access functions

2020-05-10 Thread bin.cheng via Gcc-patches
Hi,
As analyzed in PR94969, data dependence analysis now misses the dependence
vector for a specific case in which the DRs in a DDR have the same invariant
access functions.  This simple patch fixes the issue by also covering
invariant cases.  Bootstrap and test on x86_64, is it OK?
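
A hedged illustration of the shape involved (not the committed
gcc.dg/tree-ssa/pr94969.c): both references to a[k] have the same
loop-invariant access function, and without a distance vector on that DDR,
loop distribution could split the read and the write into separate loops and
change the result:

  int a[16], b[16];

  void
  f (int k, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[k] += 1;     /* invariant write */
        b[i] = a[k];   /* must see the i-th increment, not the final value */
      }
  }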

Thanks,
bin

2020-05-11  Bin Cheng  

PR tree-optimization/94969
* tree-data-ref.c (constant_access_functions): Rename to...
(invariant_access_functions): ...this.  Add parameter.  Check for
invariant access function, rather than constant.
(build_classic_dist_vector): Call above function.
* tree-loop-distribution.c (pg_add_dependence_edges): Add comment.

gcc/testsuite
2020-05-11  Bin Cheng  

PR tree-optimization/94969
* gcc.dg/tree-ssa/pr94969.c: New test.

pr94969-20200511.txt


Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of -fstrict-enums

2020-04-08 Thread bin.cheng via Gcc-patches
--
Sender:Richard Biener 
Sent At:2020 Mar. 20 (Fri.) 18:12
Recipient:bin.cheng 
Cc:Andrew Pinski ; GCC Patches 
Subject:Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of 
-fstrict-enums


On Fri, Mar 20, 2020 at 10:27 AM bin.cheng  wrote:
>
> --
> Sender:Richard Biener 
> Sent At:2020 Mar. 3 (Tue.) 17:36
> Recipient:Andrew Pinski 
> Cc:bin.cheng ; GCC Patches 
> 
> Subject:Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of 
> -fstrict-enums
>
>
> On Mon, Mar 2, 2020 at 6:14 PM Andrew Pinski  wrote:
> >
> > On Mon, Mar 2, 2020 at 1:40 AM Richard Biener
> >  wrote:
> > >
> > > On Mon, Mar 2, 2020 at 9:07 AM bin.cheng  
> > > wrote:
> > > >
> > > > Hi,
> > > > This is a simple fix for PR93674.  It adds cand carefully for enumeral 
> > > > type iv_use in
> > > > case of -fstrict-enums, it also avoids computing, replacing iv_use with 
> > > > the candidate
> > > > so that no IV of enumeral type is introduced with -fstrict-enums option.
> > > >
> > > > Testcase is also added.  Bootstrap and test on x86_64.  Any comment?
> > >
> > > I think we should avoid enum-typed (or bool-typed) IVs in general, not 
> > > just
> > > with -fstrict-enums.  That said, the ENUMERAL_TYPE checks should be
> > > !(INTEGER_TYPE || POINTER_TYPE_P) checks.
> >
> > Maybe even check type_has_mode_precision_p or
> > TYPE_MIN_VALUE/TYPE_MAX_VALUE have the same as the min/max for that
> > precision/signedness.
>
> Indeed we don't want non-mode precision INTEGER_TYPE IVs either.  I wouldn't
> check TYPE_MIN/MAX_VALUE here though.
>
> Here is the updated patch with more checks.  I also removed the 
> computation/elimination part for now.

+  if ((TREE_CODE (basetype) != INTEGER_TYPE && !POINTER_TYPE_P (basetype))
+  || !type_has_mode_precision_p (basetype))
+{
+  if (!CONVERT_EXPR_P (iv->base))
+   return;
+
+  /* Specially handle such iv_use if it's converted from other ones by
+     adding candidate for the source iv.  */
+  base = TREE_OPERAND (iv->base, 0);
+  basetype = TREE_TYPE (base);
+  if (!INTEGRAL_TYPE_P (basetype))
+   return;

I think this is too lax - you'll happily add another non-INTEGER_TYPE
or non-mode-precision IV through this handling.  If the conversion
is not value-preserving the IV may also be useless (also the step
might see truncation in its type adjustment).  With a widening
conversion there's the issue with different wrap-around behavior.
I guess in most cases we'll not be able to use the IV if it is "bogus"
so all this might be harmless.
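
(As a minimal standalone illustration of the differing wrap-around behaviour
mentioned above, an editorial sketch rather than part of the patch: the 8-bit
IV wraps while the widened copy derived from it keeps increasing, so the two
fall out of lock step.)

#include <stdio.h>

int
main (void)
{
  unsigned char narrow = 250;   /* 8-bit IV: wraps 255 -> 0 */
  unsigned int wide = narrow;   /* widened copy: keeps counting upward */

  for (int k = 0; k < 10; k++)
    {
      if ((unsigned int) narrow != wide)
	printf ("diverged at step %d: narrow=%u wide=%u\n",
		k, (unsigned int) narrow, wide);
      narrow++;   /* wraps modulo 256 */
      wide++;     /* does not wrap here */
    }
  return 0;
}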

The original idea was to rely on the condition that the result won't be a
legal IV in the widening case.  For narrowing it isn't a problem, except for
the non-INTEGER_TYPE or non-mode-precision issue you mentioned.

Was there any reason to handle this special case?  Why not simply
do sth like
The reason is to strength-reduce the (conversion) operation on the RHS by
introducing an appropriate candidate.
And right, your proposed code looks better and more general.

+  if ((TREE_CODE (basetype) != INTEGER_TYPE && !POINTER_TYPE_P (basetype))
+      || !type_has_mode_precision_p (basetype))
+{
 basetype = lang_hooks.type_for_mode (TYPE_MODE (basetype),
TYPE_UNSIGNED (basetype));
 tree step = fold_convert (basetype, iv->step);
 add_candidate (...);

?

The following also looks bogus:

+  if (basetype == generic_type_for (basetype))
+   step = fold_convert (basetype, step);

don't we add IVs in generic_type_for () and thus the conversion is
always necessary
anyways?  (it has the wrong type from the conversion we just stripped)

This is now discarded.
Patch updated and tested on x86_64.  Is it OK?

Thanks,
bin

2020-04-09  Bin Cheng  
Richard Biener  

PR tree-optimization/93674
* tree-ssa-loop-ivopts.c (langhooks.h): New include.
(add_iv_candidate_for_use): For iv_use of non integer or pointer type,
or non-mode precision type, add candidate in unsigned type with the
same precision.

gcc/testsuite
2020-04-09  Bin Cheng  

PR tree-optimization/93674
* g++.dg/pr93674.C: New test.

0001-Add-unsigned-type-iv_cand-for-iv_use-with-non-mode-p.patch
Description: Binary data
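
To make the problem concrete, a hedged sketch of an enum-typed IV (an
illustration in the spirit of the added g++.dg/pr93674.C test, not a copy of
it): with -fstrict-enums the compiler may assume a value of enum type stays
within the minimal range needed for its enumerators, so an IV kept in type E
that steps past B would be unsafe, which is why the candidate is added in an
unsigned type of the same precision instead.

enum E { A = 0, B = 3 };  /* -fstrict-enums: values assumed within [0, 3] */

int
count_up (int n)
{
  int sum = 0;
  /* 'e' gives rise to an enum-typed iv_use; for n > 4 an IV kept in type E
     would step outside the enumeration's assumed value range.  */
  for (E e = A; (int) e < n; e = (E) (e + 1))
    sum += (int) e;
  return sum;
}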


Re: [PING PATCH coroutines] Do not strip cleanup_point when promote temporaries out of current stmt

2020-04-08 Thread Bin.Cheng via Gcc-patches
On Tue, Apr 7, 2020 at 11:45 PM Iain Sandoe  wrote:
>
> Bin.Cheng  wrote:
>
> > On Mon, Mar 16, 2020 at 5:34 PM Bin.Cheng  wrote:
> >> On Wed, Mar 11, 2020 at 5:07 PM Iain Sandoe  wrote:
> >>> Bin.Cheng  wrote:
> >>>
>  On Thu, Mar 5, 2020 at 10:18 PM Iain Sandoe  wrote:
>
> >>> If it helps, I could push a branch to users/iains/ on the FSF git server 
> >>> with this sequence.
> >> Sorry for being slow replying.  This is weird, were you testing against 
> >> trunk?
>
> 1/ @Jun Ma
>
>   - I think your observation is correct; we should ensure that cleanups are
> applied where needed to any temps that remain after we've promoted those
> captured by reference.
>
>   However, I would prefer that we do this as required, rather than assuming
> it will be needed, so I have an updated patch below.
Sorry, it's been quite a while and I don't remember the details of this
change.  Since this new version fixes the ICE too, please go ahead and
commit it.
>
> 2/ @ Bin / Jun Ma
>
>   - The problem is that the testcase seems to be fragile, perhaps it was a 
> c-reduced one?
It's strange; however, I don't see how being c-reduced would have an impact
in this scenario.
>
> So… I do not propose to add this test-case at this point (perhaps we could
> construct one that actually tests the required behaviour - that cleanups are
> still run correctly for temps that are not promoted by capture)?
>
> Anyway to avoid further delay, I think we should apply the patch below (I 
> have other fixes on top of this
> for open PRs)
>
> OK for master?
Yes, please.  Also, one of the approved patches is waiting for this one.

Thanks,
bin
> thanks
> Iain
>
> coroutines: Add cleanups, where required, to statements with captured 
> references.
>
> When we promote captured temporaries to local variables, we also
> remove their initializers from the relevant call expression.  This
> means that we should recompute the need for a cleanup expression
> once the set of temporaries that remains becomes known.
>
> gcc/cp/ChangeLog:
>
> 2020-04-07  Iain Sandoe  
> Jun Ma  
>
> * coroutines.cc (maybe_promote_captured_temps): Add a
> cleanup expression, if needed, to any call from which
> we promoted temporaries captured by reference.
>
> diff --git a/gcc/cp/coroutines.cc b/gcc/cp/coroutines.cc
> index 983fa650b55..936be06c336 100644
> --- a/gcc/cp/coroutines.cc
> +++ b/gcc/cp/coroutines.cc
> @@ -2798,11 +2798,13 @@ maybe_promote_captured_temps (tree *stmt, void *d)
>location_t sloc = EXPR_LOCATION (*stmt);
>tree aw_bind
> = build3_loc (sloc, BIND_EXPR, void_type_node, NULL, NULL, NULL);
> -  tree aw_statement_current;
> -  if (TREE_CODE (*stmt) == CLEANUP_POINT_EXPR)
> -   aw_statement_current = TREE_OPERAND (*stmt, 0);
> -  else
> -   aw_statement_current = *stmt;
> +
> +  /* Any cleanup point expression might no longer be necessary, since we
> +     are removing one or more temporaries.  */
> +  tree aw_statement_current = *stmt;
> +  if (TREE_CODE (aw_statement_current) == CLEANUP_POINT_EXPR)
> +   aw_statement_current = TREE_OPERAND (aw_statement_current, 0);
> +
>/* Collected the scope vars we need move the temps to regular. */
>tree aw_bind_body = push_stmt_list ();
>tree varlist = NULL_TREE;
> @@ -2843,8 +2845,12 @@ maybe_promote_captured_temps (tree *stmt, void *d)
>   /* Replace all instances of that temp in the original expr.  */
> >   cp_walk_tree (&aw_statement_current, replace_proxy, &pr, NULL);
> }
> -  /* What's left should be the original statement with any temporaries
> -broken out.  */
> +
> +  /* What's left should be the original statement with any co_await
> +     captured temporaries broken out.  Other temporaries might remain
> +     so see if we need to wrap the revised statement in a cleanup.  */
> +  aw_statement_current =
> +   maybe_cleanup_point_expr_void (aw_statement_current);
>add_stmt (aw_statement_current);
>BIND_EXPR_BODY (aw_bind) = pop_stmt_list (aw_bind_body);
>awpts->captured_temps.empty ();
>
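
(As a concrete, hypothetical illustration of the situation the patch handles,
not the PR testcase: an awaiter that binds a temporary by reference forces
that temporary to be promoted into the coroutine frame, while the statement
that remains may still own other temporaries and hence still needs its
cleanup point.  Compiles with -std=c++20, plus -fcoroutines on GCC 10.)

#include <coroutine>
#include <string>

struct task {
  struct promise_type {
    task get_return_object () { return {}; }
    std::suspend_never initial_suspend () { return {}; }
    std::suspend_never final_suspend () noexcept { return {}; }
    void return_void () {}
    void unhandled_exception () {}
  };
};

struct awaiter {
  const std::string &tag;  /* binds a temporary by reference across the await */
  bool await_ready () { return true; }
  void await_suspend (std::coroutine_handle<>) {}
  void await_resume () {}
};

std::string make_tag () { return "tag"; }

task
demo ()
{
  /* The std::string temporary is captured by reference, so it is promoted
     out of this statement; whatever temporaries remain still need to be
     wrapped in a cleanup point.  */
  co_await awaiter { make_tag () };
}

int main () { demo (); }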


Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of -fstrict-enums

2020-03-20 Thread bin.cheng via Gcc-patches
--
Sender:Richard Biener 
Sent At:2020 Mar. 3 (Tue.) 17:36
Recipient:Andrew Pinski 
Cc:bin.cheng ; GCC Patches 

Subject:Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of 
-fstrict-enums


On Mon, Mar 2, 2020 at 6:14 PM Andrew Pinski  wrote:
>
> On Mon, Mar 2, 2020 at 1:40 AM Richard Biener
>  wrote:
> >
> > On Mon, Mar 2, 2020 at 9:07 AM bin.cheng  
> > wrote:
> > >
> > > Hi,
> > > This is a simple fix for PR93674.  It carefully adds candidates for
> > > enumeral-type iv_uses in case of -fstrict-enums; it also avoids
> > > computing or replacing an iv_use with such a candidate, so that no IV
> > > of enumeral type is introduced with the -fstrict-enums option.
> > >
> > > A testcase is also added.  Bootstrapped and tested on x86_64.  Any comment?
> >
> > I think we should avoid enum-typed (or bool-typed) IVs in general, not just
> > with -fstrict-enums.  That said, the ENUMERAL_TYPE checks should be
> > !(INTEGER_TYPE || POINTER_TYPE_P) checks.
>
> Maybe even check type_has_mode_precision_p or
> TYPE_MIN_VALUE/TYPE_MAX_VALUE have the same as the min/max for that
> precision/signedness.

Indeed we don't want non-mode precision INTEGER_TYPE IVs either.  I wouldn't
check TYPE_MIN/MAX_VALUE here though.

Here is the updated patch with more checks.  I also removed the 
computation/elimination part for now.

Thanks,
bin
2020-03-19  Bin Cheng  

PR tree-optimization/93674
* tree-ssa-loop-ivopts.c (add_iv_candidate_for_use): Don't add iv_cand
for iv_use which is not integer or pointer type, or non-mode precision
type.  For such cases, add iv_cand for source operand if it's
converted from other one.
(get_computation_cost, may_eliminate_iv): Avoid compute, eliminate
iv_use with enumeral type iv_cand in case of -fstrict-enums.

gcc/testsuite
2020-03-19  Bin Cheng  

PR tree-optimization/93674
* g++.dg/pr93674.C: New test.

0001-Don-t-add-iv_cand-for-iv_use-with-non-mode-precision.patch
Description: Binary data
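
In outline, the fallback the ChangeLog describes might look like the hedged
sketch below (GCC-internal pseudo-code: the add_candidate call and the
surrounding context are my assumptions, not an excerpt from the attached
patch).

/* Sketch only: for an iv_use whose base type is neither integer nor
   pointer, or lacks mode precision, build the candidate in the unsigned
   integer type of the same mode instead of in the use's own type.  */
if ((TREE_CODE (basetype) != INTEGER_TYPE && !POINTER_TYPE_P (basetype))
    || !type_has_mode_precision_p (basetype))
  {
    basetype = lang_hooks.type_for_mode (TYPE_MODE (basetype),
					 TYPE_UNSIGNED (basetype));
    tree base = fold_convert (basetype, iv->base);
    tree step = fold_convert (basetype, iv->step);
    add_candidate (data, base, step, false, use);
    return;
  }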


Re: [PING PATCH coroutines] Do not strip cleanup_point when promote temporaries out of current stmt

2020-03-16 Thread Bin.Cheng via Gcc-patches
On Wed, Mar 11, 2020 at 5:07 PM Iain Sandoe  wrote:
>
> Bin.Cheng  wrote:
>
> > On Thu, Mar 5, 2020 at 10:18 PM Iain Sandoe  wrote:
>
> >> With current trunk + Bin’s two approved patches.
> >>
> >> I see no change in the testcase (lambda-09-capture-object.C) before / after
> >> the patch
> >>  (it fails for me at -O0 only - in both cases).
> >
>
> > I tried exactly what you did, however, the result is different.
>
> I am a bit concerned that we get different results - yesterday I retested 
> this on:
>   Linux : x86_64-linux (cfarm gcc122)
>   Darwin (x86_64-darwin16),
>  with the same results on both platforms.
>
> 1) I applied the two testcases (note I have renamed 
> lambda-09-capture-object.C => lambda-11-capture-object.C so that the test 
> numbers are sequential).  However, I have not changed the test in any other 
> way.
>
> Native configuration is x86_64-pc-linux-gnu
>
> === g++ tests ===
>
> Schedule of variations:
> unix/-m32
> unix/-m64
>
> Running target unix/-m32
> Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> target.
> Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
> Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp as 
> tool-and-target-specific interface file.
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
>  ...
> FAIL: g++.dg/coroutines/co-await-syntax-10.C (internal compiler error)
> FAIL: g++.dg/coroutines/co-await-syntax-10.C (test for excess errors)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal compiler 
> error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> errors)
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/torture/coro-torture.exp
>  ...
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (internal 
> compiler error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (test for 
> excess errors)
>
> Running target unix/-m64
> Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> target.
> Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
> Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp as 
> tool-and-target-specific interface file.
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
>  ...
> FAIL: g++.dg/coroutines/co-await-syntax-10.C (internal compiler error)
> FAIL: g++.dg/coroutines/co-await-syntax-10.C (test for excess errors)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal compiler 
> error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> errors)
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/torture/coro-torture.exp
>  ...
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (internal 
> compiler error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (test for 
> excess errors)
>
> ^^ so, this shows that both tests fail (co-await-syntax-10.C is expected)
>
> 2) I apply Bin’s patch (Pickup more CO_AWAIT_EXPR expanding cases) (which is 
> approved)
>
> Native configuration is x86_64-pc-linux-gnu
>
> === g++ tests ===
>
> Schedule of variations:
> unix/-m32
> unix/-m64
>
> Running target unix/-m32
> Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> target.
> Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
> Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp as 
> tool-and-target-specific interface file.
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
>  ...
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal compiler 
> error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> errors)
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/torture/coro-torture.exp
>  ...
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (internal 
> compiler error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (test for 
> excess errors)
>
> Running target unix/-m64
> Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> target.
> Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
> Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp as 
> tool-and-target-specific interface file.
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
>  ...
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal compiler 
> error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> errors)
> Running 
> 

[PATCH PR94125]Update post order number for merged SCC

2020-03-15 Thread bin.cheng via Gcc-patches
Hi,
This simple patch fixes PR94125 by updating the post-order number of a merged SCC.
The root cause is that, after computing SCCs with runtime alias edges skipped,
the post-order info changes, and a partition can end up scheduled after another
one that it should precede.  Note that updating to the minimal post-order number
is safe: only partitions that were separated from the SCC because of the skipped
runtime alias edges can be assigned a larger post-order number than the original
preceding one.

Bootstrapped and tested on x86_64; a test case is also added.

Thanks,
bin

2020-03-15  Bin Cheng  

PR tree-optimization/94125
* tree-loop-distribution.c
(loop_distribution::break_alias_scc_partitions): Update post order
number for merged SCC.

gcc/testsuite
2020-03-15  Bin Cheng  

PR tree-optimization/94125
* gcc.dg/tree-ssa/pr94125.c: New test.

0001-pr94125.patch
Description: Binary data
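
In outline, the fix amounts to the hedged sketch below (a simplified,
array-based illustration of the idea, not the actual tree-loop-distribution.c
code): every member of the merged SCC receives the minimum post-order number
among the members, so the merged partition still precedes every partition
that must come after it.

#include <limits.h>

/* post_order[p] is the post-order number of partition p; scc[] lists the
   n partitions merged back together after runtime alias edges were
   skipped during SCC computation.  */
static void
update_merged_scc_post_order (int *post_order, const int *scc, int n)
{
  int min_order = INT_MAX;
  for (int i = 0; i < n; i++)
    if (post_order[scc[i]] < min_order)
      min_order = post_order[scc[i]];

  /* Assign the minimal number to every member of the merged SCC.  */
  for (int i = 0; i < n; i++)
    post_order[scc[i]] = min_order;
}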