Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1

Xionghu Luo via Gcc-patches Mon, 30 Aug 2021 01:49:45 -0700


On 2021/8/27 15:45, Richard Biener wrote:

On Thu, 26 Aug 2021, Xionghu Luo wrote:



On 2021/8/24 16:20, Richard Biener wrote:

On Tue, 24 Aug 2021, Xionghu Luo wrote:



On 2021/8/19 20:11, Richard Biener wrote:

-  class loop *inn_loop = loop;

if (ALWAYS_EXECUTED_IN (loop->header) == NULL)

        {
@@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap 
contains_call)
             to disprove this if possible).  */
          if (bb->flags & BB_IRREDUCIBLE_LOOP)
            break;
-
-         if (!flow_bb_inside_loop_p (inn_loop, bb))
-           break;
-
-         if (bb->loop_father->header == bb)
-           {
-             if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-               break;
-
-             /* In a loop that is always entered we may proceed anyway.
-                But record that we entered it and stop once we leave it.  */
-             inn_loop = bb->loop_father;
-           }
        }

while (1)

I'm not sure this will work correct (I'm not sure how the existing
code makes it so either...).  That said, I can't poke any hole
into the change.  What I see is that definitely

             if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
               last = bb;

             if (bitmap_bit_p (contains_call, bb->index))
               break;

doesn't work reliably since the DOM ordering will process blocks
A B and C in random order for

     for (;;)
      {
         if (cond)
           {
             A: foo ();
           }
         else B:;
         C:;
      }

and thus we can end up setting 'last' to C_before_  processing
'A' and thus arriving at the call foo () ...

get_loop_body_in_dom_order does some "special sauce" but not
to address the above problem - but it might be that a subtle
issue like the above is the reason for the inner loop handling.
The inner loop block order does_not_  adhere to this "special sauce",
that is - the "Additionally, if a basic block s dominates
the latch, then only blocks dominated by s are be after it."
guarantee holds for the outer loop latch, not for the inner.

Digging into the history of fill_always_executed_in_1 doesn't
reveal anything - the inner loop handling has been present
since introduction by Zdenek - but usually Zdenek has a reason
for doing things as he does;)


Yes, this is really complicated usage, thanks for point it out. :)
I constructed two cases to verify this with inner loop includes "If A; else B; 
C".
Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks
whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch,
C is postponed, the access order is ABC, 'last' won’t be set to C if A or B 
contains call;


But it depends on the order of visiting ABC and that's hard to put into
a testcase since it depends on the order of edges and the processing
of the dominance computation.  ABC are simply unordered with respect
to a dominator walk.

Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop,
the access order is CAB, but 'last' also won’t be updated to C in 
fill_always_executed_in_1
since there is also dominate check, then if A or B contains call, it could break
successfully.

C won't be set to ALWAYS EXECUTED for both circumstance.


Note it might be simply a measure against quadratic complexity,
esp. since with your patch we also dive into not always executed
subloops as you remove the

                 if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
                   break;

check.  I suggest to evaluate behavior of the patch on a testcase
like

void foo (int n, int **k)
{
     for (int i = 0; i < n; ++i)
       if (k[0][i])
         for (int j = 0; j < n; ++j)
           if (k[1][j])
             for (int l = 0; l < n; ++l)
               if (k[2][l])
                 ...
}


Theoretically the complexity is changing from L1(bbs) to 
L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs),
so fill_always_executed_in_1's execution time is supposed to be increase from
O(n) to O(n2)?  The time should depend on loop depth and bb counts.   I also 
drafted a
test case has 73-depth loop function with 25 no-ipa function copies each 
compiled
in lim2 and lim4 dependently.  Total execution time of 
fill_always_executed_in_1 is
increased from 32ms to 58ms, almost doubled but not quadratic?


It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still
O(n^2).

It seems reasonable to see compiling time getting longer since most bbs are 
checked
more but a MUST to ensure early break correctly in every loop level...
Though loop nodes could be huge, loop depth will never be so large in actual 
code?


The "in practice" argument is almost always defeated by automatic
program generators ;)

I suspect you'll see quadratic behavior with your patch. You

should be at least able to preserve a check like

             /* Do not process not always executed subloops to avoid
                quadratic behavior.  */
             if (bb->loop_father->header == bb
                 && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
               break;

which is of course not optimistic for cases like

     for (..)
      {
        if (cond)
          for (..)
        x = 1; // this is always executed if the inner loop is finite
      }

but we need to have an eye on the complexity of this function.  I would
have suggested to do greedy visiting of the loop header successors,
processing further blocks if all entries (ignoring backedges) are
processed, setting SET_ALWAYS_EXECUTED_IN.  When the worklist
is empty proceed to inner loops as the current code does.  For
bookkeeping simply keep a to-visit-incoming-edges counter per BB.
Pseudo-code:

     bitmap_set_bit (worklist, loop-header-bb);
     while (!bitmap_empty_p (worklist))
       {
         bb = pop (worklist);


Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN?


Ah, sure.

         SET_ALWAYS_EXECUTED_IN (bb, loop);
         if (bitmap_bit_p (contains_call, bb->index))
           continue;
         FOR_EACH_EDGE (e, ei, bb->succs)
           {
             if (!flow_bb_inside_loop_p (loop, e->dest))
               continue;
             if (incoming_count[e->dest->index]-- == 0)
               push (worklist, e->dest);
           }
       }


Sorry I don't fully understand your algorithm.  worklist should be
auto_vec<basic_block> don't support bitmap operations?  Is incoming_count
the bb's preds count, why need retain it since it it decreased to 0?


the worklist is a auto_bitmap, pop () would be
bitmap_first_set_bit/clear_bit.  One could use a vector with the
caveat of eventually adding duplicate members to the worklist.
But as said, it's pseudo-code ;)


Thanks a lot!

iterate over inner loops (incoming_count can be retained,
     we just force the inner loop header onto the worklist).


Is this same to ?

    for (loop = loop->inner; loop; loop = loop->next)
      fill_always_executed_in_1 (loop, contains_call)


Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which
should iterate over loops in PRE order.


Tried implementation with your pseudo-code, it works like the
previous v2 solution, bb after inner loops could be marked
as ALWAYS EXECUTED.  You are really so strong! ;)
Regression tested pass on P8LE.

Assume,

   A: GCC Base
   B: inner loop check removal
   C: incoming count

Execution time of fill_always_executed_in_1 for the 73-depth loop
(with five function copies) is
(A vs. B vs. C): 32ms vs 58ms vs 38ms.  Much better than O(n^2)?


Sorry didn't get C's execution time accurate enough due to forgot to remove
dump file code in it.
C's execution time is 25 ms after code refine.  It's even better than A.
Attached the test case ssa-lim-22.c used for time measurement.



Bumped the patch to v3 with TODOs:
  1. The time measurement code will be removed if the v3 is OK;
  2. loops_list is not used as my code is not rebased to upstream yet.
  3. Function comments is not updated.



[PATCH v3] Fix incomplete computation in fill_always_executed_in_1


ALWAYS_EXECUTED_IN is not computed completely for nested loops.
Current design will exit if an inner loop doesn't dominate outer
loop's latch or exit after exiting from inner loop, which
caused early return from outer loop, then ALWAYS EXECUTED blocks after
inner loops are skipped.  Another problem is it has to call
get_loop_body_in_dom_order in each recursive call which is also
inefficient.

This patch does greedy visiting of the loop header successors,
processing further blocks if all entries (ignoring backedges) are
processed, setting SET_ALWAYS_EXECUTED_IN of dominates loop's latch.
When the worklist is empty proceed to inner loops as the current
code does.  For bookkeeping simply keep a to-visit-incoming-edges
counter per BB, and pass it through iteration over inner loops.

Details discusions:
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html

gcc/ChangeLog:

        * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove
        inn_loop check.

gcc/testsuite/ChangeLog:

        * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
        * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
---
  gcc/tree-ssa-loop-im.c                     | 95 +++++++++++++---------
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++
  3 files changed, 111 insertions(+), 38 deletions(-)
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index b24bc64f2a7..72be6b6bfac 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3192,39 +3192,52 @@ do_store_motion (void)
     blocks that contain a nonpure call.  */

static void

-fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
+fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi)
  {
-  basic_block bb = NULL, *bbs, last = NULL;
-  unsigned i;
+  basic_block bb = NULL;
    edge e;
-  class loop *inn_loop = loop;

if (ALWAYS_EXECUTED_IN (loop->header) == NULL)

      {
-      bbs = get_loop_body_in_dom_order (loop);
+      auto_bitmap work_set;
+      bitmap_clear (work_set);


bitmap_clear isn't necessary


Removed.

+      bitmap_set_bit (work_set, loop->header->index);
+      unsigned bb_index;

- for (i = 0; i < loop->num_nodes; i++)

-       {
-         edge_iterator ei;
-         bb = bbs[i];
+      unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+      int *bbd = XNEWVEC (int, array_size);
+      bbd = XDUPVEC (int, bbi, array_size);


I don't think you need to copy 'bbi' but you can re-use the
state from the outer loop processing.  Did you run into any
issues with that?


Yes.  For example, adding a small if-else block to ssa-lim-19.c,
Then block "x->j += tem * i;" of bb 6 is always executed for loop 2, when call
fill_always_executed_in_1 for loop 1, bbi[6] is decreased from 2 to 1 to 0,
then if fill_always_executed_in_1 is called again for loop 2, it's value is not
reset so bbi[6] won't be set ALWAYS EXECUTE, this is wrong.


struct X { int i; int j; int k;};

void foo(struct X *x, int n, int l, int m)
{
 for (int j = 0; j < l; j++)  // loop 1
   {
     for (int i = 0; i < n; ++i)  // loop 2
       {
         if (m)
           x->j++;
         else
           x->j = m+n+l;

         int *p = &x->j;   // bb 6
         int tem = *p;
         x->j += tem * i;
       }
     int *r = &x->k;
     int tem2 = *r;
     x->k += tem2 * j;
   }
}

+      while (!bitmap_empty_p (work_set))
+       {
+         bb_index = bitmap_first_set_bit (work_set);
+         bitmap_clear_bit (work_set, bb_index);
+         bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
          if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-           last = bb;
-
+           SET_ALWAYS_EXECUTED_IN (bb, loop);
          if (bitmap_bit_p (contains_call, bb->index))
            break;


I think you want to continue; here (process remaining worklist
but not continue greedy walking this block)


Same as above, if use 'continue' instead of 'break', the algorithm
seems also not work again.  If inner loop contains a jump to outmost
loop, the blocks after the jump block will be set to ALWAYS EXECUTE
incorrectly.

-
+         edge_iterator ei;
          FOR_EACH_EDGE (e, ei, bb->succs)
            {
-             /* If there is an exit from this BB.  */
              if (!flow_bb_inside_loop_p (loop, e->dest))
                break;


in particular this should keep the outer 'bbi' valid to re-use.

But again, you want 'continue;' the greedy walk to other edges.
If that's not valid (I'd need to think about this) then with
your patch whether we process an edge depends on the order
of the edge visit so you'd have to walk successors twice,
once to determine whether we can greedily walk any of it
and once to actually do the greedy walk.

So thinking about it an exit edge is like a not returning call
and thus we indeed should not process any outgoing edges of
this block.

+
              /* Or we enter a possibly non-finite loop.  */
              if (flow_loop_nested_p (bb->loop_father,
                                      e->dest->loop_father)
                  && ! finite_loop_p (e->dest->loop_father))
                break;


I think this is no longer necessary?  In any case it would
again be 'continue;', see also above.

I'm missing skipping of backedges in the walk.

+
+             if (bbd[e->dest->index] == 1)
+             {
+               bitmap_set_bit (work_set, e->dest->index);
+               bbd[e->dest->index]--;
+             }
+             else if (bbd[e->dest->index] > 1)
+               bbd[e->dest->index]--;


maybe just

             bbd[e->dest->index]--;
             if (bbd[e->dest->index] == 0)
               bitmap_set_bit (work_set, e->dest->index);


They are not equivalent.  Former won't decrease if bbd[e->dest->index]
is 0, but the latter does.

            }
+
          if (e)
            break;


continue; processing the worklist

@@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
             to disprove this if possible).  */
          if (bb->flags & BB_IRREDUCIBLE_LOOP)
            break;


continue; processing the worklist.  I think this has to
come early, before processing successors, next to
the contains_call processing.  We could think of ignoring
entering of irreducible regions by tracking exits from them,
but I'm not sure we're identifying the regions in a good
enough way to allow this.

Likewise possibly infinite inner loops can be processed
when we track exits from those which would be sth like

     /* When this is an exit from an inner loop that we cannot
        prove as finite do not follow this edge.  */
     if (bb->loop_father != loop
         && loop_exit_edge_p (bb->loop_father, e)
         && ! finite_loop_p (bb->loop_father))
       continue;

-
-         if (!flow_bb_inside_loop_p (inn_loop, bb))
-           break;
-
-         if (bb->loop_father->header == bb)
-           {
-             if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-               break;
-
-             /* In a loop that is always entered we may proceed anyway.
-                But record that we entered it and stop once we leave it.  */
-             inn_loop = bb->loop_father;
-           }
-       }
-
-      while (1)
-       {
-         SET_ALWAYS_EXECUTED_IN (last, loop);
-         if (last == loop->header)
-           break;
-         last = get_immediate_dominator (CDI_DOMINATORS, last);
        }
-
-      free (bbs);
+      free (bbd);
      }

for (loop = loop->inner; loop; loop = loop->next)

-    fill_always_executed_in_1 (loop, contains_call);
+    fill_always_executed_in_1 (loop, contains_call, bbi);
  }

/* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.

@@ -3287,8 +3278,36 @@ fill_always_executed_in (void)
        bitmap_set_bit (contains_call, bb->index);
      }

+ mark_dfs_back_edges ();


Please put this to loop_invariant_motion_in_fun right before
the call to fill_always_executed_in.


Done.

+  unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+  int *bb_incoming_edges = XNEWVEC (int, array_size);


There's XCNEWVEC that also zeros the array.


OK, thanks.

+  memset (bb_incoming_edges, 0, array_size * sizeof (int));
+  edge_iterator ei;
+  edge e;
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      int len = bb->preds->length ();
+      FOR_EACH_EDGE (e, ei, bb->preds)
+       if (e->flags & EDGE_DFS_BACK)
+         len--;
+      bb_incoming_edges[bb->index] = len;
+    }


it would be nice to combine this with the preceeding loop over
all blocks.


Done.

+  static unsigned long diff = 0;
+  struct timeval tv;
+  gettimeofday (&tv, NULL);
+  unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
+
+  //for (loop : loops_list (cfun, 0))
    for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+    fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
+
+  gettimeofday (&tv, NULL);
+  unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
+  diff += end - begin;
+  //printf ("%s,diff:%ld\n", current_function_name (), diff);
+  free (bb_incoming_edges);


Eh, looks very much like work in progress.  From the O() complexity
point duplicating bb_incoming_edges for each loop makes it quadratic
again but as said I think we don't need this duplication.


Sorry I didn't quite follow your above comments about reusing bb_incoming_edges,
we are searching the bbs that dominate loop->latch loop by loop independently,
if outer loop changes the inner loops bb_incoming_edges, it will make the inner
loop's ALWAYS EXECUTE computation incorrect?

Seems duplicating bb_incoming_edges not increases the complexity but the
recursive call does?


I'd like to see the recursion removed, did the for (loop : ...)
not work out?


The recursion in fill_always_executed_in_1 could be removed since loops_list
could iterate all inner loops but
"for (loop = current_loops->tree_root->inner; loop; loop = loop->next)"
only iterate sibling level loops?

--
Thanks,
Xionghu

From 6047a7b8bbef48794bc3d1608c64ae82e5716430 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo <luo...@linux.ibm.com>
Date: Tue, 17 Aug 2021 20:33:26 -0500
Subject: [PATCH v4] Fix incomplete computation in fill_always_executed_in_1

ALWAYS_EXECUTED_IN is not computed completely for nested loops.
Current design will exit if an inner loop doesn't dominate outer
loop's latch or exit after exiting from inner loop, which
caused early return from outer loop, then ALWAYS EXECUTED blocks after
inner loops are skipped.  Another problem is it has to call
get_loop_body_in_dom_order in each recursive call which is also
inefficient.

This patch does greedy visiting of the loop header successors,
processing further blocks if all entries (ignoring backedges) are
processed, setting SET_ALWAYS_EXECUTED_IN if dominates loop's latch.
When the worklist is empty proceed to inner loops.  For bookkeeping
simply keep a to-visit-incoming-edges counter per BB, and pass it
through to inner loops.

Details discusions:
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html

gcc/ChangeLog:

        * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove
        inn_loop check.  Greedy visit loop header successors and update
        edge counts.
        (fill_always_executed_in): Compute bb_incoming_edges and pass
        down.
        (loop_invariant_motion_in_fun): Compute dfs back edge.

gcc/testsuite/ChangeLog:

        * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
        * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
---
 gcc/tree-ssa-loop-im.c                     | 109 ++++++++++++---------
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  31 ++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  30 ++++++
 3 files changed, 125 insertions(+), 45 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index b24bc64f2a7..5d69d47eaa6 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3192,30 +3192,48 @@ do_store_motion (void)
    blocks that contain a nonpure call.  */
 
 static void
-fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
+fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi)
 {
-  basic_block bb = NULL, *bbs, last = NULL;
-  unsigned i;
+  basic_block bb = NULL;
   edge e;
-  class loop *inn_loop = loop;
 
   if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
     {
-      bbs = get_loop_body_in_dom_order (loop);
+      auto_bitmap work_set;
+      bitmap_set_bit (work_set, loop->header->index);
+      unsigned bb_index;
 
-      for (i = 0; i < loop->num_nodes; i++)
-       {
-         edge_iterator ei;
-         bb = bbs[i];
+      unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+      int *bbd = XCNEWVEC (int, array_size);
+      bbd = XDUPVEC (int, bbi, array_size);
 
+      while (!bitmap_empty_p (work_set))
+       {
+         bb_index = bitmap_first_set_bit (work_set);
+         bitmap_clear_bit (work_set, bb_index);
+         bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
          if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-           last = bb;
-
+           {
+             SET_ALWAYS_EXECUTED_IN (bb, loop);
+             printf ("always executed: bb->index:%d, loop->num: %d\n", 
bb->index, loop->num);
+             if (dump_file && (dump_flags & TDF_DETAILS))
+               fprintf (dump_file,
+                        "always executed: bb->index:%d, loop->num: %d\n",
+                        bb->index, loop->num);
+           }
          if (bitmap_bit_p (contains_call, bb->index))
            break;
 
+         /* A loop might be infinite (TODO use simple loop analysis
+            to disprove this if possible).  */
+         if (bb->flags & BB_IRREDUCIBLE_LOOP)
+           break;
+
+         edge_iterator ei;
          FOR_EACH_EDGE (e, ei, bb->succs)
            {
+             if (e->dest == loop->latch)
+               break;
              /* If there is an exit from this BB.  */
              if (!flow_bb_inside_loop_p (loop, e->dest))
                break;
@@ -3224,42 +3242,20 @@ fill_always_executed_in_1 (class loop *loop, sbitmap 
contains_call)
                                      e->dest->loop_father)
                  && ! finite_loop_p (e->dest->loop_father))
                break;
-           }
-         if (e)
-           break;
-
-         /* A loop might be infinite (TODO use simple loop analysis
-            to disprove this if possible).  */
-         if (bb->flags & BB_IRREDUCIBLE_LOOP)
-           break;
-
-         if (!flow_bb_inside_loop_p (inn_loop, bb))
-           break;
-
-         if (bb->loop_father->header == bb)
-           {
-             if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-               break;
 
-             /* In a loop that is always entered we may proceed anyway.
-                But record that we entered it and stop once we leave it.  */
-             inn_loop = bb->loop_father;
+             if (bbd[e->dest->index] == 1)
+             {
+               bitmap_set_bit (work_set, e->dest->index);
+               bbd[e->dest->index]--;
+             }
+             else if (bbd[e->dest->index] > 1)
+               bbd[e->dest->index]--;
            }
-       }
-
-      while (1)
-       {
-         SET_ALWAYS_EXECUTED_IN (last, loop);
-         if (last == loop->header)
+         if (e)
            break;
-         last = get_immediate_dominator (CDI_DOMINATORS, last);
        }
-
-      free (bbs);
+      free (bbd);
     }
-
-  for (loop = loop->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
 }
 
 /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
@@ -3270,12 +3266,22 @@ static void
 fill_always_executed_in (void)
 {
   basic_block bb;
-  class loop *loop;
+  edge_iterator ei;
+  edge e;
+
+  unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+  int *bb_incoming_edges = XCNEWVEC (int, array_size);
 
   auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
   bitmap_clear (contains_call);
   FOR_EACH_BB_FN (bb, cfun)
     {
+      int len = bb->preds->length ();
+      FOR_EACH_EDGE (e, ei, bb->preds)
+       if (e->flags & EDGE_DFS_BACK)
+         len--;
+      bb_incoming_edges[bb->index] = len;
+
       gimple_stmt_iterator gsi;
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
        {
@@ -3287,8 +3293,19 @@ fill_always_executed_in (void)
        bitmap_set_bit (contains_call, bb->index);
     }
 
-  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+  static unsigned long diff = 0;
+  struct timeval tv;
+  gettimeofday (&tv, NULL);
+  unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
+
+  for (auto loop : loops_list (cfun, 0))
+    fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
+
+  gettimeofday (&tv, NULL);
+  unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
+  diff += end - begin;
+  printf ("%s,diff:%ld\n", current_function_name (), diff);
+  free (bb_incoming_edges);
 }
 
 
@@ -3391,6 +3408,8 @@ loop_invariant_motion_in_fun (function *fun, bool 
store_motion)
   /* Gathers information about memory accesses in the loops.  */
   analyze_memory_references (store_motion);
 
+  mark_dfs_back_edges ();
+
   /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
   fill_always_executed_in ();
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..3d5fe6f0314
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,31 @@
+/* PR/101293 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+struct X { int i; int j; int k;};
+
+void foo(struct X *x, int n, int l, int m)
+{
+  for (int j = 0; j < l; j++)
+    {
+      for (int i = 0; i < n; ++i)
+       {
+#if 1
+         if (m)
+           x->j++;
+         else
+           x->j = m+n+l;
+#endif
+
+         int *p = &x->j;
+         int tem = *p;
+         x->j += tem * i;
+       }
+      int *r = &x->k;
+      int tem2 = *r;
+      x->k += tem2 * j;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..6e180de528e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,30 @@
+/* PR/101293 */
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q)
+{
+  while (--n)
+    {
+      do
+       {
+         *q += 1;
+         if (f)
+           goto out;
+       }
+      while (--m);
+      *p += 1;
+    }
+out:;
+}
+
+int main()
+{
+    int i = 0;
+    foo (10, 10, 1, (void *) 0, &i);
+    if (i != 1)
+      __builtin_abort ();
+    return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */
-- 
2.27.0.90.geebb51ba8c

#include <stdlib.h>
#include <stdio.h>

int n; int **k; int m;

#define FOOAB(A, B) void __attribute__((noipa)) foo##A##B () { \
  for (int i = 0; i < n; ++i) \
  { \
    if (k[0][i]) \
    { \
      for (int j = 0; j < n; ++j) \
        if (k[1][j]) \
          for (int l = 0; l < n; ++l) \
            if (k[2][l]) \
              for (int a = 0; a < n; ++a) \
                if (k[3][a]) \
                  for (int b = 0; b < n; ++b) \
                    if (k[4][b]) \
                      for (int c = 0; c < n; ++c) \
                        if (k[5][c]) \
                          for (int a = 0; a < n; ++a) \
                            if (k[3][a]) \
                              for (int b = 0; b < n; ++b) \
                                if (k[4][b]) \
                                  for (int c = 0; c < n; ++c) \
      for (int j = 0; j < n; ++j) \
        if (k[1][j]) \
          for (int l = 0; l < n; ++l) \
            if (k[2][l]) \
              for (int a = 0; a < n; ++a) \
                if (k[3][a]) \
                  for (int b = 0; b < n; ++b) \
                    if (k[4][b]) \
                      for (int c = 0; c < n; ++c) \
                        if (k[5][c]) \
                          for (int a = 0; a < n; ++a) \
                            if (k[3][a]) \
                              for (int b = 0; b < n; ++b) \
                                if (k[4][b]) \
                                  for (int c = 0; c < n; ++c) \
      for (int j = 0; j < n; ++j) \
        if (k[1][j]) \
          for (int l = 0; l < n; ++l) \
            if (k[2][l]) \
              for (int a = 0; a < n; ++a) \
                if (k[3][a]) \
                  for (int b = 0; b < n; ++b) \
                    if (k[4][b]) \
                      for (int c = 0; c < n; ++c) \
                        if (k[5][c]) \
                          for (int a = 0; a < n; ++a) \
                            if (k[3][a]) \
                              for (int b = 0; b < n; ++b) \
                                if (k[4][b]) \
                                  for (int c = 0; c < n; ++c) \
      for (int j = 0; j < n; ++j) \
        if (k[1][j]) \
          for (int l = 0; l < n; ++l) \
            if (k[2][l]) \
              for (int a = 0; a < n; ++a) \
                if (k[3][a]) \
                  for (int b = 0; b < n; ++b) \
                    if (k[4][b]) \
                      for (int c = 0; c < n; ++c) \
                        if (k[5][c]) \
                          for (int a = 0; a < n; ++a) \
                            if (k[3][a]) \
                              for (int b = 0; b < n; ++b) \
                                if (k[4][b]) \
                                  for (int c = 0; c < n; ++c) \
      for (int j = 0; j < n; ++j) \
        if (k[1][j]) \
          for (int l = 0; l < n; ++l) \
            if (k[2][l]) \
              for (int a = 0; a < n; ++a) \
                if (k[3][a]) \
                  for (int b = 0; b < n; ++b) \
                    if (k[4][b]) \
                      for (int c = 0; c < n; ++c) \
                        if (k[5][c]) \
                          for (int a = 0; a < n; ++a) \
                            if (k[3][a]) \
                              for (int b = 0; b < n; ++b) \
                                if (k[4][b]) \
                                  for (int c = 0; c < n; ++c) \
      for (int j = 0; j < n; ++j) \
        if (k[1][j]) \
          for (int l = 0; l < n; ++l) \
            if (k[2][l]) \
              for (int a = 0; a < n; ++a) \
                if (k[3][a]) \
                  for (int b = 0; b < n; ++b) \
                    if (k[4][b]) \
                      for (int c = 0; c < n; ++c) \
                        if (k[5][c]) \
                          for (int a = 0; a < n; ++a) \
                            if (k[3][a]) \
                              for (int b = 0; b < n; ++b) \
                                if (k[4][b]) \
                                  for (int c = 0; c < n; ++c) \
      for (int j = 0; j < n; ++j) \
        if (k[1][j]) \
          for (int l = 0; l < n; ++l) \
            if (k[2][l]) \
              for (int a = 0; a < n; ++a) \
                if (k[3][a]) \
                  for (int b = 0; b < n; ++b) \
                    if (k[4][b]) \
                      for (int c = 0; c < n; ++c) \
                        if (k[5][c]) \
                          for (int a = 0; a < n; ++a) \
                            if (k[3][a]) \
                              for (int b = 0; b < n; ++b) \
                                if (k[4][b]) \
                                  for (int c = 0; c < n; ++c) \
      for (int j = 0; j < n; ++j) \
        if (k[1][j]) \
          for (int l = 0; l < n; ++l) \
            if (k[2][l]) \
              for (int a = 0; a < n; ++a) \
                if (k[3][a]) \
                  for (int b = 0; b < n; ++b) \
                    if (k[4][b]) \
                      for (int c = 0; c < n; ++c) \
                        if (k[5][c]) \
                          for (int a = 0; a < n; ++a) \
                            if (k[3][a]) \
                              for (int b = 0; b < n; ++b) \
                                if (k[4][b]) \
                                  for (int c = 0; c < n; ++c) \
        if (k[1][j]) \
          for (int l = 0; l < n; ++l) \
            if (k[2][l]) \
              for (int a = 0; a < n; ++a) \
                if (k[3][a]) \
                  for (int b = 0; b < n; ++b) \
                    if (k[4][b]) \
                      for (int c = 0; c < n; ++c) \
                        if (k[5][c]) \
                          for (int a = 0; a < n; ++a) \
                            if (k[3][a]) \
                              for (int b = 0; b < n; ++b) \
                                if (k[4][b]) \
                                  for (int c = 0; c < n; ++c) \
                                  { \
                                    if (k[5][c]) \
                                      { \
                                        int *r = &k[2][m]; \
                                        int tem2 = *r; \
                                        k[2][m] += tem2 * i; \
                                      } \
                                    else \
                                      { \
                                        int *r = &k[5][m]; \
                                        int tem2 = *r; \
                                        k[5][m] += tem2 * i; \
                                      } \
                                    if (k[7][c]) \
                                      { \
                                        int *r = &k[3][m]; \
                                        int tem2 = *r; \
                                        k[3][m] += tem2 * i; \
                                      } \
                                    else \
                                      { \
                                        int *r = &k[3][m]; \
                                        int tem2 = *r; \
                                        k[3][m] += tem2 * i; \
                                      } \
                                    int *t = &k[8][m]; \
                                    int tem2 = *t; \
                                    k[8][m] += tem2 * i; \
                                  } \
    } \
    int *s = &k[4][m]; \
    int tem2 = *s; \
    k[4][m] += tem2 * i; \
  } \
} \

#define FOOB(B)  FOOAB (1,B)\
FOOAB (2,B) \
FOOAB (3,B) \
FOOAB (4,B) \
FOOAB (5,B)

FOOB (0)
#if 1
FOOB (1)
FOOB (2)
FOOB (3)
FOOB (4)
#if 0
FOOB (5)
FOOB (6)
FOOB (7)
FOOB (8)
FOOB (9)
#endif
#endif

Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1

Reply via email to