https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102906

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
   Last reconfirmed|                            |2021-11-09
             Status|UNCONFIRMED                 |NEW
          Component|testsuite                   |tree-optimization
             Target|                            |arm
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Aldy Hernandez from comment #10)
> (In reply to Christophe Lyon from comment #9)
> > (In reply to Aldy Hernandez from comment #7)
> > > > then size ivopts-4.o:
> > > >    text    data     bss     dec     hex filename
> > > >      38       0       0      38      26 ivopts-4.o
> > > > where the testcase expects text <= 36
> > > 
> > > Ohhhhh, this is an object size regression?  This test seems very fragile. 
> > > Jump threading will alter code size, so any change in the threading rules
> > > will likely have an effect on code size.  I suggest you add
> > > -fno-thread-jumps to the test and adjust the object-size test accordingly.
> > 
> > I tried that, it doesn't change the generated code.
> 
> The difference from before the commit til now is that there was a threading
> path that was valid but is now disallowed.  So adding -fno-thread-jumps
> won't have any effect since it won't cause the disallowed threading path to
> reappear.  What I'm saying is that the test should be calibrated to the new
> normal.
> 
> Of course, it could be that another pass should pick up the slack here, or
> that the restriction is too strict.  Richi, do you have some insight here?
> 
> At least for -Os  -mthumb -mfloat-abi=hard -mfpu=neon -mtls-dialect=gnu
> -mlibarch=armv7-a+mp+sec+neon-fp16 -march=armv7-a+mp+sec+neon-fp16, the
> difference in the threader is that we used to thread 6->4->3 in DOM2, but we
> no longer do so because doing so would rotate the loop:
> 
> a.c.127t.dom2:Path rotates loop:   Cancelling jump thread: (6, 4) incoming
> edge;  (4, 3) normal; 
> 
> I still think testing for some magic code size is fragile at best.

So it seems we intentionally allowed some loop header copying done by
the threader based on the idea that it knows when it can do so without
a code size penalty.  The loop header copying pass has

static bool
should_duplicate_loop_header_p (basic_block header, class loop *loop,
                                int *limit)
{
  gimple_stmt_iterator bsi;

  gcc_assert (!header->aux);

  /* Loop header copying usually increases size of the code.  This used not to
     be true, since quite often it is possible to verify that the condition is
     satisfied in the first iteration and therefore to eliminate it.  Jump
     threading handles these cases now.  */
  if (optimize_loop_for_size_p (loop)
      && !loop->force_vectorize)
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file,
                 "  Not duplicating bb %i: optimizing for size.\n",
                 header->index);
      return false;

and indeed for the testcase we have

Analyzing loop 1
Loop 1 is not do-while loop: latch is not empty.
  Not duplicating bb 4: optimizing for size.

and were probably expecting jump-threading to rotate the loop.  Now, the
forward threader is set up to eventually not destroy loop info but then
loop-header copying might be able to either use ranger to tell the
looping condition is always true [and the block to copy is empty] or
loop header copying could use the jump threading path code which might be
able to determine all this(?)

That said, for GCC 12 we probably need to look to allow those threadings,
but only once and for the forward threader?  The early threader doesn't
seem to discover this opportunity for some reason.

At the point we run jt_path_registry::cancel_invalid_paths do we have
any ideas about the size of the copying we do?

So the pattern we'd allow is a threading with entry being the single
entry edge into the loop and the exit being the other edge from the
single_exit source block of it.

I wonder why the late DOM threading doesn't catch this?  We'd allow
the threading there with

  if (cfun->curr_properties & PROP_loop_opts_done)
    return false;

the late pass sees

  <bb 2> [local count: 200189151]:
  if (n_8(D) > 0)
    goto <bb 3>; [59.00%]
  else
    goto <bb 6>; [41.00%]

  <bb 3> [local count: 118111600]:
  ivtmp.9_16 = (unsigned int) array_9(D);
  _17 = (unsigned int) n_8(D);
  _18 = _17 * 4;
  _20 = ivtmp.9_16 + _18;
  goto <bb 5>; [100.00%]

  <bb 5> [local count: 1073741824]:
  # sum_5 = PHI <0(3), sum_11(4)>
  # ivtmp.9_14 = PHI <ivtmp.9_16(3), ivtmp.9_15(4)>
  if (ivtmp.9_14 != _20)
    goto <bb 4>; [89.00%]
  else
    goto <bb 6>; [11.00%]

so appearantly with IVCANON introducing a canonical IV ranger cannot
relate ivtmp.9_16 and _20 using the n_8(D) > 0 condition (OK, that
looks mightly complex, but mostly because n_8(D) * 4 is involved which
we don't know it doesn't overflow to zero).

diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index 8aac733ac25..3670ec5b8d0 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2815,6 +2815,20 @@ jt_path_registry::cancel_invalid_paths
(vec<jump_thread_edge *> &path)
       && flow_loop_nested_p (exit->dest->loop_father, exit->src->loop_father))
     return false;

+  /* Allow loop header copying when we know the entry condition is
+     optimized away.  The loop header copying pass relies on those
+     being done here when optimizing for size.  */
+  edge e;
+  if (entry->dest == exit->src
+      && entry->dest->loop_father->header == entry->dest
+      /* Alternatively check for a single non-backedge pred of entry->dest. 
*/
+      && loops_state_satisfies_p (cfun, LOOPS_HAVE_PREHEADERS)
+      && (e = single_exit (entry->dest->loop_father))
+      && exit != e
+      && exit->src == e->src
+      && optimize_loop_for_size_p (entry->dest->loop_father))
+    return false;
+
   if (cfun->curr_properties & PROP_loop_opts_done)
     return false;


fixes the missing jump threading.  Possibly the predicate should be put next
to should_duplicate_loop_header_p in tree-ssa-loop-ch.c.

Reply via email to