https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102906

--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 10 Nov 2021, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102906
> 
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |amacleod at redhat dot com
> 
> --- Comment #12 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #11)
> 
> First of all, thanks for your detailed analysis.  It's very helpful.
> 
> > So it seems we intentionally allowed some loop header copying done by
> > the threader based on the idea that it knows when it can do so without
> > a code size penalty.  The loop header copying pass has
> > 
> > static bool
> > should_duplicate_loop_header_p (basic_block header, class loop *loop,
> >                                 int *limit)
> > {
> >   gimple_stmt_iterator bsi;
> > 
> >   gcc_assert (!header->aux);
> > 
> >   /* Loop header copying usually increases size of the code.  This used not
> > to
> >      be true, since quite often it is possible to verify that the condition
> > is
> >      satisfied in the first iteration and therefore to eliminate it.  Jump
> >      threading handles these cases now.  */
> >   if (optimize_loop_for_size_p (loop)
> >       && !loop->force_vectorize)
> >     {
> >       if (dump_file && (dump_flags & TDF_DETAILS))
> >         fprintf (dump_file,
> >                  "  Not duplicating bb %i: optimizing for size.\n",
> >                  header->index);
> >       return false;
> > 
> > and indeed for the testcase we have
> > 
> > Analyzing loop 1
> > Loop 1 is not do-while loop: latch is not empty.
> >   Not duplicating bb 4: optimizing for size.
> > 
> > and were probably expecting jump-threading to rotate the loop.  Now, the
> > forward threader is set up to eventually not destroy loop info but then
> > loop-header copying might be able to either use ranger to tell the
> > looping condition is always true [and the block to copy is empty] or
> > loop header copying could use the jump threading path code which might be
> > able to determine all this(?)
> 
> Hmmm, are you asking if a path through 6->4 resolves the x_7 < n_8 
> conditional?
> 
>   <bb 6> [local count: 118111600]:
>   goto <bb 4>; [100.00%]
> 
>   <bb 4> [local count: 1073741824]:
>   # sum_5 = PHI <0(6), sum_11(3)>
>   # x_7 = PHI <0(6), x_12(3)>
>   if (x_7 < n_8(D))
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 5>; [11.00%]
> 
> If so, that's trivially answered by the path solver.  We could put BB6 and BB4
> in the path, and ask it what the final conditional resolves to.
> 
> > 
> > That said, for GCC 12 we probably need to look to allow those threadings,
> > but only once and for the forward threader?  The early threader doesn't
> > seem to discover this opportunity for some reason.
> 
> It sees it, but fails to thread it because it would peel the loop.  From the
> .ethread dump:
> 
>   FAIL: path through PHI in bb4 (incoming bb:2) crosses loop
>   path: 2->4->xx REJECTED
> 
> > 
> > At the point we run jt_path_registry::cancel_invalid_paths do we have
> > any ideas about the size of the copying we do?
> 
> Not really, but we could adapt something.
> 
> The forward threader threads blindly with regards to size, but it does have
> smarts to prune paths at -Os if there will be statement duplication (look for
> optimize_function_for_size_p in *threadupdate.c).  The forward threader can 
> get
> away with not looking too hard, because it can't thread deep enough paths for
> it to matter.
> 
> On the other hand, things can get out of control with the backward threaders
> because it can find infinitely long paths.  It tracks statement costs with
> estimate_num_insns.
> 
> As usual, we have a mess of different implementations.  It would be nice to
> ultimately merge these two approaches.
> 
> > 
> > So the pattern we'd allow is a threading with entry being the single
> > entry edge into the loop and the exit being the other edge from the
> > single_exit source block of it.
> > 
> > I wonder why the late DOM threading doesn't catch this?  We'd allow
> > the threading there with
> > 
> >   if (cfun->curr_properties & PROP_loop_opts_done)
> >     return false;
> > 
> > the late pass sees
> > 
> >   <bb 2> [local count: 200189151]:
> >   if (n_8(D) > 0)
> >     goto <bb 3>; [59.00%]
> >   else
> >     goto <bb 6>; [41.00%]
> > 
> >   <bb 3> [local count: 118111600]:
> >   ivtmp.9_16 = (unsigned int) array_9(D);
> >   _17 = (unsigned int) n_8(D);
> >   _18 = _17 * 4;
> >   _20 = ivtmp.9_16 + _18;
> >   goto <bb 5>; [100.00%]
> > 
> >   <bb 5> [local count: 1073741824]:
> >   # sum_5 = PHI <0(3), sum_11(4)>
> >   # ivtmp.9_14 = PHI <ivtmp.9_16(3), ivtmp.9_15(4)>
> >   if (ivtmp.9_14 != _20)
> >     goto <bb 4>; [89.00%]
> >   else
> >     goto <bb 6>; [11.00%]
> > 
> > so appearantly with IVCANON introducing a canonical IV ranger cannot
> > relate ivtmp.9_16 and _20 using the n_8(D) > 0 condition (OK, that
> > looks mightly complex, but mostly because n_8(D) * 4 is involved which
> > we don't know it doesn't overflow to zero).
> 
> Ranger is not involved at all.  DOM is the only threader that does not use
> neither ranger nor the path solver.  The only difference in this release for
> the DOM threader is the restrictions we put in place for loop threading.
> 
> > 
> > diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
> > index 8aac733ac25..3670ec5b8d0 100644
> > --- a/gcc/tree-ssa-threadupdate.c
> > +++ b/gcc/tree-ssa-threadupdate.c
> > @@ -2815,6 +2815,20 @@ jt_path_registry::cancel_invalid_paths
> > (vec<jump_thread_edge *> &path)
> >        && flow_loop_nested_p (exit->dest->loop_father,
> > exit->src->loop_father))
> >      return false;
> >  
> > +  /* Allow loop header copying when we know the entry condition is
> > +     optimized away.  The loop header copying pass relies on those
> > +     being done here when optimizing for size.  */
> > +  edge e;
> > +  if (entry->dest == exit->src
> > +      && entry->dest->loop_father->header == entry->dest
> > +      /* Alternatively check for a single non-backedge pred of entry->dest.
> > */
> > +      && loops_state_satisfies_p (cfun, LOOPS_HAVE_PREHEADERS)
> > +      && (e = single_exit (entry->dest->loop_father))
> > +      && exit != e
> > +      && exit->src == e->src
> > +      && optimize_loop_for_size_p (entry->dest->loop_father))
> > +    return false;
> > +
> >    if (cfun->curr_properties & PROP_loop_opts_done)
> >      return false;
> >  
> > 
> > fixes the missing jump threading.  Possibly the predicate should be put next
> > to should_duplicate_loop_header_p in tree-ssa-loop-ch.c.
> 
> So, we could fix this either by relaxing the restriction with your patch, or 
> by
> teaching should_duplicate_loop_header_p that an incoming edge can resolve the
> conditional in the header?

Yes.  I think the latter would be cleaner but the former has an (ugly)
patch already (see above).  Not sure how difficult it is to do the latter,
you are likely 10x quicker than me to figure that out ;)

Reply via email to