Re: [RFC] Don't move cold code out of loop by checking bb count

Richard Biener via Gcc-patches Fri, 29 Oct 2021 04:48:51 -0700

On Wed, Oct 27, 2021 at 4:40 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>
>
>
> On 2021/10/26 21:20, Richard Biener wrote:
> > On Mon, Oct 18, 2021 at 6:29 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
> >>
> >>
> >>
> >> On 2021/10/15 16:11, Richard Biener wrote:
> >>> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 2021/9/28 20:09, Richard Biener wrote:
> >>>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luo...@linux.ibm.com> 
> >>>>> wrote:
> >>>>>>
> >>>>>> Update the patch to v3, not sure whether you prefer the paste style
> >>>>>> and continue to link the previous thread as Segher dislikes this...
> >>>>>>
> >>>>>>
> >>>>>> [PATCH v3] Don't move cold code out of loop by checking bb count
> >>>>>>
> >>>>>>
> >>>>>> Changes:
> >>>>>> 1. Handle max_loop in determine_max_movement instead of
> >>>>>> outermost_invariant_loop.
> >>>>>> 2. Remove unnecessary changes.
> >>>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in 
> >>>>>> can_sm_ref_p.
> >>>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it 
> >>>>>> caused
> >>>>>> infinite loop when implementing v1 and the iteration is missed to be
> >>>>>> updated actually.
> >>>>>>
> >>>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> >>>>>> v2: 
> >>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> >>>>>>
> >>>>>> There was a patch trying to avoid move cold block out of loop:
> >>>>>>
> >>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>>>>>
> >>>>>> Richard suggested to "never hoist anything from a bb with lower 
> >>>>>> execution
> >>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
> >>>>>> before_dom_children".
> >>>>>>
> >>>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
> >>>>>> expected target loop, if profile count of the loop bb is colder
> >>>>>> than target loop preheader, it won't be hoisted out of loop.
> >>>>>> Likely for store motion, if all locations of the REF in loop is cold,
> >>>>>> don't do store motion of it.
> >>>>>>
> >>>>>> SPEC2017 performance evaluation shows 1% performance improvement for
> >>>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
> >>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >>>>>> on P8LE.
> >>>>>>
> >>>>>> gcc/ChangeLog:
> >>>>>>
> >>>>>>         * loop-invariant.c (find_invariants_bb): Check profile count
> >>>>>>         before motion.
> >>>>>>         (find_invariants_body): Add argument.
> >>>>>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
> >>>>>>         (determine_max_movement): Use find_coldest_out_loop.
> >>>>>>         (move_computations_worker): Adjust and fix iteration udpate.
> >>>>>>         (execute_sm_exit): Check pointer validness.
> >>>>>>         (class ref_in_loop_hot_body): New functor.
> >>>>>>         (ref_in_loop_hot_body::operator): New.
> >>>>>>         (can_sm_ref_p): Use for_all_locs_in_loop.
> >>>>>>
> >>>>>> gcc/testsuite/ChangeLog:
> >>>>>>
> >>>>>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >>>>>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
> >>>>>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >>>>>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >>>>>> ---
> >>>>>>  gcc/loop-invariant.c                       | 10 ++--
> >>>>>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
> >>>>>>  7 files changed, 165 insertions(+), 8 deletions(-)
> >>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> >>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >>>>>>
> >>>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> >>>>>> index fca0c2b24be..5c3be7bf0eb 100644
> >>>>>> --- a/gcc/loop-invariant.c
> >>>>>> +++ b/gcc/loop-invariant.c
> >>>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool 
> >>>>>> always_reached, bool always_executed)
> >>>>>>     call.  */
> >>>>>>
> >>>>>>  static void
> >>>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool 
> >>>>>> always_executed)
> >>>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool 
> >>>>>> always_reached,
> >>>>>> +                   bool always_executed)
> >>>>>>  {
> >>>>>>    rtx_insn *insn;
> >>>>>> +  basic_block preheader = loop_preheader_edge (loop)->src;
> >>>>>> +
> >>>>>> +  if (preheader->count > bb->count)
> >>>>>> +    return;
> >>>>>>
> >>>>>>    FOR_BB_INSNS (bb, insn)
> >>>>>>      {
> >>>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, 
> >>>>>> basic_block *body,
> >>>>>>    unsigned i;
> >>>>>>
> >>>>>>    for (i = 0; i < loop->num_nodes; i++)
> >>>>>> -    find_invariants_bb (body[i],
> >>>>>> -                       bitmap_bit_p (always_reached, i),
> >>>>>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, 
> >>>>>> i),
> >>>>>>                         bitmap_bit_p (always_executed, i));
> >>>>>>  }
> >>>>>>
> >>>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >>>>>> index 4b187c2cdaf..655fab03442 100644
> >>>>>> --- a/gcc/tree-ssa-loop-im.c
> >>>>>> +++ b/gcc/tree-ssa-loop-im.c
> >>>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
> >>>>>>    return ret;
> >>>>>>  }
> >>>>>>
> >>>>>> +/* Find coldest loop between outmost_loop and loop by comapring 
> >>>>>> profile count.  */
> >>>>>> +
> >>>>>> +static class loop *
> >>>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >>>>>> +                      basic_block curr_bb)
> >>>>>> +{
> >>>>>> +  class loop *cold_loop, *min_loop;
> >>>>>> +  cold_loop = min_loop = outmost_loop;
> >>>>>> +  profile_count min_count = loop_preheader_edge 
> >>>>>> (min_loop)->src->count;
> >>>>>> +
> >>>>>> +  if (curr_bb && curr_bb->count < loop_preheader_edge 
> >>>>>> (loop)->src->count)
> >>>>>
> >>>>> Honza - can you comment on whether we should compare BB counts this way?
> >>>>>
> >>>>> I would suspect that for, say,
> >>>>>
> >>>>>   for (...)
> >>>>>      if (a)
> >>>>>        X;
> >>>>>      else
> >>>>>        Y;
> >>>>>
> >>>>> that the counts for X and Y will be less than that of the preheader of 
> >>>>> the loop
> >>>>> only when the loop is estimated to run once.  That is, should we really 
> >>>>> compare
> >>>>> the to the preheader count or maybe better to the _header_ count which
> >>>>> would keep the number of iterations out of the equation?
> >>>>
> >>>> I quickly tried to replace all the loop_preheader_edge (loop)->src with
> >>>> loop_preheader_edge (loop)->dest, it will cause many failures in
> >>>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it 
> >>>> seems
> >>>> reasonable to compare the bb count with preheader count as both gimple 
> >>>> lim
> >>>> and RTL loop-invariant move instructions to *preheader* instead of 
> >>>> *header*
> >>>> after analysis?
> >>>
> >>> Hmm, yeah - guess I was confused here.
> >>>
> >>>>>
> >>>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to
> >>>>> compare a count to the "IPA hot", here we're comparing two counts
> >>>>> within a function where it actually matters whether we use a<b or
> >>>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> >>>>> function).
> >>>>>
> >>>>> Xionghu, you error on the side of not hoisting for unordered counts here
> >>>>>
> >>>>>> +    return NULL;
> >>>>>> +
> >>>>>> +  while (min_loop != loop)
> >>>>>> +    {
> >>>>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >>>>>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> >>>>>
> >>>>> but in the other direction here and on the side of not hoisting
> >>>>> in ref_in_loop_hot_body.
> >>>>>
> >>>>> The three-state relational operator overloads are probably not the
> >>>>> very best idea...
> >>>>> (see profile-count.h for them)
> >>>>>
> >>>> Added new function bb_colder_than_loop_preheader to encapsulate the 
> >>>> comparision,
> >>>> if FALSE is returned due to three-state inequality,  
> >>>> find_coldest_out_loop
> >>>> will return the original input to lim->max_loop, and 
> >>>> ref_in_loop_hot_body::operator ()
> >>>> will return true to continue perform store motion, both preserve the 
> >>>> previous
> >>>> behavior.
> >>>
> >>> Thanks.  But I don't think the abstraction as written is useful:
> >>>
> >>> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is 
> >>> three-state
> >>> +   as stated in profile-count.h, FALSE is returned if inequality cannot 
> >>> be
> >>> +   decided.  */
> >>> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count 
> >>> count2)
> >>> +{
> >>> +  if (count1 < count2)
> >>> +    return true;
> >>> +  else
> >>> +    return false;
> >>> +}
> >>>
> >>> given the following seems to pass the preheader count in place of the BB 
> >>> count.
> >>>
> >>> +      if (bb_colder_than_loop_preheader (
> >>> +           loop_preheader_edge (min_loop)->src->count, min_count))
> >>> +       cold_loop = min_loop;
> >>>
> >>> find_coldest_out_loop is also a bit weird, I think we want to find
> >>> the outermost loop between outmost_loop and loop that has a
> >>> lower count than the curr_bb count but
> >>>
> >>> +  while (min_loop != loop)
> >>> +    {
> >>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >>> +      if (bb_colder_than_loop_preheader (
> >>> +           loop_preheader_edge (min_loop)->src->count, min_count))
> >>> +       cold_loop = min_loop;
> >>>
> >>> compares the outermost loops count (min_count) against the preheader
> >>> count?  So we're searching for a cold loop with respect to its enclosing 
> >>> loop
> >>> here?
> >>
> >> Let me try to explain how it works :)
> >>
> >> find_coldest_out_loop does two steps check:
> >> 1) Check whether curr_bb is cold in it's own loop_father, if it is cold,
> >> just return NULL which means it should not be moved out at all;
> >> 2)  curr_bb is NOT cold, assuming the current loop L[m] is the coldest 
> >> first,
> >> than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]},
> >> if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop
> >> that has smallest profile_count.
> >>
> >>
> >> Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is 
> >> cold in
> >> loop 3, if it is cold, just return NULL, otherwise select the coldest loop 
> >> in
> >> {loop1, loop2, loop3} and find that loop2 is colder than loop3, return 
> >> loop2 to
> >> be the target hoist loop.  The first check could AVOID hoist if curr_bb is 
> >> colder
> >> than loop3, but it is still hot than loop1 and loop2.  Not sure whether it 
> >> is possible
> >> to construct such cases?
> >>
> >>
> >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>
> >> volatile int x;
> >> void
> >> bar (int, char *, char *);
> >> void
> >> foo (int *a, int n, int m, int s, int t)
> >> {
> >>   int i;
> >>   int j;
> >>   int k;
> >>
> >>   for (i = 0; i < m; i++)  // loop 1
> >>     {
> >>       if (__builtin_expect (x, 0))
> >>         for (j = 0; j < n; j++)   // loop 2
> >>           for (k = 0; k < n; k++)   // loop 3
> >>            {
> >>              bar (s / 5, "one", "two");  // curr_bb
> >>              a[t] = s;
> >>            }
> >>       a[t] = t;  // curr_bb2
> >>     }
> >> }
> >>
> >> The 4 invariant statements are moved to bb 11(loop2) instead of bb 
> >> 10(loop1)
> >> with this patch.
> >> There are totally 6 combinations when curr_bb is hotter than loop 3.  We 
> >> need
> >> to compare the "Loop preheader hotness" instead of "every Loop[i] and 
> >> curr_bb hotness",
> >> returning the coldest loop for this function find_coldest_out_loop, 
> >> otherwise
> >> unexpected behavior happens.
> >>
> >> L1 > L2 > L3   =>  return L3
> >> L1 > L3 > L2   =>  return L2
> >> L2 > L1 > L3   =>  return L3
> >> L2 > L3 > L1   =>  return L1
> >> L3 > L1 > L2   =>  return L2
> >> L3 > L2 > L1   =>  return L1
> >>
> >> So bb_colder_than_loop_preheader does two kind of checks, one is checking
> >> L3 preheader count with curr_bb count, another is checking L3 preheader 
> >> count
> >> with L1 preheader count, L2 preheader count, etc...
> >>
> >>
> >> ssa-lim-19.c.138t.lim2:
> >> ...
> >>    <bb 10> [local count: 16057869]:  // L1 preheader
> >> -  _4 = s_22(D) / 5;
> >> -  _5 = (long unsigned int) t_24(D);
> >> -  _6 = _5 * 4;
> >> -  _7 = a_25(D) + _6;
> >>    _8 = (long unsigned int) t_24(D);
> >>    _9 = _8 * 4;
> >>    _10 = a_25(D) + _9;
> >>
> >>    <bb 3> [local count: 145980626]:
> >>    # i_34 = PHI <i_29(12), 0(10)>
> >>    x.0_1 ={v} x;
> >>    if (x.0_1 != 0)
> >>      goto <bb 4>; [10.00%]
> >>    else
> >>      goto <bb 8>; [90.00%]
> >>
> >>    <bb 4> [local count: 14598063]:
> >>    if (n_20(D) > 0)
> >>      goto <bb 11>; [89.00%]
> >>    else
> >>      goto <bb 8>; [11.00%]
> >>
> >>    <bb 11> [local count: 12992276]:  // L2 preheader
> >> +  _4 = s_22(D) / 5;
> >> +  _5 = (long unsigned int) t_24(D);
> >> +  _6 = _5 * 4;
> >> +  _7 = a_25(D) + _6;
> >>    goto <bb 7>; [100.00%]
> >>
> >>    <bb 14> [local count: 850510901]:
> >>
> >>    <bb 5> [local count: 955630225]:  // curr_bb
> >>    # k_36 = PHI <k_27(14), 0(7)>
> >>    bar (_4, "one", "two");
> >>    *_7 = s_22(D);
> >>    k_27 = k_36 + 1;
> >>    if (n_20(D) > k_27)
> >>      goto <bb 14>; [89.00%]
> >>    else
> >>      goto <bb 6>; [11.00%]
> >>
> >>    <bb 6> [local count: 118111600]:
> >>    j_21 = j_35 + 1;
> >>    if (n_20(D) > j_21)
> >>      goto <bb 13>; [89.00%]
> >>    else
> >>      goto <bb 8>; [11.00%]
> >>
> >>    <bb 13> [local count: 105119324]:
> >>
> >>    <bb 7> [local count: 118111600]:   // L3 preheader
> >>    # j_35 = PHI <j_21(13), 0(11)>
> >>    goto <bb 5>; [100.00%]
> >>
> >>    <bb 8> [local count: 145980626]:
> >>    *_10 = t_24(D);
> >>    i_29 = i_34 + 1;
> >>
> >> Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop:
> >>
> >> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is 
> >> three-state
> >> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> >> +   decided.  */
> >> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count 
> >> count2)
> >> +{
> >> +  if (count1 < count2)
> >> +    return true;
> >> +  else
> >> +    return false;
> >> +}
> >> +
> >> +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile 
> >> count.
> >> + */
> >> +
> >> +static class loop *
> >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >> +                      basic_block curr_bb)
> >> +{
> >> +  class loop *cold_loop, *min_loop;
> >> +  cold_loop = min_loop = outmost_loop;
> >> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> >> +
> >> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> >> +    comparision, OUTMOST_LOOP is returned finally to preserve the 
> >> behavior.
> >> +    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
> >> +  if (curr_bb
> >> +      && bb_colder_than_loop_preheader (curr_bb->count,
> >> +                                       loop_preheader_edge 
> >> (loop)->src->count))
> >> +    return NULL;
> >> +
> >> +  while (min_loop != loop)
> >> +    {
> >> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >> +      if (bb_colder_than_loop_preheader (
> >> +           loop_preheader_edge (min_loop)->src->count, min_count))
> >> +       cold_loop = min_loop;
> >> +    }
> >> +  return cold_loop;
> >> +}
> >> +
> >>
> >>
> >>>
> >>> Why is this function not simply
> >>>
> >>> +static class loop *
> >>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >>> +                      basic_block curr_bb)
> >>> +{
> >>>      while (bb_colder_than_loop_preheader (curr_bb->count,
> >>>                loop_preheader_edge (outermost_loop)->src->count))
> >>>         {
> >>>             if (outermost_loop == loop)
> >>>               return NULL;
> >>>             outermost_loop = superloop_at_depth (loop, loop_depth
> >>> (outermost_loop) + 1);
> >>>         }
> >>>      return outermost_loop;
> >>> }
> >>
> >> If change like this, when processing curr_bb(5), outermost_loop will
> >> return loop 1 since curr_bb->count > Loop1_prehead->count, the while
> >> loop stopped.  This doesn't meet what we want.
> >
> > Why?  curr_bb is executed at least as often as loop1 preheader if
> > we look at the counts?  So either the counts do not really tell us
> > anything of help or I am missing something.  Are you merely
> > looking for a block with a lower count on the path from the outermost
> > loop entry to the block in question and deciding you do not want to
> > hoist further than that?  So it's not about not hoisting to a hot place
> > but instead hoist to the coldest place within a loop nest?
> >
> > So we have
> >
> >   for (i = 0; i < m; i++)  // loop 1
> >     {
> >       if (__builtin_expect (x, 0))
> >         for (j = 0; j < n; j++)   // loop 2
> >
> >
> >    <bb 10> [local count: 16057869]:  // L1 preheader
> >        ...
> >  <bb 3> [local count: 145980626]:
> >    # i_34 = PHI <i_29(12), 0(10)>
> >  ...
> >    <bb 11> [local count: 12992276]:  // L2 preheader
> >    ...
> >     <bb 7> [local count: 118111600]:   // L3 preheader
> >    # j_35 = PHI <j_21(13), 0(11)>
> >    goto <bb 5>; [100.00%]
> >
> > and we want to hoist to the L2 preheader because that's less frequently
> > executed than the L1 preheader (which is less frequently executed
> > than the L3 preheader or the block we are hoisting from).
>
> Yes, this is exactly what I want, sorry for not describe it clear before ;(


OK, thanks for confirming ;)

> The updated patch[1] may reflect find_coldest_out_loop better:
> It first check whether curr_bb is hotter than it's preheader, if false, 
> return NULL
> which means no need hoist at all; Then find a *coldest* preheader to hoist
> within a loop nest from outmost_loop.
>
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
>
>
> +/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile 
> count.
> +   It does two steps check:
> +   1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, 
> just
> +   return NULL which means it should not be moved out at all;
> +   2)  CURR_BB is NOT cold, set LOOP to cold_loop, then iteratively search 
> loops
> +   from {L[outmost_loop], L[outmost_loop+1], ... L[loop]}, if L[i] is colder
> +   than L[cold_loop], reset cold_loop to L[i] until get the loop that has
> +   smallest profile_count.  */
> +
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                      basic_block curr_bb)
> +{
> +  class loop *cold_loop;
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
> +    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
> +  if (curr_bb
> +      && bb_colder_than_loop_preheader (curr_bb,
> +                                       loop_preheader_edge (loop)->src))
> +    return NULL;
> +
> +  cold_loop = loop;
> +  while (outmost_loop != loop)
> +    {
> +      if (bb_colder_than_loop_preheader (loop_preheader_edge 
> (outmost_loop)->src,
> +                                        loop_preheader_edge 
> (cold_loop)->src))
> +       cold_loop = outmost_loop;
> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 
> 1);
> +    }
> +  return cold_loop;
> +}
>
>
> >
> > I'm concerned with compile-time complexity re-evaluating counts on the
> > loop nest many times.  So it looks to me that we can pre-compute
> > this lowest-preheader-count loop for a loop nest at least for the
> > store-motion case where we know the outermost loop?
> >
> >
>
> But the lowest-preheader-count loop may change for a loop/bb with different
> outermost loop.  For example if,
>
> L1_preheader_count < L2_preheader_count < L3_preheader_count < 
> L4_preheader_count < curr_bb_count
>
> then,
>
> find_coldest_out_loop (L1, loop, curr_bb)  => coldest preheader loop is L1
> find_coldest_out_loop (L2, loop, curr_bb)  => coldest preheader loop is L2
>
> So it will be a 1:N map?

I'm talking about the can_sm_ref_p call, in that context 'loop' will
be the outermost loop of
interest, and we are calling this for all stores in a loop.  We're doing

+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  return find_coldest_out_loop (l, inner_loop, curr_bb);

for each location the ref is accessed and the intent was to see
whether there's at least one
that we would like to move to 'loop'.  Indeed since we only know the
common outer loop
but not the inner we are hosting from there's not a single "coldest"
loop to cache and so
any caching we might want to perform could be applied to the other case as well.

I suppose the most natural thing to cache is for each loop the outer loop where
its outer loop preheader would be hotter than the outer loops preheader so that

+  while (outmost_loop != loop)
+    {
+      if (bb_colder_than_loop_preheader (loop_preheader_edge
(outmost_loop)->src,
+                                        loop_preheader_edge (cold_loop)->src))
+       cold_loop = outmost_loop;
+      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
+    }

could be instead written as

  coldest_loop = coldest_outermost_loop[loop->num];
  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
    return outermost_loop;
  return coldest_loop;

?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
It should be possible to compute such cache in a DFS walk of the loop tree
(the loop iterator by default visits in such order).

>  Pre-compute it in find_coldest_out_loop
> and save it also in lim_data with a new variable
> coldest_preheader_loop[outmost_loop][coldest_preheader_loop]?
> each call of find_coldest_out_loop will check whether that variable is set
> already, only continue the search if
> coldest_preheader_loop[outmost_loop][coldest_preheader_loop] is not set?
> Seems a bit complicated and not sure whether it helps to reduce
> compile-time complexity or I am misunderstanding...
>
>
> --
> Thanks,
> Xionghu

Re: [RFC] Don't move cold code out of loop by checking bb count

Reply via email to