Re: [RFC] Don't move cold code out of loop by checking bb count

Xionghu Luo via Gcc-patches Wed, 27 Oct 2021 18:50:10 -0700

On 2021/10/27 20:54, Jan Hubicka wrote:
>> Hi,
>>
>> On 2021/9/28 20:09, Richard Biener wrote:
>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <[email protected]> wrote:
>>>>
>>>> Update the patch to v3, not sure whether you prefer the paste style
>>>> and continue to link the previous thread as Segher dislikes this...
>>>>
>>>>
>>>> [PATCH v3] Don't move cold code out of loop by checking bb count
>>>>
>>>>
>>>> Changes:
>>>> 1. Handle max_loop in determine_max_movement instead of
>>>> outermost_invariant_loop.
>>>> 2. Remove unnecessary changes.
>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in 
>>>> can_sm_ref_p.
>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
>>>> infinite loop when implementing v1 and the iteration is missed to be
>>>> updated actually.
>>>>
>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
>>>>
>>>> There was a patch trying to avoid move cold block out of loop:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>
>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>> before_dom_children".
>>>>
>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
>>>> expected target loop, if profile count of the loop bb is colder
>>>> than target loop preheader, it won't be hoisted out of loop.
>>>> Likely for store motion, if all locations of the REF in loop is cold,
>>>> don't do store motion of it.
>>>>
>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>> on P8LE.
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>         * loop-invariant.c (find_invariants_bb): Check profile count
>>>>         before motion.
>>>>         (find_invariants_body): Add argument.
>>>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>>>>         (determine_max_movement): Use find_coldest_out_loop.
>>>>         (move_computations_worker): Adjust and fix iteration udpate.
>>>>         (execute_sm_exit): Check pointer validness.
>>>>         (class ref_in_loop_hot_body): New functor.
>>>>         (ref_in_loop_hot_body::operator): New.
>>>>         (can_sm_ref_p): Use for_all_locs_in_loop.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>>>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>>>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>>>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>>>> ---
>>>>  gcc/loop-invariant.c                       | 10 ++--
>>>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
>>>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
>>>>  7 files changed, 165 insertions(+), 8 deletions(-)
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>>>>
>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>>>> index fca0c2b24be..5c3be7bf0eb 100644
>>>> --- a/gcc/loop-invariant.c
>>>> +++ b/gcc/loop-invariant.c
>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool 
>>>> always_reached, bool always_executed)
>>>>     call.  */
>>>>
>>>>  static void
>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool 
>>>> always_executed)
>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>>>> +                   bool always_executed)
>>>>  {
>>>>    rtx_insn *insn;
>>>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>>>> +
>>>> +  if (preheader->count > bb->count)
>>>> +    return;
>>>>
>>>>    FOR_BB_INSNS (bb, insn)
>>>>      {
>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block 
>>>> *body,
>>>>    unsigned i;
>>>>
>>>>    for (i = 0; i < loop->num_nodes; i++)
>>>> -    find_invariants_bb (body[i],
>>>> -                       bitmap_bit_p (always_reached, i),
>>>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>>>>                         bitmap_bit_p (always_executed, i));
>>>>  }
>>>>
>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>>>> index 4b187c2cdaf..655fab03442 100644
>>>> --- a/gcc/tree-ssa-loop-im.c
>>>> +++ b/gcc/tree-ssa-loop-im.c
>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>>>>    return ret;
>>>>  }
>>>>
>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile 
>>>> count.  */
>>>> +
>>>> +static class loop *
>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>>>> +                      basic_block curr_bb)
>>>> +{
>>>> +  class loop *cold_loop, *min_loop;
>>>> +  cold_loop = min_loop = outmost_loop;
>>>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
>>>> +
>>>> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
>>>
>>> Honza - can you comment on whether we should compare BB counts this way?
>>>
>>> I would suspect that for, say,
>>>
>>>   for (...)
>>>      if (a)
>>>        X;
>>>      else
>>>        Y;
>>>
>>> that the counts for X and Y will be less than that of the preheader of the 
>>> loop
>>> only when the loop is estimated to run once.  That is, should we really 
>>> compare
>>> the to the preheader count or maybe better to the _header_ count which
>>> would keep the number of iterations out of the equation?
>>
>> I quickly tried to replace all the loop_preheader_edge (loop)->src with
>> loop_preheader_edge (loop)->dest, it will cause many failures in
>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
>> reasonable to compare the bb count with preheader count as both gimple lim
>> and RTL loop-invariant move instructions to *preheader* instead of *header*
>> after analysis?
> 
> I am not quite sure I understand what you shoot for.  But if you have
> loop invariant inside a loop nest and you get range of loops in the nest
> where you want to move it, you want to pick chepaer preheader count,
> since the statement is going to be executed there.
> 
> For
>>>   for (...)
>>>      if (a)
>>>        X;
>>>      else
>>>        Y;
> 
> You may have frequency of X less then preheader i.e. when probability
> that a is true is lower than the expected iteration count.
> 
> If I understand correctly, you want to compare sum of counts of all
> BBs where invariant evaulates currently to the minimal count of
> preheader where you can move it.
> 
> If you have
> 
> for A
>   for B
>     for C
>       invariant_computation
> 
> Usually you want to move it:
> 
> invariant_computation
> for A
>   for B
>     for C
> 
> However if for B usually iterates 0 times, it may happen that preheader
> of for C is executed less often then preheaders of for A/B and you want:
> 
> for A
>   for B
>     invariant_computation
>     for C

Thanks, this is what I am trying to do in both gimple lim and RTL 
loop-invariant motion. 

In gimple lim, the new added function find_coldest_out_loop[1] will check 
whether 
invariant_computation is hotter than C_preheader, if true, find a coldest
preheader from outermost nested loop, if B is the coldest, reset the 
outmost_loop
to B, which could avoid hoist cold statement to hot loops to reduce execution 
counts.
Gimple only change could improve 500.perlbench_r and 548.exchange2_r a bit[2].

RTL patch need only small check like below, it could improve performance
500.perlbench_r ~8% [2]for at least Power and aarch64.


[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580110.html


This is the other patch Richard and I expecting your review :)


>From 468e0b252a6b4a8b648c4a49850ed337ab5e03e1 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo <[email protected]>
Date: Fri, 8 Oct 2021 22:05:39 -0500
Subject: [PATCH v4 2/2] loop-invariant: Don't move cold bb instructions to 
preheader in RTL

gcc/ChangeLog:

        * loop-invariant.c (find_invariants_bb): Check profile count
        before motion.
        (find_invariants_body): Add argument.
---
 gcc/loop-invariant.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index fca0c2b24be..5c3be7bf0eb 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool 
always_reached, bool always_executed)
    call.  */
 
 static void
-find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
+find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
+                   bool always_executed)
 {
   rtx_insn *insn;
+  basic_block preheader = loop_preheader_edge (loop)->src;
+
+  if (preheader->count > bb->count)
+    return;
 
   FOR_BB_INSNS (bb, insn)
     {
@@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
   unsigned i;
 
   for (i = 0; i < loop->num_nodes; i++)
-    find_invariants_bb (body[i],
-                       bitmap_bit_p (always_reached, i),
+    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
                        bitmap_bit_p (always_executed, i));
 }
 
-- 
2.27.0.90.geebb51ba8c



>>
>>>
>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to
>>> compare a count to the "IPA hot", here we're comparing two counts
>>> within a function where it actually matters whether we use a<b or
>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
>>> function).
>>>
>>> Xionghu, you error on the side of not hoisting for unordered counts here
>>>
>>>> +    return NULL;
>>>> +
>>>> +  while (min_loop != loop)
>>>> +    {
>>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>>>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
>>>
>>> but in the other direction here and on the side of not hoisting
>>> in ref_in_loop_hot_body.
>>>
>>> The three-state relational operator overloads are probably not the
>>> very best idea...
>>> (see profile-count.h for them)
> 
> In first version of the patch I had
>  count1.known_le (count2)
> which however made the code to look quite ugly and eventually I
> convinced myself that three-state comparators are less pain than hard to
> read conditionals...
> 
> But i guess we can ensapsulate them when it makes code easier to read. I
> would be OK with having known_XY comparator variants in profile-count.h

I did add a new function to ensapsulate it in tree-ssa-loop-im.c, we could
replace it with known_le if needed:


+/* Compare the profile count inequality of bb and preheader, it is three-state
+   as stated in profile-count.h, FALSE is returned if inequality cannot be
+   decided.  */
+bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader)
+{
+  gcc_assert (bb && preheader);
+  return bb->count < preheader->count;
+}


> 
> Honza
> 

-- 
Thanks,
Xionghu
Re: [RFC] Don't move cold code out of loop by checking bb count

Reply via email to