On 2021/9/22 17:14, Richard Biener wrote:
On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luo...@linux.ibm.com> wrote:



On 2021/8/26 19:33, Richard Biener wrote:
On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luo...@linux.ibm.com> wrote:

Hi,

On 2021/8/6 20:15, Richard Biener wrote:
On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luo...@linux.ibm.com> wrote:

There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

This patch does this profile count check in both gimple LIM
move_computations_worker and RTL loop-invariant.c find_invariants_bb,
if the loop bb is colder than loop preheader, don't hoist it out of
loop.

Also, the profile count in loop split pass should be corrected to avoid
lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
move statement out of loop unexpectely when lim2 didn't move it.  This
change could fix regression on 544.nab_r from -1.55% to +0.46%.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

Regression and bootstrap tested pass on P8LE, any comments?  Thanks.

While I'm not familiar with the RTL invariant motion pass the patch there
looks reasonable.  Note that we should assess the profile quality
somehow - I'm not sure how to do that, CCed Honza for that.

Thanks.


For the GIMPLE part the patch looks quite complicated - but note it
probably has to be since LIM performs kind of a "CSE" on loads
(and stores for store-motion), so when there are multiple stmts
affected by a hoisting decision the biggest block count has to be
accounted.  Likewise when there are dependent stmts involved
that might include conditional stmts (a "PHI"), but the overall
cost should be looked at.

Currently, The gimple code check two situations with the patch:
1) The statement or PHI‘s BB is *colder* then preheader, don't move it out
of loop;
2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
couldn't be moved out of loop, also don't move it out of loop to avoid 
definition
not dominates use error.

But part 2) is obviously already done.  What I tried to say is your heuristic
doesn't integrate nicely with the pass but I admitted that it might be a bit
difficult to find a place to add this heuristic.

There is lim_data->cost which we could bias negatively but then this is
a cost that is independent on the hoisting distance.  But doing this would
work at least for the case where the immediately enclosing loop preheader
is hotter than the stmt and with this it would be a patch that's similarly
simple as the RTL one.

Another possibility is to simply only adjust PHI processing in
compute_invariantness, capping movement according to the hotness
heuristic.  The same could be done for regular stmts there but I'm
not sure that will do good in the end since this function is supposed
to compute "correctness" (well, it also has the cost stuff), and it's
not the place to do overall cost considerations.

Thanks.  I found that adding a function find_coldest_out_loop and check it in
outermost_invariant_loop to find the coldest invariant loop between outermost
loop and itself could also reach the purpose.  Then the gimple code check is
redundant and could be removed.


May be I could collect the number of instructions not hoisted with the patch
on regression tests and SPEC2017 to do a estimation for "multiple stmts 
affected"
and "overall cost" need to be considered?  But it seems move_computations_worker
couldn't rollback if we still want to hoist multiple stmts out during the 
iterations?


Now - GIMPLE LIM "costing" is somewhat backward right now
and it isn't set up to consider those multiple involved stmts.  Plus
the store-motion part does not have any cost part (but it depends
on previously decided invariant motions).

I think the way you implemented the check will cause no hoisting
to be performed instead of, say, hoisting to a different loop level
only.  Possibly shown when you consider a loop nest like

     for (;;)
       if (unlikely_cond)
         for (;;)
            invariant;

we want to hoist 'invariant' but only from the inner loop even if it
is invariant also in the outer loop.


For this case, theorotically I think the master GCC will optimize it to:

    invariant;
    for (;;)
      if (unlikely_cond)
        for (;;)
           ;

'invariant' is moved out of outer loop, but with the patch, it will get:

    for (;;)
      if (unlikely_cond)
        {
          invariant;
          for (;;)
             ;
        }

'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
so hoist it out of inner loop, this is exactly what we want, right?

Yes.  I had doubts your patch would achieve that.



The below updated patch could achieve it:


There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis,  add find_coldest_out_loop to move invariants to
expected target loop, then  if profile count of the loop bb is colder
than target loop preheader, it won't be hoisted out of loop.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

Regression and bootstrap tested pass on P8LE, any comments?  Thanks.

Can you split the RTL and GIMPLE changes and measure them separately
please?

I did that before and got below data, it is slightly different due to
using ratio instead of seconds, 500.perlbench_r obviously benefits
from the RTL part change, while gimple part only improves exchange2
and blender, with a regression on nab which requires the fix of loop
split,

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576566.html

Reason is lim2 doesn't hoist code out of loop, but loop split generates
duplicated loop with incorrect profile count on preheader and header bb,
then later lim4 hoists code out of loop to unexpected place.  Same loop
shows mismatch behavior in lim2 and lim4.  With that patch, the regression
is gone.


              Gimple+RTL |  Gimple lim | RTL loop-invariant
500.perlbench_r     8.03%    0.67%    7.69%
502.gcc_r           0.56%    0.37%    0.19%
505.mcf_r           0.19%   -0.19%    0.39%
520.omnetpp_r       0.83%    0.83%    0.83%
523.xalancbmk_r    -0.78%    0.00%   -1.04%
525.x264_r          0.17%    0.00%    0.00%
531.deepsjeng_r     0.00%    0.31%    0.00%
541.leela_r         0.00%   -0.31%    0.31%
548.exchange2_r     2.08%    1.85%    0.23%
557.xz_r            0.97%    0.00%    0.65%
503.bwaves_r       -0.12%    0.00%   -0.23%
507.cactuBSSN_r     0.00%    0.14%    0.00%
508.namd_r          0.00%    0.00%    0.00%
510.parest_r       -0.16%   -0.65%    0.00%
511.povray_r        0.30%    0.91%    0.91%
519.lbm_r           0.15%    0.00%    0.00%
521.wrf_r           0.00%    0.00%   -0.80%
526.blender_r       1.84%    0.26%    0.52%
527.cam4_r          0.28%    0.00%    0.00%
538.imagick_r       0.20%    0.00%    0.00%
544.nab_r          -1.55%   -0.78%    0.00%
549.fotonik3d_r    -0.25%    0.00%    0.00%
554.roms_r         -0.84%    0.00%   -0.63%
INT GEAMEAN         1.16%    0.35%    0.90%
FLOAT GEOMEAN      -0.01%   -0.01%   -0.02%
GEOMEAN             0.50%    0.15%    0.38%


Will address other comments in later reply.  Thanks.


--
Thanks,
Xionghu

Reply via email to