On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com> wrote:
> >
> > GCC 12 enables vectorization at O2 with the very cheap cost model, which
> > is restricted to constant trip counts. Its vectorization capacity is very
> > limited considering the code-size impact.
> >
> > The patch extends the very cheap cost model a little bit to support
> > variable trip counts, but still disables peeling for gaps/alignment,
> > runtime alias checking, and epilogue vectorization out of code-size
> > considerations.
> >
> > So there are at most two versions of the loop for O2 vectorization: one
> > vectorized main loop and one scalar/remainder loop.
> >
> > E.g.
> >
> > void
> > foo1 (int* __restrict a, int* b, int* c, int n)
> > {
> >  for (int i = 0; i != n; i++)
> >   a[i] = b[i] + c[i];
> > }
> >
> > with -O2 -march=x86-64-v3, this is vectorized to:
> >
> > .L10:
> >         vmovdqu (%r8,%rax), %ymm0
> >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
> >         vmovdqu %ymm0, (%rdi,%rax)
> >         addq    $32, %rax
> >         cmpq    %rdx, %rax
> >         jne     .L10
> >         movl    %ecx, %eax
> >         andl    $-8, %eax
> >         cmpl    %eax, %ecx
> >         je      .L21
> >         vzeroupper
> > .L12:
> >         movl    (%r8,%rax,4), %edx
> >         addl    (%rsi,%rax,4), %edx
> >         movl    %edx, (%rdi,%rax,4)
> >         addq    $1, %rax
> >         cmpl    %eax, %ecx
> >         jne     .L12
> >
> > As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance
> > by 4.11% with an extra 2.8% code size, and the cheap cost model improves
> > performance by 5.74% with an extra 8.88% code size. The details are
> > below.
>
> I'm confused by this, are the N-Iter numbers on top of the cheap cost
> model numbers?
No, it's N-Iter vs. base (the very cheap cost model) and cheap vs. base.
>
> > Performance measured with -march=x86-64-v3 -O2 on EMR
> >
> >                     N-Iter      cheap cost model
> > 500.perlbench_r     -0.12%      -0.12%
> > 502.gcc_r           0.44%       -0.11%
> > 505.mcf_r           0.17%       4.46%
> > 520.omnetpp_r       0.28%       -0.27%
> > 523.xalancbmk_r     0.00%       5.93%
> > 525.x264_r          -0.09%      23.53%
> > 531.deepsjeng_r     0.19%       0.00%
> > 541.leela_r         0.22%       0.00%
> > 548.exchange2_r     -11.54%     -22.34%
> > 557.xz_r            0.74%       0.49%
> > GEOMEAN INT         -1.04%      0.60%
> >
> > 503.bwaves_r        3.13%       4.72%
> > 507.cactuBSSN_r     1.17%       0.29%
> > 508.namd_r          0.39%       6.87%
> > 510.parest_r        3.14%       8.52%
> > 511.povray_r        0.10%       -0.20%
> > 519.lbm_r           -0.68%      10.14%
> > 521.wrf_r           68.20%      76.73%
>
> So this seems to regress as well?
N-Iter improves performance less than the cheap cost model does; that's
expected, it is not a regression.
>
> > 526.blender_r       0.12%       0.12%
> > 527.cam4_r          19.67%      23.21%
> > 538.imagick_r       0.12%       0.24%
> > 544.nab_r           0.63%       0.53%
> > 549.fotonik3d_r     14.44%      9.43%
> > 554.roms_r          12.39%      0.00%
> > GEOMEAN FP          8.26%       9.41%
> > GEOMEAN ALL         4.11%       5.74%
> >
> > Code size impact
> >                     N-Iter      cheap cost model
> > 500.perlbench_r     0.22%       1.03%
> > 502.gcc_r           0.25%       0.60%
> > 505.mcf_r           0.00%       32.07%
> > 520.omnetpp_r       0.09%       0.31%
> > 523.xalancbmk_r     0.08%       1.86%
> > 525.x264_r          0.75%       7.96%
> > 531.deepsjeng_r     0.72%       3.28%
> > 541.leela_r         0.18%       0.75%
> > 548.exchange2_r     8.29%       12.19%
> > 557.xz_r            0.40%       0.60%
> > GEOMEAN INT         1.07%       5.71%
> >
> > 503.bwaves_r        12.89%      21.59%
> > 507.cactuBSSN_r     0.90%       20.19%
> > 508.namd_r          0.77%       14.75%
> > 510.parest_r        0.91%       3.91%
> > 511.povray_r        0.45%       4.08%
> > 519.lbm_r           0.00%       0.00%
> > 521.wrf_r           5.97%       12.79%
> > 526.blender_r       0.49%       3.84%
> > 527.cam4_r          1.39%       3.28%
> > 538.imagick_r       1.86%       7.78%
> > 544.nab_r           0.41%       3.00%
> > 549.fotonik3d_r     25.50%      47.47%
> > 554.roms_r          5.17%       13.01%
> > GEOMEAN FP          4.14%       11.38%
> > GEOMEAN ALL         2.80%       8.88%
> >
> >
> > The only regression is in 548.exchange2_r: vectorization of the inner
> > loop in each layer of the 9-level loop nest increases register pressure
> > and causes more spills.
> > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> >   - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> >     .....
> >         - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> >     ...
> > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> >
> > Looks like aarch64 doesn't have this issue because aarch64 has 32 GPRs
> > while x86 only has 16. I have an extra patch that prevents loop
> > vectorization in deeply nested loops for the x86 backend, which brings
> > the performance back.
> >
> > For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost
> > model increases code size a lot but improves performance little, and
> > N-Iter does much better there on code size.
> >
> >
> > Any comments?
> >
> >
> > gcc/ChangeLog:
> >
> >         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
> >         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
> >         cost model.
> >         (vect_analyze_loop): Disable epilogue vectorization in very
> >         cheap cost model.
> > ---
> >  gcc/tree-vect-loop.cc | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 242d5e2d916..06afd8cae79 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> >       a copy of the scalar code (even if we might be able to vectorize it).  */
> >    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> >        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> > +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
>
> I notice that we should probably not call vect_enhance_data_refs_alignment
> because when alignment peeling is optional we should avoid it rather than
> disabling the vectorization completely.
>
> Also if you allow peeling for niter then there's no good reason to not
> allow peeling for gaps (or any other epilogue peeling).
Maybe, I just want to be conservative.
>
> The extra cost for niter peeling is a runtime check before the loop, which
> would also happen (plus keeping the scalar copy) when there's a runtime
> cost check.  That also means versioning for alias/alignment could be
> allowed if it shares the scalar loop with the epilogue (I don't remember
> the constraints we set in place for the sharing).
Yes, but in current GCC the runtime alias check creates a separate scalar loop:
https://godbolt.org/z/9seoWePKK
And enabling the runtime alias check could increase code size too much
without any performance improvement.

>
> Richard.
>
> >      {
> >        if (dump_enabled_p ())
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> >                            /* No code motion support for multiple epilogues so for now
> >                               not supported when multiple exits.  */
> >                          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> > -                        && !loop->simduid);
> > +                        && !loop->simduid
> > +                        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
> >    if (!vect_epilogues)
> >      return first_loop_vinfo;
> >
> > --
> > 2.31.1
> >



-- 
BR,
Hongtao
