On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com> wrote:
> > >
> > > GCC 12 enabled vectorization at -O2 with the very cheap cost model,
> > > which is restricted to constant trip counts. Its vectorization
> > > capacity is very limited, with consideration of the codesize impact.
> > >
> > > This patch extends the very cheap cost model a little bit to support
> > > variable trip counts, but still disables peeling for gaps/alignment,
> > > runtime alias checking and epilogue vectorization, again with
> > > codesize in mind.
> > >
> > > So there are at most two versions of a loop under -O2 vectorization:
> > > one vectorized main loop and one scalar/remainder loop.
> > >
> > > E.g.
> > >
> > > void
> > > foo1 (int* __restrict a, int* b, int* c, int n)
> > > {
> > >  for (int i = 0; i != n; i++)
> > >   a[i] = b[i] + c[i];
> > > }
> > >
> > > with -O2 -march=x86-64-v3, is vectorized to
> > >
> > > .L10:
> > >         vmovdqu (%r8,%rax), %ymm0
> > >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > >         vmovdqu %ymm0, (%rdi,%rax)
> > >         addq    $32, %rax
> > >         cmpq    %rdx, %rax
> > >         jne     .L10
> > >         movl    %ecx, %eax
> > >         andl    $-8, %eax
> > >         cmpl    %eax, %ecx
> > >         je      .L21
> > >         vzeroupper
> > > .L12:
> > >         movl    (%r8,%rax,4), %edx
> > >         addl    (%rsi,%rax,4), %edx
> > >         movl    %edx, (%rdi,%rax,4)
> > >         addq    $1, %rax
> > >         cmpl    %eax, %ecx
> > >         jne     .L12
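
For reference, the assembly above corresponds roughly to this C-level
structure (an illustrative sketch of the two-loop shape, not compiler
output; the name foo1_sketch is made up):

```c
/* Illustrative sketch of the loop structure the vectorizer emits for
   foo1: a vectorized main loop covering n rounded down to a multiple
   of 8 (8 ints per 256-bit ymm register with -march=x86-64-v3), then
   a scalar remainder loop.  Not actual compiler output.  */
void
foo1_sketch (int *__restrict a, int *b, int *c, int n)
{
  int i = 0;
  int vec_bound = n & ~7;       /* matches the "andl $-8, %eax" above  */
  for (; i < vec_bound; i += 8) /* vectorized main loop (.L10, vpaddd) */
    for (int j = 0; j < 8; j++)
      a[i + j] = b[i + j] + c[i + j];
  for (; i < n; i++)            /* scalar remainder loop (.L12)        */
    a[i] = b[i] + c[i];
}
```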
> > >
> > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves
> > > performance by 4.11% with an extra 2.8% codesize, while the cheap
> > > cost model improves performance by 5.74% with an extra 8.88%
> > > codesize. The details are below.
> >
> > I'm confused by this; are the N-Iter numbers on top of the cheap cost
> > model numbers?
> No, it's N-Iter vs. base (very cheap cost model), and cheap vs. base.
> >
> > > Performance measured with -march=x86-64-v3 -O2 on EMR
> > >
> > >                     N-Iter      cheap cost model
> > > 500.perlbench_r     -0.12%      -0.12%
> > > 502.gcc_r           0.44%       -0.11%
> > > 505.mcf_r           0.17%       4.46%
> > > 520.omnetpp_r       0.28%       -0.27%
> > > 523.xalancbmk_r     0.00%       5.93%
> > > 525.x264_r          -0.09%      23.53%
> > > 531.deepsjeng_r     0.19%       0.00%
> > > 541.leela_r         0.22%       0.00%
> > > 548.exchange2_r     -11.54%     -22.34%
> > > 557.xz_r            0.74%       0.49%
> > > GEOMEAN INT         -1.04%      0.60%
> > >
> > > 503.bwaves_r        3.13%       4.72%
> > > 507.cactuBSSN_r     1.17%       0.29%
> > > 508.namd_r          0.39%       6.87%
> > > 510.parest_r        3.14%       8.52%
> > > 511.povray_r        0.10%       -0.20%
> > > 519.lbm_r           -0.68%      10.14%
> > > 521.wrf_r           68.20%      76.73%
> >
> > So this seems to regress as well?
> N-Iter improves performance less than the cheap cost model; that's
> expected, it is not a regression.
> >
> > > 526.blender_r       0.12%       0.12%
> > > 527.cam4_r          19.67%      23.21%
> > > 538.imagick_r       0.12%       0.24%
> > > 544.nab_r           0.63%       0.53%
> > > 549.fotonik3d_r     14.44%      9.43%
> > > 554.roms_r          12.39%      0.00%
> > > GEOMEAN FP          8.26%       9.41%
> > > GEOMEAN ALL         4.11%       5.74%

I've tested the patch on aarch64; it shows a similar improvement with
little codesize increase.
I haven't tested it on other backends, but I'd expect similarly good
improvements there.
> > >
> > > Codesize impact
> > >                     N-Iter      cheap cost model
> > > 500.perlbench_r     0.22%       1.03%
> > > 502.gcc_r           0.25%       0.60%
> > > 505.mcf_r           0.00%       32.07%
> > > 520.omnetpp_r       0.09%       0.31%
> > > 523.xalancbmk_r     0.08%       1.86%
> > > 525.x264_r          0.75%       7.96%
> > > 531.deepsjeng_r     0.72%       3.28%
> > > 541.leela_r         0.18%       0.75%
> > > 548.exchange2_r     8.29%       12.19%
> > > 557.xz_r            0.40%       0.60%
> > > GEOMEAN INT         1.07%       5.71%
> > >
> > > 503.bwaves_r        12.89%      21.59%
> > > 507.cactuBSSN_r     0.90%       20.19%
> > > 508.namd_r          0.77%       14.75%
> > > 510.parest_r        0.91%       3.91%
> > > 511.povray_r        0.45%       4.08%
> > > 519.lbm_r           0.00%       0.00%
> > > 521.wrf_r           5.97%       12.79%
> > > 526.blender_r       0.49%       3.84%
> > > 527.cam4_r          1.39%       3.28%
> > > 538.imagick_r       1.86%       7.78%
> > > 544.nab_r           0.41%       3.00%
> > > 549.fotonik3d_r     25.50%      47.47%
> > > 554.roms_r          5.17%       13.01%
> > > GEOMEAN FP          4.14%       11.38%
> > > GEOMEAN ALL         2.80%       8.88%
> > >
> > >
> > > The only regression is in 548.exchange2_r: vectorizing the inner
> > > loop at each level of its 9-level loop nest increases register
> > > pressure and causes more spills.
> > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > >   - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > >     .....
> > >         - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> > >     ...
> > > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > >
> > > aarch64 doesn't seem to have this issue because it has 32 GPRs,
> > > while x86 only has 16. I have an extra patch that prevents loop
> > > vectorization in deeply nested loops for the x86 backend, which
> > > brings the performance back.
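
A reduced, hypothetical C analog of that pattern (the real exchange2
nest is 9 levels deep and written in Fortran; names and bounds here are
made up for illustration):

```c
/* Hypothetical C analog of exchange2's nested block updates, shown
   with 3 levels instead of 9.  Each nesting level keeps its own
   induction variables and base addresses live across the vectorizable
   innermost loops, so with x86-64's 16 GPRs the vectorized code runs
   out of registers and spills.  */
void
update_blocks (int block[3][16])
{
  for (int i1 = 0; i1 < 4; i1++)
    {
      for (int j = 0; j < 16; j++)      /* level-1 update, vectorizable */
        block[0][j] += 10;
      for (int i2 = 0; i2 < 4; i2++)
        {
          for (int j = 0; j < 16; j++)  /* level-2 update */
            block[1][j] += 10;
          for (int j = 0; j < 16; j++)  /* level-3 (innermost) update */
            block[2][j] += 10;
          for (int j = 0; j < 16; j++)  /* level-2 update on the way out */
            block[1][j] += 10;
        }
      for (int j = 0; j < 16; j++)      /* level-1 update on the way out */
        block[0][j] += 10;
    }
}
```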
> > >
> > > For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap
> > > cost model increases codesize a lot with little or no performance
> > > gain; N-Iter does much better there on codesize.
> > >
> > >
> > > Any comments?
> > >
> > >
> > > gcc/ChangeLog:
> > >
> > >         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
> > >         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
> > >         cost model.
> > >         (vect_analyze_loop): Disable epilogue vectorization in very
> > >         cheap cost model.
> > > ---
> > >  gcc/tree-vect-loop.cc | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > index 242d5e2d916..06afd8cae79 100644
> > > --- a/gcc/tree-vect-loop.cc
> > > +++ b/gcc/tree-vect-loop.cc
> > > @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > >       a copy of the scalar code (even if we might be able to vectorize it).  */
> > >    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> > >        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > > -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> > > +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
> >
> > I notice that we should probably not call
> > vect_enhance_data_refs_alignment, because when alignment peeling is
> > optional we should avoid it rather than disable vectorization
> > completely.
> >
> > Also, if you allow peeling for niter then there's no good reason not
> > to allow peeling for gaps (or any other epilogue peeling).
> Maybe, I just want to be conservative.
> >
> > The extra cost for niter peeling is a runtime check before the loop,
> > which would also happen (plus keeping the scalar copy) when there's a
> > runtime cost check.  That also means versioning for alias/alignment
> > could be allowed if it shares the scalar loop with the epilogue (I
> > don't remember the constraints we set in place for the sharing).
> Yes, but current GCC creates a separate scalar loop for the runtime
> alias check:
> https://godbolt.org/z/9seoWePKK
> Enabling the runtime alias check could thus increase codesize a lot
> without any performance improvement.
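
For reference, the versioned code for a loop like foo1 without
__restrict is conceptually the following (an illustrative sketch using
uintptr_t compares; the godbolt link above shows GCC's actual output,
and the names foo2_versioned/no_overlap are made up):

```c
#include <stdint.h>

/* Sketch of runtime alias versioning: an overlap check before the
   loop, a vectorizable fast path, and a *separate* scalar fallback
   loop -- the extra codesize discussed above.  Not actual GCC output.  */
static int
no_overlap (const int *p, const int *q, int n)
{
  uintptr_t pb = (uintptr_t) p, qb = (uintptr_t) q;
  uintptr_t bytes = (uintptr_t) n * sizeof (int);
  return pb + bytes <= qb || qb + bytes <= pb;
}

void
foo2_versioned (int *a, int *b, int *c, int n)
{
  if (no_overlap (a, b, n) && no_overlap (a, c, n))
    {
      /* Fast path: safe to vectorize (main loop + scalar remainder).  */
      for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    }
  else
    {
      /* Scalar fallback kept as a separate copy of the loop.  */
      for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    }
}
```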
>
> >
> > Richard.
> >
> > >      {
> > >        if (dump_enabled_p ())
> > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> > >                            /* No code motion support for multiple epilogues so for now
> > >                               not supported when multiple exits.  */
> > >                          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> > > -                        && !loop->simduid);
> > > +                        && !loop->simduid
> > > +                        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
> > >    if (!vect_epilogues)
> > >      return first_loop_vinfo;
> > >
> > > --
> > > 2.31.1
> > >
>
>
>
> --
> BR,
> Hongtao



--
BR,
Hongtao
