RE: [PATCH] avoid using masked vector epilogues when no scalar epilog is needed

Liu, Hongtao Mon, 17 Nov 2025 00:19:39 -0800

> -----Original Message-----
> From: Richard Biener <[email protected]>
> Sent: Monday, November 17, 2025 4:18 PM
> To: Hongtao Liu <[email protected]>
> Cc: [email protected]; Jan Hubicka <[email protected]>; Liu, Hongtao
> <[email protected]>
> Subject: Re: [PATCH] avoid using masked vector epilogues when no scalar
> epilog is needed
> 
> On Mon, 17 Nov 2025, Hongtao Liu wrote:
> 
> > On Fri, Nov 14, 2025 at 6:04 PM Richard Biener <[email protected]> wrote:
> > >
> > > The following arranges for avoiding masked vector epilogues when
> > > we'll eventually arrive at a vector epilogue with VF == 1 which
> > > implies no scalar epilog will be necessary.
> > >
> > > This avoids regressing performance in OpenColorIO when the
> > > avx512_masked_epilogues tuning is enabled.  A testcase for one
> > > example case is shown in PR122573.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.  The testcase
> > > depends on the SLP patch posted earlier.
> > >
> > > OK for trunk?
> > >
> > > Thanks,
> > > Richard.
> > >
> > >         PR tree-optimization/122573
> > >         * config/i386/i386.cc (ix86_vector_costs::finish_cost): Avoid
> > >         using masked epilogues when an SSE epilogue would have a VF of one.
> > >
> > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr122573.c: New testcase.
> > > ---
> > >  gcc/config/i386/i386.cc                       |  5 ++++
> > >  .../costmodel/x86_64/costmodel-pr122573.c     | 30 +++++++++++++++++++
> > >  2 files changed, 35 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr122573.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 6b6febc8870..8aac0820bc2 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -26609,6 +26609,11 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > >    if (loop_vinfo
> > >        && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> > >        && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () > 2
> > > +      /* Avoid a masked epilog if cascaded epilogues eventually get us
> > > +        to one with VF 1 as that means no scalar epilog at all.  */
> > > +      && !((GET_MODE_SIZE (loop_vinfo->vector_mode)
> > > +           / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () == 16)
> >
> > So LOOP_VINFO_VECT_FACTOR is the "unroll factor" of the loop to be
> > vectorized, and it's not always equal to TYPE_VECTOR_SUBPARTS.
> > Makes sense.
> 
> Yes.  Is that an OK?
LGTM.
> 
> Thanks,
> Richard.
> 
> > > +          && ix86_tune_features[X86_TUNE_AVX512_TWO_EPILOGUES])
> > >        && ix86_tune_features[X86_TUNE_AVX512_MASKED_EPILOGUES]
> > >        && !OPTION_SET_P (param_vect_partial_vector_usage))
> > >      {
> > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr122573.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr122573.c
> > > new file mode 100644
> > > index 00000000000..ca3294dca7a
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr122573.c
> > > @@ -0,0 +1,30 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-march=znver5" } */
> > > +
> > > +struct S {
> > > +    float m_col1[4];
> > > +    float m_col2[4];
> > > +    float m_col3[4];
> > > +    float m_col4[4];
> > > +};
> > > +
> > > +void apply(struct S *s, const float *in, float *out, long numPixels) {
> > > +  for (long idx = 0; idx < numPixels; ++idx)
> > > +    {
> > > +      const float r = in[0];
> > > +      const float g = in[1];
> > > +      const float b = in[2];
> > > +      const float a = in[3];
> > > +      out[0] = r*s->m_col1[0] + g*s->m_col2[0] + b*s->m_col3[0] + a*s->m_col4[0];
> > > +      out[1] = r*s->m_col1[1] + g*s->m_col2[1] + b*s->m_col3[1] + a*s->m_col4[1];
> > > +      out[2] = r*s->m_col1[2] + g*s->m_col2[2] + b*s->m_col3[2] + a*s->m_col4[2];
> > > +      out[3] = r*s->m_col1[3] + g*s->m_col2[3] + b*s->m_col3[3] + a*s->m_col4[3];
> > > +      in  += 4;
> > > +      out += 4;
> > > +    }
> > > +}
> > > +
> > > +/* Check that we do not use a masked epilog but a SSE one with VF 1
> > > +   (and possibly a AVX2 one as well).  */
> > > +/* { dg-final { scan-tree-dump "optimized: epilogue loop vectorized using 16 byte vectors and unroll factor 1" "vect" } } */
> > > --
> > > 2.51.0
> >
> >
> >
> >
> 
> --
> Richard Biener <[email protected]>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> Nuernberg)
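
For anyone following the VF arithmetic in the quoted condition, here is a minimal sketch in plain C. It is not GCC code: `sse_epilogue_has_vf_one` and `epilogue_vf` are hypothetical helpers, and the idea that VF scales linearly with the vector size across cascaded epilogues is an assumption drawn from the testcase, where one scalar iteration covers 4 floats (16 bytes).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the size/VF test added by the patch (not GCC code).
   vector_bytes stands in for GET_MODE_SIZE (loop_vinfo->vector_mode) and
   vf for LOOP_VINFO_VECT_FACTOR.  If each scalar iteration fills exactly
   16 bytes of the vector, a 16-byte (SSE) epilogue would have VF 1.  */
static bool sse_epilogue_has_vf_one(unsigned vector_bytes, unsigned vf)
{
    assert(vf > 0);
    return vector_bytes / vf == 16;
}

/* VF of an epilogue that uses epilogue_bytes-wide vectors, assuming the
   VF shrinks in proportion to the vector size (an assumption, see above).  */
static unsigned epilogue_vf(unsigned main_bytes, unsigned main_vf,
                            unsigned epilogue_bytes)
{
    assert(main_bytes > 0);
    return main_vf * epilogue_bytes / main_bytes;
}
```

With 64-byte vectors and VF 4, as the testcase would suggest for -march=znver5, `sse_epilogue_has_vf_one(64, 4)` is true and the modelled cascade is VF 4, then 2, then 1, so no scalar iterations are left over. For a loop with one float per iteration (VF 16 at 64 bytes) the check is false and the masked epilogue remains a candidate.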