On Fri, Sep 5, 2025 at 11:26 AM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Fri, Sep 5, 2025 at 10:11 AM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Fri, Sep 5, 2025 at 3:07 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Fri, Sep 5, 2025 at 4:54 AM liuhongt <hongtao....@intel.com> wrote:
> > > >
> > > > SLP may treat a broadcast as a kind of vec_perm; the patch checks the
> > > > permutation index to exclude those false positives.
> > > >
> > > > > Btw, you can now (in some cases) do better, namely you should
> > > > > always have 'node' available and when SLP_TREE_PERMUTE_P (node)
> > > > > then SLP_TREE_LANE_PERMUTATION could be inspected to
> > > > > detect the harmful cross-lane permutes.  Note BB vectorization
> > > > > still (always IIRC) uses SLP_TREE_LOAD_PERMUTATION,
> > > > > so for permuted loads you have a load 'node' and the permutation
> > > > > applied is visible in SLP_TREE_LOAD_PERMUTATION (which is
> > > > > a simpler data structure).  That said, BB vectorization loads
> > > > > could have harmful AVX2 permutes attached, so the patch is
> > > > > maybe a bit overzealous.
> > > > Changed.
> > > >
> > > > If we know it's a broadcast or an in-lane permutation, then don't count
> > > > it into m_num_avx256_vec_perm, and also exclude the count == 0 vec_perm,
> > > > which is BIT_FIELD_REF <vect_**, 64, 0> and will be optimized to a
> > > > broadcast or optimized away?
> > > >
> > > > There are adjustments for the backend to generate vpermilpd/vpermilps
> > > > instead of vpermd/vpermq when the permutation is in-lane; that part is
> > > > in the second patch.
> > > >
> > > > I noticed that several benchmarks benefit from this patch.
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > Any comments?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > > >         Check permutation index for vec_perm, don't count it if we
> > > >         know it's not a cross-lane permutation.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.target/i386/avx256_avoid_vec_perm.c: Adjust testcase.
> > > >         * gcc.target/i386/avx256_avoid_vec_perm-2.c: New test.
> > > > ---
> > > >  gcc/config/i386/i386.cc                       | 60 ++++++++++++++++++-
> > > >  .../gcc.target/i386/avx256_avoid_vec_perm-2.c | 21 +++++++
> > > >  .../gcc.target/i386/avx256_avoid_vec_perm-5.c | 24 ++++++++
> > > >  .../gcc.target/i386/avx256_avoid_vec_perm.c   |  2 +-
> > > >  4 files changed, 104 insertions(+), 3 deletions(-)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index 55c9b16dd38..62779331774 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -26237,8 +26237,64 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > >      stmt_cost = ix86_default_vector_cost (kind, mode);
> > > >
> > > >    if (kind == vec_perm && vectype
> > > > -      && GET_MODE_SIZE (TYPE_MODE (vectype)) == 32)
> > > > -    m_num_avx256_vec_perm[where]++;
> > > > +      && GET_MODE_SIZE (TYPE_MODE (vectype)) == 32
> > > > +      /* BIT_FIELD_REF <vect_**, 64, 0> 0 times vec_perm costs 0 in body.  */
> > > > +      && count != 0)
> > >
> > > so the vectorizer costs something with count == 0?  I'll see to fix that,
> > > but this also means the code should have used
> > > m_num_avx256_vec_perm[where] += count.
> > >
> > > > +    {
> > > > +      bool real_perm = true;
> > > > +      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype);
> > > > +
> > > > +      if (node
> > > > +         && SLP_TREE_LOAD_PERMUTATION (node).exists ()
> > > > +         /* Loop vectorization will have 4 times vec_perm
> > > > +            with index as {0, 0, 0, 0}.
> > > > +            But it actually generates
> > > > +            vec_perm_expr <vect, vect, 0, 0, 0, 0>
> > > > +            vec_perm_expr <vect, vect, 1, 1, 1, 1>
> > > > +            vec_perm_expr <vect, vect, 2, 2, 2, 2>
> > > > +            vec_perm_expr <vect, vect, 3, 3, 3, 3>  */
> > > > +         && (!dyn_cast<loop_vec_info> (m_vinfo)
> > > > +             || count * nunits == SLP_TREE_LANES (node)))
> > >
> > > Indeed the actual permutes are not what is recorded in the load-permutation.
> > > Instead it's the sequence of lanes to be "compressed" after loading
> > > contiguously from the DR group (or strided from the DR group).  I'll note
> > > that you should almost never see load-permutation when doing loop
> > > vectorization (remaining ones are "legacy"), and for loop vectorization
> > > a { 0 } load permutation would not be a broadcast, instead when the
> > > group size is 2 and the mode is, say, V8SI it would be {0, 2, 4, 6, 8,
> > > 10, 12, 14 }.
> > > But as said, you shouldn't see those for loop_vect.  For BB vect this
> > > complication does not exist since VF == 1 there (no unrolling).
> > >
> > > So - may I suggest splitting this into loop and BB vect parts, separating
> > > the
> > > two?  For loop vect I'd instead handle SLP_TREE_PERMUTE_P nodes
> > > which have SLP_TREE_LANE_PERMUTATION.  All lane permutations
> > > with SLP_TREE_LANES < nunits are necessarily cross-lane for loops,
> > > a splat there has % nunits == 0 lanes and all source indexes the same.
> > > There are also "good" permutes that have more than one source,
> > > those degenerating to blends.  That said, SLP_TREE_PERMUTE_P
> > > can have up to N input vectors, the SLP_TREE_LOAD_PERMUTATION
> > > permutes one or two inputs from the same vector memory load.
> >
> > Something like below?  We can still share the common code to detect
> > in-lane permutations.
> > But it looks like it fails to detect the in-lane permutation in
> > avx256_avoid_vec_perm-3.c since SLP_TREE_PERMUTE_P (node) is false.
>
> Yeah, that's a case we do not lower to SLP_TREE_PERMUTE_P yet ...
>
> >        if (node
> >           && SLP_TREE_LOAD_PERMUTATION (node).exists ()
> > -         /* Loop vectorization will have 4 times vec_perm
> > -            with index as {0, 0, 0, 0}.
> > -            But it actually generates
> > -            vec_perm_expr <vect, vect, 0, 0, 0, 0>
> > -            vec_perm_expr <vect, vect, 1, 1, 1, 1>
> > -            vec_perm_expr <vect, vect, 2, 2, 2, 2>
> > -            vec_perm_expr <vect, vect, 3, 3, 3, 3>  */
> >           && (!dyn_cast<loop_vec_info> (m_vinfo)
> > -             || count * nunits == SLP_TREE_LANES (node)))
> > +             || (SLP_TREE_PERMUTE_P (node)
> > +                 && SLP_TREE_LANES (node) % nunits == 0)))
>
> Note SLP_TREE_PERMUTE_P nodes never have SLP_TREE_LOAD_PERMUTATION,
> as said they have SLP_TREE_LANE_PERMUTATION instead.  So for the
> (legacy) load permutation
> case it would be
>
>   && (is_a <bb_vec_info> (m_vinfo)
>          || SLP_TREE_LANES (node) % nunits == 0)
>
> the case of SLP_TREE_PERMUTE_P would need to be added separately,
> but those are also costed as kind == vec_perm.  A common use case was
> blends, but now that we lower most load permutations to explicit
> SLP permute nodes there are also those when vectorizing loops.
>
> I guess it's reasonable to first handle SLP_TREE_LOAD_PERMUTATION,
> the other case could be done as followup.

Oh, and I'll note we of course compute the actual VEC_PERM GIMPLE stmt
permutes we'll emit during analysis (and re-compute the same during transform).
It would be an improvement to record these (and possibly re-use them during
transform); that would make them available to backend costing as well.  I'll
make a mental note to see how difficult that is to pull off.

Richard.

> > >
> > > > +       {
> > > > +         unsigned half = nunits / 2;
> > > > +         unsigned i = 0;
> > > > +         bool allsame = true;
> > > > +         unsigned first = SLP_TREE_LOAD_PERMUTATION (node)[0];
> > > > +         bool cross_lane_p = false;
> > > > +         for (i = 0 ; i != SLP_TREE_LANES (node); i++)
> > > > +           {
> > > > +             unsigned tmp = SLP_TREE_LOAD_PERMUTATION (node)[i];
> > > > +             /* allsame is just a broadcast.  */
> > > > +             if (tmp != first)
> > > > +               allsame = false;
> > > > +
> > > > +             /* 4 times vec_perm with number of lanes multiple of nunits.  */
> > > > +             tmp = tmp & (nunits - 1);
> > > > +             unsigned index = i & (nunits - 1);
> > > > +             if ((index < half && tmp >= half)
> > > > +                 || (index >= half && tmp < half))
> > > > +               cross_lane_p = true;
> > > > +
> > > > +             if (!allsame && cross_lane_p)
> > > > +               break;
> > > > +           }
> > > > +
> > > > +         if (i == SLP_TREE_LANES (node))
> > > > +           real_perm = false;
> > > > +       }
> > > > +
> > > > +      if (real_perm)
> > > > +       {
> > > > +         m_num_avx256_vec_perm[where]++;
> > > > +         if (dump_file && (dump_flags & TDF_DETAILS))
> > > > +           {
> > > > +             fprintf (dump_file, "Detected avx256 cross-lane permutation: ");
> > > > +             if (stmt_info)
> > > > +               print_gimple_expr (dump_file, stmt_info->stmt, 0, TDF_SLIM);
> > > > +             fprintf (dump_file, " \n");
> > > > +           }
> > > > +       }
> > > > +    }
> > > >
> > > >    /* Penalize DFmode vector operations for Bonnell.  */
> > > >    if (TARGET_CPU_P (BONNELL) && kind == vector_stmt
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
> > > > new file mode 100644
> > > > index 00000000000..8d4e641444d
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
> > > > @@ -0,0 +1,21 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-march=sierraforest -O2 -fdump-tree-slp-details" } */
> > > > +/* { dg-final { scan-tree-dump-times {(?n)Detected avx256 cross-lane permutation} 1 "slp2" } } */
> > > > +
> > > > +void
> > > > +foo (double* a, double* __restrict b, int c, int n)
> > > > +{
> > > > +  a[0] = b[100] * b[2];
> > > > +  a[1] = b[100] * b[3];
> > > > +  a[2] = b[100] * b[0];
> > > > +  a[3] = b[100] * b[1];
> > > > +}
> > > > +
> > > > +void
> > > > +foo1 (double* a, double* __restrict b, int c, int n)
> > > > +{
> > > > +  a[0] = b[100] * b[0];
> > > > +  a[1] = b[100] * b[1];
> > > > +  a[2] = b[100] * b[3];
> > > > +  a[3] = b[100] * b[2];
> > > > +}
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
> > > > new file mode 100644
> > > > index 00000000000..c11bea8c7b3
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
> > > > @@ -0,0 +1,24 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-march=sierraforest -Ofast" } */
> > > > +/* { dg-final { scan-assembler-not {(?n)vpermpd.*%ymm} } } */
> > > > +
> > > > +typedef struct {
> > > > +  unsigned short m1, m2, m3, m4;
> > > > +} the_struct_t;
> > > > +typedef struct {
> > > > +  double m1, m2, m3, m4, m5;
> > > > +} the_struct2_t;
> > > > +
> > > > +double bar1 (the_struct2_t*);
> > > > +
> > > > +double foo (double* k, unsigned int n, the_struct_t* the_struct) {
> > > > +  unsigned int u;
> > > > +  the_struct2_t result;
> > > > +  for (u=0; u < n; u++, k--) {
> > > > +    result.m1 += (*k)*the_struct[u].m1;
> > > > +    result.m2 += (*k)*the_struct[u].m2;
> > > > +    result.m3 += (*k)*the_struct[u].m3;
> > > > +    result.m4 += (*k)*the_struct[u].m4;
> > > > +  }
> > > > +  return bar1 (&result);
> > > > +}
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> > > > index d4f00b3fb52..e0399041ad9 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> > > > @@ -13,7 +13,7 @@ foo (void)
> > > >        b[i*8+0] = a[i*8+0];
> > > >        b[i*8+1] = a[i*8+0];
> > > >        b[i*8+2] = a[i*8+3];
> > > > -      b[i*8+3] = a[i*8+3];
> > > > +      b[i*8+3] = a[i*8+5];
> > > >        b[i*8+4] = a[i*8+4];
> > > >        b[i*8+5] = a[i*8+6];
> > > >        b[i*8+6] = a[i*8+4];
> > > > --
> > > > 2.34.1
> > > >
> >
> >
> >
> > --
> > BR,
> > Hongtao
