On Fri, Sep 5, 2025 at 11:26 AM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Fri, Sep 5, 2025 at 10:11 AM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Fri, Sep 5, 2025 at 3:07 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Fri, Sep 5, 2025 at 4:54 AM liuhongt <hongtao....@intel.com> wrote:
> > > >
> > > > SLP may take a broadcast as a kind of vec_perm; the patch checks the
> > > > permutation index to exclude those false positives.
> > > >
> > > > > Btw, you can now (in some cases) do better, namely you should
> > > > > always have 'node' available and when SLP_TREE_PERMUTE_P (node)
> > > > > then SLP_TREE_LANE_PERMUTATION could be inspected to
> > > > > detect the harmful cross-lane permutes.  Note BB vectorization
> > > > > still (always IIRC) uses SLP_TREE_LOAD_PERMUTATION,
> > > > > so for permuted loads you have a load 'node' and the permutation
> > > > > applied is visible in SLP_TREE_LOAD_PERMUTATION (which is
> > > > > a simpler data structure).  That said, BB vectorization loads
> > > > > could have harmful AVX2 permutes attached, so the patch is
> > > > > maybe a bit overzealous.
> > > > Changed.
> > > >
> > > > If we know it's a broadcast or an in-lane permutation, then don't count
> > > > it into m_num_avx256_vec_perm, and also exclude the 0-times vec_perm
> > > > which is BIT_FIELD_REF <vect_**, 64, 0> and will be optimized to a
> > > > broadcast or optimized away?
> > > >
> > > > There are adjustments for the backend to generate vpermilpd/lps instead
> > > > of vpermd/vpermq when the permutation is in-lane; that part is in the
> > > > second patch.
> > > >
> > > > I notice several benchmarks benefit from this patch.
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > Any comments?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > > >         Check permutation index for vec_perm, don't count it if we
> > > >         know it's not a cross-lane permutation.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.target/i386/avx256_avoid_vec_perm.c: Adjust testcase.
> > > >         * gcc.target/i386/avx256_avoid_vec_perm-2.c: New test.
> > > > ---
> > > >  gcc/config/i386/i386.cc                       | 60 ++++++++++++++++++-
> > > >  .../gcc.target/i386/avx256_avoid_vec_perm-2.c | 21 +++++++
> > > >  .../gcc.target/i386/avx256_avoid_vec_perm-5.c | 24 ++++
> > > >  .../gcc.target/i386/avx256_avoid_vec_perm.c   |  2 +-
> > > >  4 files changed, 104 insertions(+), 3 deletions(-)
> > > >  create mode 100644
> > > > gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
> > > >  create mode 100644
> > > > gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index 55c9b16dd38..62779331774 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -26237,8 +26237,64 @@ ix86_vector_costs::add_stmt_cost (int count,
> > > > vect_cost_for_stmt kind,
> > > >      stmt_cost = ix86_default_vector_cost (kind, mode);
> > > >
> > > >    if (kind == vec_perm && vectype
> > > > -      && GET_MODE_SIZE (TYPE_MODE (vectype)) == 32)
> > > > -    m_num_avx256_vec_perm[where]++;
> > > > +      && GET_MODE_SIZE (TYPE_MODE (vectype)) == 32
> > > > +      /* BIT_FIELD_REF <vect_**, 64, 0> 0 times vec_perm costs 0 in
> > > > body. */
> > > > +      && count != 0)
> > >
> > > so the vectorizer costs something with count == 0?  I'll see to fix
> > > that, but this also means the code should have used
> > > m_num_avx256_vec_perm[where] += count.
> > > > > > > + { > > > > + bool real_perm = true; > > > > + unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype); > > > > + > > > > + if (node > > > > + && SLP_TREE_LOAD_PERMUTATION (node).exists () > > > > + /* Loop vectorization will have 4 times vec_perm > > > > + with index as {0, 0, 0, 0}. > > > > + But it actually generates > > > > + vec_perm_expr <vect, vect, 0, 0, 0, 0> > > > > + vec_perm_expr <vect, vect, 1, 1, 1, 1> > > > > + vec_perm_expr <vect, vect, 2, 2, 2, 2> > > > > + vec_perm_expr <vect, vect, 3, 3, 3, 3> */ > > > > + && (!dyn_cast<loop_vec_info> (m_vinfo) > > > > + || count * nunits == SLP_TREE_LANES (node))) > > > > > > Indeed the actual permutes are not what is recorded in the > > > load-permutation. > > > Instead it's the sequence of lanes to be "compressed" after loading > > > contiguously from the DR group (or strided from the DR group). I'll note > > > that you should almost never see load-permutation when doing loop > > > vectorization (remaining ones are "legacy"), and for loop vectorization > > > a { 0 } load permutation would not be a broadcast, instead when the > > > group size is 2 and the mode is, say, V8SI it would be {0, 2, 4, 6, 8, > > > 10, 12, 14 }. > > > But as said, you shouldn't see those for loop_vect. For BB vect this > > > complication does not exist since VF == 1 there (no unrolling). > > > > > > So - may I suggest to split this into loop and BB vect parts, separating > > > the > > > two? For loop vect I'd instead handle SLP_TREE_PERMUTE_P nodes > > > which have SLP_TREE_LANE_PERMUTATION. All lane permutations > > > with SLP_TREE_LANES < nunits are necessarily cross-lane for loops, > > > a splat there has % nunits == 0 lanes and all source indexes the same. > > > There are also "good" permutes that have more than one source, > > > those degenerating to blends. That said, SLP_TREE_PERMUTE_P > > > can have up to N input vectors, the SLP_TREE_LOAD_PERMUTATION > > > permutes one or two inputs from the same vector memory load. > > > > Something like below? we can still share the common code to detect > > in-lane permutation. > > But it looks like it failed to detect in-lane permutation in > > avx256_avoid_vec_perm-3.c since SLP_TREE_PERMUTE_P(node) is false. > > Yeah, that's a case we do not lower to SLP_TREE_PERMUTE_P yet ... > > > if (node > > && SLP_TREE_LOAD_PERMUTATION (node).exists () > > - /* Loop vectorization will have 4 times vec_perm > > - with index as {0, 0, 0, 0}. > > - But it actually generates > > - vec_perm_expr <vect, vect, 0, 0, 0, 0> > > - vec_perm_expr <vect, vect, 1, 1, 1, 1> > > - vec_perm_expr <vect, vect, 2, 2, 2, 2> > > - vec_perm_expr <vect, vect, 3, 3, 3, 3> */ > > && (!dyn_cast<loop_vec_info> (m_vinfo) > > - || count * nunits == SLP_TREE_LANES (node))) > > + || (SLP_TREE_PERMUTE_P (node) > > + && SLP_TREE_LANES (node) % nunits == 0))) > > Note SLP_TREE_PERMUTE_P nodes never have SLP_TREE_LOAD_PERMUTATION, > as said they have SLP_TREE_LANE_PERMUTATION instead. So for the > (legacy) load permutation > case it would be > > && (is_a <bb_vec_info> (m_vinfo) > || SLP_TREE_LANES (node) % nunits == 0) > > the case of SLP_TREE_PERMUTE_P would need to be added separately, > but those are also costed as kind == vec_perm. A common use-case were > blends but now that we lower most load permutations to explicit > SLP permute nodes there are also those when vectorizing loops. > > I guess it's reasonable to first handle SLP_TREE_LOAD_PERMUTATION, > the other case could be done as followup.
Oh, and I'll note we of course compute the actual VEC_PERM GIMPLE stmt
permutes we'll emit during analysis (and re-compute the same during
transform).  It would be an improvement to record these (and possibly
re-use them during transform); that would make them available to backend
costing as well.  I'll make a mental note to see how difficult that is to
pull off.

Richard.

> > > >
> > > > +        {
> > > > +          unsigned half = nunits / 2;
> > > > +          unsigned i = 0;
> > > > +          bool allsame = true;
> > > > +          unsigned first = SLP_TREE_LOAD_PERMUTATION (node)[0];
> > > > +          bool cross_lane_p = false;
> > > > +          for (i = 0 ; i != SLP_TREE_LANES (node); i++)
> > > > +            {
> > > > +              unsigned tmp = SLP_TREE_LOAD_PERMUTATION (node)[i];
> > > > +              /* allsame is just a broadcast.  */
> > > > +              if (tmp != first)
> > > > +                allsame = false;
> > > > +
> > > > +              /* 4 times vec_perm with number of lanes multiple of
> > > > nunits.  */
> > > > +              tmp = tmp & (nunits - 1);
> > > > +              unsigned index = i & (nunits - 1);
> > > > +              if ((index < half && tmp >= half)
> > > > +                  || (index >= half && tmp < half))
> > > > +                cross_lane_p = true;
> > > > +
> > > > +              if (!allsame && cross_lane_p)
> > > > +                break;
> > > > +            }
> > > > +
> > > > +          if (i == SLP_TREE_LANES (node))
> > > > +            real_perm = false;
> > > > +        }
> > > > +
> > > > +      if (real_perm)
> > > > +        {
> > > > +          m_num_avx256_vec_perm[where]++;
> > > > +          if (dump_file && (dump_flags & TDF_DETAILS))
> > > > +            {
> > > > +              fprintf (dump_file, "Detected avx256 cross-lane
> > > > permutation: ");
> > > > +              if (stmt_info)
> > > > +                print_gimple_expr (dump_file, stmt_info->stmt, 0,
> > > > TDF_SLIM);
> > > > +              fprintf (dump_file, " \n");
> > > > +            }
> > > > +        }
> > > > +    }
> > > >
> > > >    /* Penalize DFmode vector operations for Bonnell.  */
> > > >    if (TARGET_CPU_P (BONNELL) && kind == vector_stmt
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
> > > > b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
> > > > new file mode 100644
> > > > index 00000000000..8d4e641444d
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
> > > > @@ -0,0 +1,21 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-march=sierraforest -O2 -fdump-tree-slp-details" } */
> > > > +/* { dg-final { scan-tree-dump-times {(?n)Detected avx256 cross-lane
> > > > permutation} 1 "slp2" } } */
> > > > +
> > > > +void
> > > > +foo (double* a, double* __restrict b, int c, int n)
> > > > +{
> > > > +  a[0] = b[100] * b[2];
> > > > +  a[1] = b[100] * b[3];
> > > > +  a[2] = b[100] * b[0];
> > > > +  a[3] = b[100] * b[1];
> > > > +}
> > > > +
> > > > +void
> > > > +foo1 (double* a, double* __restrict b, int c, int n)
> > > > +{
> > > > +  a[0] = b[100] * b[0];
> > > > +  a[1] = b[100] * b[1];
> > > > +  a[2] = b[100] * b[3];
> > > > +  a[3] = b[100] * b[2];
> > > > +}
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
> > > > b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
> > > > new file mode 100644
> > > > index 00000000000..c11bea8c7b3
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
> > > > @@ -0,0 +1,24 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-march=sierraforest -Ofast" } */
> > > > +/* { dg-final { scan-assembler-not {(?n)vpermpd.*%ymm} } } */
> > > > +
> > > > +typedef struct {
> > > > +  unsigned short m1, m2, m3, m4;
> > > > +} the_struct_t;
> > > > +typedef struct {
> > > > +  double m1, m2, m3, m4, m5;
> > > > +} the_struct2_t;
> > > > +
> > > > +double bar1 (the_struct2_t*);
> > > > +
> > > > +double foo (double* k, unsigned int n, the_struct_t* the_struct) {
> > > > +  unsigned int u;
> > > > +  the_struct2_t result;
> > > > +  for (u=0; u < n; u++, k--) {
> > > > +    result.m1 += (*k)*the_struct[u].m1;
> > > > +    result.m2 += (*k)*the_struct[u].m2;
> > > > +    result.m3 += (*k)*the_struct[u].m3;
> > > > +    result.m4 += (*k)*the_struct[u].m4;
> > > > +  }
> > > > +  return bar1 (&result);
> > > > +}
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> > > > b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> > > > index d4f00b3fb52..e0399041ad9 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> > > > @@ -13,7 +13,7 @@ foo (void)
> > > >        b[i*8+0] = a[i*8+0];
> > > >        b[i*8+1] = a[i*8+0];
> > > >        b[i*8+2] = a[i*8+3];
> > > > -      b[i*8+3] = a[i*8+3];
> > > > +      b[i*8+3] = a[i*8+5];
> > > >        b[i*8+4] = a[i*8+4];
> > > >        b[i*8+5] = a[i*8+6];
> > > >        b[i*8+6] = a[i*8+4];
> > > > --
> > > > 2.34.1
> > > >
> > > >
> > --
> > BR,
> > Hongtao
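
As a footnote to the earlier point that a { 0 } load permutation is not a
broadcast under loop vectorization: a quick standalone sketch (again not
GCC code; group_size, nunits and the expansion formula are illustrative)
that expands it over one vector's worth of lanes for group size 2 and a
V8SI-sized vector yields {0, 2, 4, 6, 8, 10, 12, 14}, which is neither a
splat nor confined to a 128-bit half.

#include <cstdio>
#include <vector>

int
main ()
{
  /* The recorded { 0 } load permutation, interleaving group of 2,
     8 elements per vector (V8SI).  All names and values illustrative.  */
  std::vector<unsigned> load_perm = {0};
  unsigned group_size = 2, nunits = 8, half = nunits / 2;
  unsigned lanes = load_perm.size ();

  bool splat = true, cross_lane = false;
  printf ("selected group elements:");
  for (unsigned i = 0; i < nunits; i++)
    {
      /* Lane i of the vectorized result comes from scalar iteration
         i / lanes and group member load_perm[i % lanes].  */
      unsigned elt = (i / lanes) * group_size + load_perm[i % lanes];
      printf (" %u", elt);
      if (elt != load_perm[0])
        splat = false;
      /* Position of the element within the contiguously loaded vectors.  */
      if (((elt % nunits) < half) != ((i % nunits) < half))
        cross_lane = true;
    }
  printf ("\nsplat: %d  cross-lane: %d\n", splat, cross_lane);
  return 0;
}

Running it prints "splat: 0  cross-lane: 1", i.e. such a node would still
be counted by the heuristic.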