On Tue, Mar 31, 2026 at 2:59 PM Robin Dapp <[email protected]> wrote: > > Hi, > > (GCC 17) > > Currently, for a situation like > > _1 = BIT_FIELD_REF <src1_9(D), 32, 0>; > _2 = BIT_FIELD_REF <src1_9(D), 32, 32>; > _3 = BIT_FIELD_REF <src1_9(D), 32, 64>; > _4 = BIT_FIELD_REF <src1_9(D), 32, 96>; > _5 = BIT_FIELD_REF <src2_16(D), 32, 0>; > _6 = BIT_FIELD_REF <src2_16(D), 32, 32>; > _7 = BIT_FIELD_REF <src2_16(D), 32, 64>; > _8 = BIT_FIELD_REF <src2_16(D), 32, 96>; > _21 = {_1, _2, _3, _4, _5, _6, _7, _8}; > > we give up because the vec_init, looking through the BIT_FIELD_REFs, sees > two source vectors but we only expect one. > > This patch adds support for a second one, leading to a permutation being > created. We can do the same optimization in forwprop but that's too > late as we have already confused vect costing by the (usually) expensive > vec_init. > > Without changes, this causes a regression on x86 (pr54400.c): > > __m128d i4 (__m128d p, __m128d q) > { > __m128d r = { p[1] + p[0], q[1] + q[0] }; > return r; > } > > The BIT_FIELD_EXPRs are expanded to vec_selects, the constructor to > vec_concat. > We recognize this in combine as the appropriate insn pattern. I took the > shotgun approach and just added all now-necessary patterns.
I wonder if we can deal with this case during RTL expansion, basically detect a concat permute and see if there's vec_init of vector components supported and use that? The alternative to new patterns would be to make vec_perm_const of the target emit a concat? Or did I misunderstand the issue? > I suppose this kind of fallout is unavoidable, in particular for small vectors > that we can "naturally" combine into others. In such situations the > two-vector/permute approach is at a disadvantage. For larger vectors, though, > the vec_select/vec_concat approach doesn't scale. > > Bootstrapped and regtested on x86, power10, and aarch64. > Regtested on riscv64. > > Regards > Robin > > PR tree-optimization/105816 > > gcc/ChangeLog: > > * tree-vect-slp.cc (compatible_bit_field_ref_source): New helper > function. > (vect_build_slp_tree_1): Use helper. > (build_perm_source_node): New helper function. > (vect_build_slp_tree_2): Use helper and allow a second source > vector. > * config/i386/sse.md (*sse3_haddv2df3_perm): New pattern to > recognize interleaved/permuted hsub/hadd. > (*sse3_hsubv2df3_perm): Ditto. > (*sse3_haddv2df3_perm_rev): Ditto. > (*sse3_haddv2df3_merge): Ditto. > (*sse3_haddv2df3_merge_rev): Ditto. > (*sse3_hsubv2df3_perm): Ditto. > (*sse3_hsubv2df3_perm_rev): Ditto. > > gcc/testsuite/ChangeLog: > > * gcc.dg/vect/pr105816.c: New test. > > Signed-off-by: Robin Dapp <[email protected]> > --- > gcc/config/i386/sse.md | 141 +++++++++++++++++++++++++++ > gcc/testsuite/gcc.dg/vect/pr105816.c | 23 +++++ > gcc/tree-vect-slp.cc | 139 ++++++++++++++++++-------- > 3 files changed, 263 insertions(+), 40 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/vect/pr105816.c > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index a3f68ad9c1a..b32ea0657de 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -3844,6 +3844,147 @@ (define_insn "*sse3_hsubv2df3_low" > (set_attr "prefix" "orig,vex") > (set_attr "mode" "V2DF")]) > > +; The following patterns help recognize > +; (plus/minus interleave_even interleave_odd) > +; as h(sub/add). > +(define_insn_and_split "*sse3_haddv2df3_perm" > + [(set (match_operand:V2DF 0 "register_operand") > + (plus:V2DF > + (vec_select:V2DF > + (vec_concat:V4DF > + (match_operand:V2DF 1 "register_operand") > + (match_operand:V2DF 2 "vector_operand")) > + (parallel [(const_int 0) (const_int 2)])) > + (vec_select:V2DF > + (vec_concat:V4DF (match_dup 1) (match_dup 2)) > + (parallel [(const_int 1) (const_int 3)]))))] > + "TARGET_SSE3" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (vec_concat:V2DF > + (plus:DF > + (vec_select:DF (match_dup 1) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))) > + (plus:DF > + (vec_select:DF (match_dup 2) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 2) (parallel [(const_int 1)])))))]) > + > +(define_insn_and_split "*sse3_haddv2df3_perm_rev" > + [(set (match_operand:V2DF 0 "register_operand") > + (plus:V2DF > + (vec_select:V2DF > + (vec_concat:V4DF > + (match_operand:V2DF 1 "register_operand") > + (match_operand:V2DF 2 "vector_operand")) > + (parallel [(const_int 1) (const_int 3)])) > + (vec_select:V2DF > + (vec_concat:V4DF (match_dup 1) (match_dup 2)) > + (parallel [(const_int 0) (const_int 2)]))))] > + "TARGET_SSE3" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (vec_concat:V2DF > + (plus:DF > + (vec_select:DF (match_dup 1) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))) > + (plus:DF > + (vec_select:DF (match_dup 2) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 2) (parallel [(const_int 1)])))))]) > + > +(define_insn_and_split "*sse3_haddv2df3_merge" > + [(set (match_operand:V2DF 0 "register_operand") > + (plus:V2DF > + (vec_select:V2DF > + (vec_concat:V4DF > + (match_operand:V2DF 1 "register_operand") > + (match_operand:V2DF 2 "vector_operand")) > + (parallel [(const_int 1) (const_int 2)])) > + (vec_merge:V2DF > + (match_dup 1) > + (match_dup 2) > + (const_int 1))))] > + "TARGET_SSE3" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (vec_concat:V2DF > + (plus:DF > + (vec_select:DF (match_dup 1) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))) > + (plus:DF > + (vec_select:DF (match_dup 2) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 2) (parallel [(const_int 1)])))))]) > + > +(define_insn_and_split "*sse3_haddv2df3_merge_rev" > + [(set (match_operand:V2DF 0 "register_operand") > + (plus:V2DF > + (vec_merge:V2DF > + (match_operand:V2DF 1 "register_operand") > + (match_operand:V2DF 2 "vector_operand") > + (const_int 1)) > + (vec_select:V2DF > + (vec_concat:V4DF (match_dup 1) (match_dup 2)) > + (parallel [(const_int 1) (const_int 2)]))))] > + "TARGET_SSE3" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (vec_concat:V2DF > + (plus:DF > + (vec_select:DF (match_dup 1) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))) > + (plus:DF > + (vec_select:DF (match_dup 2) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 2) (parallel [(const_int 1)])))))]) > + > +(define_insn_and_split "*sse3_hsubv2df3_perm" > + [(set (match_operand:V2DF 0 "register_operand") > + (minus:V2DF > + (vec_select:V2DF > + (vec_concat:V4DF > + (match_operand:V2DF 1 "register_operand") > + (match_operand:V2DF 2 "vector_operand")) > + (parallel [(const_int 0) (const_int 2)])) > + (vec_select:V2DF > + (vec_concat:V4DF (match_dup 1) (match_dup 2)) > + (parallel [(const_int 1) (const_int 3)]))))] > + "TARGET_SSE3" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (vec_concat:V2DF > + (minus:DF > + (vec_select:DF (match_dup 1) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))) > + (minus:DF > + (vec_select:DF (match_dup 2) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 2) (parallel [(const_int 1)])))))]) > + > +(define_insn_and_split "*sse3_hsubv2df3_perm_rev" > + [(set (match_operand:V2DF 0 "register_operand") > + (minus:V2DF > + (vec_select:V2DF > + (vec_concat:V4DF > + (match_operand:V2DF 1 "register_operand") > + (match_operand:V2DF 2 "vector_operand")) > + (parallel [(const_int 1) (const_int 3)])) > + (vec_select:V2DF > + (vec_concat:V4DF (match_dup 1) (match_dup 2)) > + (parallel [(const_int 0) (const_int 2)]))))] > + "TARGET_SSE3" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (vec_concat:V2DF > + (minus:DF > + (vec_select:DF (match_dup 1) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))) > + (minus:DF > + (vec_select:DF (match_dup 2) (parallel [(const_int 0)])) > + (vec_select:DF (match_dup 2) (parallel [(const_int 1)])))))]) > + > (define_insn "avx_h<insn>v8sf3" > [(set (match_operand:V8SF 0 "register_operand" "=x") > (vec_concat:V8SF > diff --git a/gcc/testsuite/gcc.dg/vect/pr105816.c > b/gcc/testsuite/gcc.dg/vect/pr105816.c > new file mode 100644 > index 00000000000..9f9a24aea4e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/pr105816.c > @@ -0,0 +1,23 @@ > +/* { dg-do compile } */ > +/* { dg-require-effective-target vect_int } */ > +/* { dg-additional-options "-fdump-tree-slp-details" } */ > + > +void test_lo (short * __restrict dst, short *src1, short *src2, int n) > +{ > + for (int i = 0; i < n; ++i) > + { > + dst[0] = src1[0]; > + dst[1] = src1[1]; > + dst[2] = src1[2]; > + dst[3] = src1[3]; > + dst[4] = src2[0]; > + dst[5] = src2[1]; > + dst[6] = src2[2]; > + dst[7] = src2[3]; > + dst+=8; > + src1+=4; > + src2+=4; > + } > +} > + > +/* { dg-final { scan-tree-dump-not "different BIT_FIELD_REF arguments" > "slp1" } } */ > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index 8fa6a740c96..e71118bb56f 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -1088,6 +1088,23 @@ vect_record_max_nunits (vec_info *vinfo, stmt_vec_info > stmt_info, > return true; > } > > +/* Return true if VEC is an SSA name, of compatible vector type and its > + type of the same size as VECTYPE. */ > + > +static bool > +compatible_bit_field_ref_source (vec_info *vinfo, tree vec, tree vectype) > +{ > + return is_a <bb_vec_info> (vinfo) I'd like to see the bb_vec_info check ... > + && TREE_CODE (vec) == SSA_NAME > + /* When the element types are not compatible we pun the > + source to the target vectype which requires equal size. */ > + && ((VECTOR_TYPE_P (TREE_TYPE (vec)) > + && types_compatible_p (TREE_TYPE (vectype), > + TREE_TYPE (TREE_TYPE (vec)))) > + || operand_equal_p (TYPE_SIZE (vectype), > + TYPE_SIZE (TREE_TYPE (vec)))); > +} > + > /* Verify if the scalar stmts STMTS are isomorphic, require data > permutation or are of unsupported types of operation. Return > true if they are, otherwise return false and indicate in *MATCHES > @@ -1122,6 +1139,7 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char > *swap, > int first_reduc_idx = -1; > bool maybe_soft_fail = false; > tree soft_fail_nunits_vectype = NULL_TREE; > + tree other_bfref_source = NULL_TREE; > > tree vectype, nunits_vectype; > if (!vect_get_vector_types_for_stmt (vinfo, first_stmt_info, &vectype, > @@ -1316,15 +1334,7 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char > *swap, > && rhs_code == BIT_FIELD_REF) > { > tree vec = TREE_OPERAND (gimple_assign_rhs1 (stmt), 0); > - if (!is_a <bb_vec_info> (vinfo) > - || TREE_CODE (vec) != SSA_NAME > - /* When the element types are not compatible we pun the > - source to the target vectype which requires equal size. > */ > - || ((!VECTOR_TYPE_P (TREE_TYPE (vec)) > - || !types_compatible_p (TREE_TYPE (vectype), > - TREE_TYPE (TREE_TYPE (vec)))) > - && !operand_equal_p (TYPE_SIZE (vectype), > - TYPE_SIZE (TREE_TYPE (vec))))) ... here, because it isn't really relevant to compatibility and it helps seeing what is special to BB vect here. Otherwise looks good to me, but I cannot approve the x86 parts (and as said, I'm not sure it is the appropriate solution). Thanks, Richard. > + if (!compatible_bit_field_ref_source (vinfo, vec, vectype)) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > @@ -1422,17 +1432,40 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char > *swap, > continue; > } > > + /* Check if the number of distinct BIT_FIELD_REF source operands > + is at most 2. */ > if (!ldst_p > && first_stmt_code == BIT_FIELD_REF > && (TREE_OPERAND (gimple_assign_rhs1 (first_stmt_info->stmt), 0) > != TREE_OPERAND (gimple_assign_rhs1 (stmt_info->stmt), 0))) > { > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > - "Build SLP failed: different BIT_FIELD_REF " > - "arguments in %G", stmt); > - /* Mismatch. */ > - continue; > + tree cur = TREE_OPERAND (gimple_assign_rhs1 (stmt_info->stmt), > 0); > + if (!other_bfref_source) > + { > + if (!compatible_bit_field_ref_source (vinfo, cur, vectype)) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, > + vect_location, > + "Build SLP failed: different " > + "BIT_FIELD_REF arguments in " > + "%G", stmt); > + /* Mismatch. */ > + continue; > + } > + other_bfref_source = cur; > + } > + else if (cur != other_bfref_source) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, > + vect_location, > + "Build SLP failed: different " > + "BIT_FIELD_REF arguments in " > + "%G", stmt); > + /* Mismatch. */ > + continue; > + } > } > > if (call_stmt > @@ -1962,6 +1995,40 @@ vect_slp_build_two_operator_nodes (slp_tree perm, tree > vectype, > SLP_TREE_CHILDREN (perm).quick_push (child2); > } > > +/* Helper for creating a permutation source node when converting a > + vec[0] = BIT_FIELD_REF (vec, ...) > + into a vector permutation. > + > + Initializes an SLP_TREE with the type of VECTYPE, sets its number of > units, > + and adds VEC to its VEC_DEFs, returning the new node. */ > + > +static slp_tree > +build_perm_source_node (tree vec, tree vectype) > +{ > + slp_tree vnode = vect_create_new_slp_node (vNULL); > + if (operand_equal_p (TYPE_SIZE (vectype), TYPE_SIZE (TREE_TYPE (vec)))) > + /* ??? We record vectype here but we hide eventually necessary > + punning and instead rely on code generation to materialize > + VIEW_CONVERT_EXPRs as necessary. We instead should make > + this explicit somehow. */ > + SLP_TREE_VECTYPE (vnode) = vectype; > + else > + { > + /* For different size but compatible elements we can still > + use VEC_PERM_EXPR without punning. */ > + gcc_assert (VECTOR_TYPE_P (TREE_TYPE (vec)) > + && types_compatible_p (TREE_TYPE (vectype), > + TREE_TYPE (TREE_TYPE (vec)))); > + SLP_TREE_VECTYPE (vnode) = TREE_TYPE (vec); > + } > + auto nunits = TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (vnode)); > + unsigned HOST_WIDE_INT const_nunits; > + if (nunits.is_constant (&const_nunits)) > + SLP_TREE_LANES (vnode) = const_nunits; > + SLP_TREE_VEC_DEFS (vnode).safe_push (vec); > + return vnode; > +} > + > /* Recursively build an SLP tree starting from NODE. > Fail (and return a value not equal to zero) if def-stmts are not > isomorphic, require data permutation or are of unsupported types of > @@ -2195,15 +2262,19 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node, > && !gimple_vuse (stmt_info->stmt) > && gimple_assign_rhs_code (stmt_info->stmt) == BIT_FIELD_REF) > { > - /* vect_build_slp_tree_2 determined all BIT_FIELD_REFs reference > - the same SSA name vector of a compatible type to vectype. */ > + /* vect_build_slp_tree_1 determined all BIT_FIELD_REFs reference > + at most two SSA name vectors of a compatible type to vectype. */ > vec<std::pair<unsigned, unsigned> > lperm = vNULL; > tree vec = TREE_OPERAND (gimple_assign_rhs1 (stmt_info->stmt), 0); > + tree other_vec = NULL_TREE; > stmt_vec_info estmt_info; > FOR_EACH_VEC_ELT (stmts, i, estmt_info) > { > gassign *estmt = as_a <gassign *> (estmt_info->stmt); > tree bfref = gimple_assign_rhs1 (estmt); > + tree cur_vec = TREE_OPERAND (bfref, 0); > + if (cur_vec != vec) > + other_vec = cur_vec; > HOST_WIDE_INT lane; > if (!known_eq (bit_field_size (bfref), > tree_to_poly_uint64 (TYPE_SIZE (TREE_TYPE > (vectype)))) > @@ -2214,39 +2285,27 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node, > matches[0] = false; > return NULL; > } > - lperm.safe_push (std::make_pair (0, (unsigned)lane)); > + lperm.safe_push (std::make_pair (cur_vec == vec ? 0 : 1, > + (unsigned) lane)); > } > - slp_tree vnode = vect_create_new_slp_node (vNULL); > - if (operand_equal_p (TYPE_SIZE (vectype), TYPE_SIZE (TREE_TYPE (vec)))) > - /* ??? We record vectype here but we hide eventually necessary > - punning and instead rely on code generation to materialize > - VIEW_CONVERT_EXPRs as necessary. We instead should make > - this explicit somehow. */ > - SLP_TREE_VECTYPE (vnode) = vectype; > - else > - { > - /* For different size but compatible elements we can still > - use VEC_PERM_EXPR without punning. */ > - gcc_assert (VECTOR_TYPE_P (TREE_TYPE (vec)) > - && types_compatible_p (TREE_TYPE (vectype), > - TREE_TYPE (TREE_TYPE (vec)))); > - SLP_TREE_VECTYPE (vnode) = TREE_TYPE (vec); > - } > - auto nunits = TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (vnode)); > - unsigned HOST_WIDE_INT const_nunits; > - if (nunits.is_constant (&const_nunits)) > - SLP_TREE_LANES (vnode) = const_nunits; > - SLP_TREE_VEC_DEFS (vnode).safe_push (vec); > + slp_tree vnode = build_perm_source_node (vec, vectype); > + slp_tree vnode2 = NULL; > + if (other_vec) > + vnode2 = build_perm_source_node (other_vec, vectype); > + > /* We are always building a permutation node even if it is an identity > permute to shield the rest of the vectorizer from the odd node > representing an actual vector without any scalar ops. > ??? We could hide it completely with making the permute node > external? */ > - node = vect_create_new_slp_node (node, stmts, 1); > + node = vect_create_new_slp_node (node, stmts, !other_vec ? 1 : 2); > SLP_TREE_CODE (node) = VEC_PERM_EXPR; > SLP_TREE_LANE_PERMUTATION (node) = lperm; > SLP_TREE_VECTYPE (node) = vectype; > SLP_TREE_CHILDREN (node).quick_push (vnode); > + if (other_vec) > + SLP_TREE_CHILDREN (node).quick_push (vnode2); > + > return node; > } > /* When discovery reaches an associatable operation see whether we can > -- > 2.53.0
