The following is a prototype for how to represent load/store-lanes within
SLP.  For now I have settled on a single load node with multiple permute
nodes (one for each loaded lane), and a single store node with a single
permute node feeding it.
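Schematically, as a summary of the dump below (this is not actual dump
output), the graph for the two-lane example has this shape:

  store node (ldst_lanes)
    -> store permute node (ldst_lanes)          fills the vect_array for .STORE_LANES
         -> operation node for lane 0 (+ 7)
              -> load permute node { 0[0] } (ldst_lanes)   extracts lane 0
                   -> load node (ldst_lanes)    one .LOAD_LANES for the whole group
         -> operation node for lane 1 (* 3)
              -> load permute node { 0[1] } (ldst_lanes)   extracts lane 1
                   -> load node (ldst_lanes)    the same, shared load node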
For

  for (int i = 0; i < 1024; ++i)
    {
      a[2*i] = b[2*i] + 7;
      a[2*i+1] = b[2*i+1] * 3;
    }

you get the following SLP graph, where I explain how things are set up
and code-generated:

t.c:3:21: note:   SLP graph after lowering permutations:
t.c:3:21: note:   node 0x578af60 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note:   op template: *_6 = _7;
t.c:3:21: note:     stmt 0 *_6 = _7;
t.c:3:21: note:     stmt 1 *_12 = _13;
t.c:3:21: note:     children 0x578aff8

This is the store node; it is marked with ldst_lanes = true during SLP
discovery.  This node code-generates

  .MASK_LEN_STORE_LANES (vectp_a.10_30, 32B, { -1, ... }, loop_len_38, 0, vect_array.9);

t.c:3:21: note:   node 0x578aff8 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note:   op: VEC_PERM_EXPR
t.c:3:21: note:     { }
t.c:3:21: note:     lane permutation { 0[0] 1[0] }
t.c:3:21: note:     children 0x578ab38 0x578ad98

This is the store-lane shuffling node; it is also marked with
ldst_lanes = true during SLP discovery.  This node code-generates

  vect_array.9[0] = vect__7.7_35;
  vect_array.9[1] = vect__13.8_33;

and it records the virtual defs as its vectorized defs.  This is a bit
awkward; on the store side it would be nicer to elide 'vect_array' and
instead make .MASK_LEN_STORE_LANES "varargs", passing the vector defs as
arguments (sketched after the dump below).

t.c:3:21: note:   node 0x578ab38 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:3:21: note:   op template: _7 = _5 + 7;
t.c:3:21: note:     stmt 0 _7 = _5 + 7;
t.c:3:21: note:     children 0x578abd0 0x578ac68
t.c:3:21: note:   node (constant) 0x578ac68 (max_nunits=1, refcnt=1)
t.c:3:21: note:     { 7 }
t.c:3:21: note:   node 0x578b978 (max_nunits=2, refcnt=2) vector(2) int
t.c:3:21: note:   op template: _13 = _11 * 3;
t.c:3:21: note:     stmt 0 _13 = _11 * 3;
t.c:3:21: note:     children 0x578b4b8 0x578b8e0
t.c:3:21: note:   node (constant) 0x578b8e0 (max_nunits=1, refcnt=1)
t.c:3:21: note:     { 3 }
t.c:3:21: note:   node 0x578abd0 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:3:21: note:   op: VEC_PERM_EXPR
t.c:3:21: note:     stmt 0 _5 = *_4;
t.c:3:21: note:     lane permutation { 0[0] }
t.c:3:21: note:     children 0x578b090
t.c:3:21: note:   node 0x578b4b8 (max_nunits=2, refcnt=2) vector(2) int
t.c:3:21: note:   op: VEC_PERM_EXPR
t.c:3:21: note:     stmt 0 _11 = *_10;
t.c:3:21: note:     lane permutation { 0[1] }
t.c:3:21: note:     children 0x578bbd8

The above two are the load-lane shufflings; they are marked with
ldst_lanes during load permutation lowering.  They code-generate

  _34 = vect_array.6[1];
  _36 = vect_array.6[0];

picking up vect_array.6 by looking at the virtual operand def in the
child node.  As with the store side this is awkward, but much harder to
avoid since we lack support for multiple SSA defs.  What could be done,
though, is to write vect_array.6 into SSA and use BIT_FIELD_REF for the
extracts (yes, that would require ARRAY_TYPE gimple registers); that
alternative is also sketched after the dump below.

t.c:3:21: note:   node 0x578b090 (max_nunits=4, refcnt=3) vector([4,4]) int
t.c:3:21: note:   op template: _5 = *_4;
t.c:3:21: note:     stmt 0 _5 = *_4;
t.c:3:21: note:     stmt 1 _11 = *_10;
t.c:3:21: note:     load permutation { 0 1 }

This is the load itself, marked ldst_lanes during load permutation
lowering.  It code-generates

  vect_array.6 = .MASK_LEN_LOAD_LANES (vectp_b.4_40, 32B, { -1, ... }, loop_len_38, 0);
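To make the two alternatives mentioned above concrete, here is a rough
sketch of the IL they would aim for - purely hypothetical, the "varargs"
form of .MASK_LEN_STORE_LANES does not exist and array_48 is a made-up
SSA name:

  ;; Load side: the loaded array as an SSA value, lanes extracted with
  ;; BIT_FIELD_REF instead of vect_array.6[n] (needs ARRAY_TYPE gimple
  ;; registers).
  array_48 = .MASK_LEN_LOAD_LANES (vectp_b.4_40, 32B, { -1, ... }, loop_len_38, 0);
  _36 = BIT_FIELD_REF <array_48, ..., 0>;
  _34 = BIT_FIELD_REF <array_48, ..., ...>;

  ;; Store side: a "varargs" store-lanes taking the vector defs directly,
  ;; eliding vect_array.9.
  .MASK_LEN_STORE_LANES (vectp_a.10_30, 32B, { -1, ... }, loop_len_38, 0,
                         vect__7.7_35, vect__13.8_33);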
The patch below (tested with just the above testcase and on riscv) runs
into the following, where I have not yet figured out what's wrong:

t.c:1:6: internal compiler error: Segmentation fault
    1 | void foo (int * __restrict a, int *b)
      |      ^~~
0x3ea25f5 internal_error(char const*, ...)
        /home/rguenther/src/gcc/gcc/diagnostic-global-context.cc:491
0x1ac12b6 crash_signal
        /home/rguenther/src/gcc/gcc/toplev.cc:319
0x10af391 tree_check(tree_node const*, char const*, int, char const*, tree_code)
        /home/rguenther/src/gcc/gcc/tree.h:3919
0x1140a0a TYPE_VECTOR_SUBPARTS(tree_node const*)
        /home/rguenther/src/gcc/gcc/tree.h:4254
0x1e7ab36 vect_set_loop_controls_directly
        /home/rguenther/src/gcc/gcc/tree-vect-loop-manip.cc:514

The patch gets ncopies wrong (I'm sure of that), and the way I create
virtual operands is surely going to break things.  It likely also gets
the support for doing load-lanes for multi-lane SLP sub-groups wrong.

The way this SLP representation forces the IFN and the array-access code
generation to be separated introduces problems not present with non-SLP.
Keeping the operation in one node would be manageable on the store side
by recording an incoming lane permutation for it.  On the load side we
could code-generate everything from the load node and make a special
contract between the "extract" permute and the load (but I suppose that's
already present with the patch below); the extract would then become just
a forwarder doing no code generation.
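Purely for illustration, the store side of that single-node alternative
might dump along the following lines (hypothetical output, node addresses
reused from above) - the incoming lane permutation is recorded on the
store node itself and the separate VEC_PERM node disappears:

t.c:3:21: note:   node 0x578af60 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note:   op template: *_6 = _7;
t.c:3:21: note:     lane permutation { 0[0] 1[0] }
t.c:3:21: note:     children 0x578ab38 0x578ad98

The store node would then emit both the vect_array fills and the
.MASK_LEN_STORE_LANES call itself.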
I'm posting this as-is; I will probably try the alternative to decide
which is more ugly.  I guess I'm hoping for magic other ideas here ;)

Thanks,
Richard.

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
	load, store and permute nodes.
	* tree-vect-slp.cc (vect_build_slp_instance): For stores, iff
	the target prefers store-lanes, discover single-lane sub-groups
	and do not perform interleaving lowering but mark the nodes
	with ldst_lanes.
	(vect_lower_load_permutations): When the target supports
	load-lanes and the loads all fit the pattern, split out only a
	single level of permutes and mark the load and permute nodes
	with ldst_lanes.
	(vectorizable_slp_permutation_1): Code-generate the
	load-/store-lane permute parts.
	* tree-vect-stmts.cc (get_group_load_store_type): Support
	load/store-lanes for SLP.
	(vectorizable_store): Hack code generation for store-lanes.
	(vectorizable_load): Hack code generation for load-lanes.
---
 gcc/tree-vect-slp.cc   | 146 +++++++++++++++++++++++++++++++++++++++--
 gcc/tree-vect-stmts.cc |  63 ++++++++++++++----
 gcc/tree-vectorizer.h  |   3 +
 3 files changed, 192 insertions(+), 20 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 2dc6d365303..14a3e166e73 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -120,6 +120,7 @@ _slp_tree::_slp_tree ()
   SLP_TREE_SIMD_CLONE_INFO (this) = vNULL;
   SLP_TREE_DEF_TYPE (this) = vect_uninitialized_def;
   SLP_TREE_CODE (this) = ERROR_MARK;
+  this->ldst_lanes = false;
   SLP_TREE_VECTYPE (this) = NULL_TREE;
   SLP_TREE_REPRESENTATIVE (this) = NULL;
   SLP_TREE_REF_COUNT (this) = 1;
@@ -3601,9 +3602,22 @@ vect_build_slp_instance (vec_info *vinfo,
 	 size >= 1.  */
       else if (is_a <loop_vec_info> (vinfo)
 	       && (i > 0 && i < group_size)
-	       && !vect_slp_prefer_store_lanes_p (vinfo,
-						  stmt_info, group_size, i))
-	{
+	       /*&& !vect_slp_prefer_store_lanes_p (vinfo,
+						  stmt_info, group_size, i)*/)
+	{
+	  /* There are targets that cannot do even/odd interleaving schemes
+	     so they absolutely need to use load/store-lanes.  For now
+	     force single-lane SLP for them - they would be happy with
+	     uniform power-of-two lanes (but depending on element size),
+	     but even if we can use 'i' as indicator we would need to
+	     backtrack when later lanes fail to discover with the same
+	     granularity.  */
+	  bool want_store_lanes
+	    = vect_slp_prefer_store_lanes_p (vinfo,
+					     stmt_info, group_size, 1 /*i*/);
+	  if (want_store_lanes)
+	    i = 1;
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "Splitting SLP group at stmt %u\n", i);
@@ -3637,7 +3651,10 @@ vect_build_slp_instance (vec_info *vinfo,
 						      (max_nunits, end - start));
 	  rhs_nodes.safe_push (node);
 	  start = end;
-	  end = group_size;
+	  if (want_store_lanes)
+	    end = start + 1;
+	  else
+	    end = group_size;
 	}
       else
 	{
@@ -3711,7 +3728,14 @@ vect_build_slp_instance (vec_info *vinfo,
 	 we end up with two inputs.  That scheme makes sure we end up
 	 with permutes satisfying the restriction of requiring at
 	 most two vector inputs to produce a single vector output
-	 when the number of lanes is even.  */
+	 when the number of lanes is even.
+	 Leave the merge as is when we want to use store-lanes.  */
+      if (want_store_lanes)
+	{
+	  perm->ldst_lanes = 1;
+	  node->ldst_lanes = 1;
+	}
+      else
       while (SLP_TREE_CHILDREN (perm).length () > 2)
 	{
 	  /* When we have three equal sized groups left the pairwise
@@ -4057,6 +4081,36 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
   if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
     return;
 
+  /* Verify if all load permutations can be implemented with a suitably
+     large element load-lanes operation.  */
+  unsigned ld_lanes_lanes = SLP_TREE_LANES (loads[0]);
+  if (exact_log2 (ld_lanes_lanes) == -1
+      /* ???  Need to handle ld_lanes_lanes != 1 correctly.  */
+      || !vect_load_lanes_supported (SLP_TREE_VECTYPE (loads[0]),
+				     group_lanes, false))
+    ld_lanes_lanes = 0;
+  else
+    for (slp_tree load : loads)
+      {
+	if (SLP_TREE_LANES (load) != ld_lanes_lanes)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	unsigned first = SLP_TREE_LOAD_PERMUTATION (load)[0];
+	if (first % ld_lanes_lanes != 0)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	for (unsigned i = 1; i < SLP_TREE_LANES (load); ++i)
+	  if (SLP_TREE_LOAD_PERMUTATION (load)[i] != first + i)
+	    {
+	      ld_lanes_lanes = 0;
+	      break;
+	    }
+      }
+
   for (slp_tree load : loads)
     {
       /* Leave masked or gather loads alone for now.  */
@@ -4071,7 +4125,8 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	 with a non-1:1 load permutation around instead of canonicalizing
 	 those into a load and a permute node.  Removing this early
 	 check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2
+	  && ld_lanes_lanes == 0)
 	continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4104,6 +4159,12 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	final_perm.quick_push
 	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
 
+      if (ld_lanes_lanes != 0)
+	{
+	  l0->ldst_lanes = true;
+	  load->ldst_lanes = true;
+	}
+      else
       while (1)
 	{
 	  unsigned group_lanes = SLP_TREE_LANES (l0);
@@ -9705,6 +9766,8 @@ vect_add_slp_permutation (vec_info *vinfo, gimple_stmt_iterator *gsi,
   node->push_vec_def (perm_stmt);
 }
 
+extern tree create_vector_array (tree elem_type, unsigned HOST_WIDE_INT nelems);
+
 /* Subroutine of vectorizable_slp_permutation.  Check whether the target
    can perform permutation PERM on the (1 or 2) input nodes in CHILDREN.
    If GSI is nonnull, emit the permutation there.
@@ -9758,6 +9821,77 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
 
   gcc_assert (perm.length () == SLP_TREE_LANES (node));
 
+  /* Load/store-lanes permute.  */
+  if (node->ldst_lanes)
+    {
+      if (!gsi)
+	/* ???  We should have this flag set on the permute before a store
+	   or the permutes after a load.  Check vect_load_lanes_supported
+	   or leave that to the store?  */
+	return 1;
+      /* For the load case there should be a single child and
+	 its vector def should be the virtual SSA def of the
+	 .LOAD_LANES call.  Create the vect__5.7_48 = vect_array.6[n]
+	 load here plus punning of the vector type (in case an element
+	 represents multiple lanes).  */
+      if (children.length () == 1)
+	{
+	  slp_tree child = children[0];
+	  unsigned vec_num = (SLP_TREE_LANE_PERMUTATION (node)[0].second
+			      / SLP_TREE_LANES (node));
+	  for (tree def : SLP_TREE_VEC_DEFS (child))
+	    {
+	      gcc_assert (virtual_operand_p (def));
+	      tree decl = gimple_call_lhs (SSA_NAME_DEF_STMT (def));
+	      tree vec = make_ssa_name (TREE_TYPE (TREE_TYPE (decl)));
+	      gassign *new_stmt
+		= gimple_build_assign (vec,
+				       build4 (ARRAY_REF, TREE_TYPE (vec),
+					       decl, build_int_cstu
+						       (unsigned_type_node, vec_num),
+					       NULL_TREE, NULL_TREE));
+	      vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+	      if (!useless_type_conversion_p (vectype, TREE_TYPE (vec)))
+		{
+		  vec = make_ssa_name (vectype);
+		  new_stmt = gimple_build_assign (vec, build1 (VIEW_CONVERT_EXPR, vectype, gimple_assign_lhs (new_stmt)));
+		  vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+		}
+	      node->push_vec_def (vec);
+	    }
+	  return SLP_TREE_VEC_DEFS (node).length ();
+	}
+      /* For the store case create the array and the set of stores from
+	 all children vector defs.  The vector def should then be the
+	 virtual def of the last store into the array.  */
+      else
+	{
+	  unsigned vec_num = SLP_TREE_VEC_DEFS (children[0]).length ();
+	  for (unsigned i = 0; i < vec_num; ++i)
+	    {
+	      tree decl = create_vector_array (op_vectype,
+					       SLP_TREE_LANES (node)
+					       / SLP_TREE_LANES (children[0]));
+	      tree last_vdef = NULL_TREE;
+	      for (unsigned j = 0; j < children.length (); ++j)
+		{
+		  slp_tree child = children[j];
+		  tree ref = build4 (ARRAY_REF, op_vectype, decl,
+				     build_int_cstu (unsigned_type_node, j),
+				     NULL_TREE, NULL_TREE);
+		  gimple *new_stmt
+		    = gimple_build_assign (ref, SLP_TREE_VEC_DEFS (child)[i]);
+		  vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+		  last_vdef = make_ssa_name (gimple_vop (cfun));
+		  SSA_NAME_DEF_STMT (last_vdef) = new_stmt;
+		  gimple_set_vdef (new_stmt, last_vdef);
+		}
+	      node->push_vec_def (last_vdef);
+	    }
+	  return vec_num;
+	}
+    }
+
   /* REPEATING_P is true if every output vector is guaranteed to use the
      same permute vector.  We can handle that case for both variable-length
      and constant-length vectors, but we only handle other cases for
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fdcda0d2aba..ee28accca8c 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -146,7 +146,7 @@ record_stmt_cost (stmt_vector_for_cost *body_cost_vec, int count,
 
 /* Return a variable of type ELEM_TYPE[NELEMS].  */
 
-static tree
+tree
 create_vector_array (tree elem_type, unsigned HOST_WIDE_INT nelems)
 {
   return create_tmp_var (build_array_type_nelts (elem_type, nelems),
@@ -2069,6 +2069,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 	     is irrelevant for them.  */
 	  *alignment_support_scheme = dr_unaligned_supported;
 	}
+      /* Try using LOAD/STORE_LANES.  */
+      else if (slp_node->ldst_lanes
+	       && (*lanes_ifn
+		   = (vls_type == VLS_LOAD
+		      ? vect_load_lanes_supported (vectype, group_size, masked_p)
+		      : vect_store_lanes_supported (vectype, group_size,
+						    masked_p))) != IFN_LAST)
+	*memory_access_type = VMAT_LOAD_STORE_LANES;
       else
 	*memory_access_type = VMAT_CONTIGUOUS;
 
@@ -8762,7 +8770,7 @@ vectorizable_store (vec_info *vinfo,
 
   if (memory_access_type == VMAT_LOAD_STORE_LANES)
     {
-      gcc_assert (!slp && grouped_store);
+      //gcc_assert (!slp && grouped_store);
       unsigned inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector stores, we'd like to cost with
 	 the total number of them once instead of cost each one by one.  */
@@ -8784,7 +8792,7 @@ vectorizable_store (vec_info *vinfo,
 	  op = vect_get_store_rhs (next_stmt_info);
 	  if (costing_p)
 	    update_prologue_cost (&prologue_cost, op);
-	  else
+	  else if (!slp)
 	    {
 	      vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies,
 					     op,
@@ -8799,15 +8807,15 @@ vectorizable_store (vec_info *vinfo,
 	{
 	  if (mask)
 	    {
-	      vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-					     mask, &vec_masks,
-					     mask_vectype);
+	      if (slp_node)
+		vect_get_slp_defs (mask_node, &vec_masks);
+	      else
+		vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
+					       mask, &vec_masks,
+					       mask_vectype);
 	      vec_mask = vec_masks[0];
 	    }
 
-	  /* We should have catched mismatched types earlier.  */
-	  gcc_assert (
-	    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
 	  dataref_ptr
 	    = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
 					NULL, offset, &dummy,
@@ -8819,11 +8827,17 @@ vectorizable_store (vec_info *vinfo,
 	    gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 	  /* DR_CHAIN is then used as an input to
 	     vect_permute_store_chain().  */
+	  if (!slp)
+	    {
+	      /* We should have catched mismatched types earlier.  */
+	      gcc_assert (
+		useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
 	  for (i = 0; i < group_size; i++)
 	    {
 	      vec_oprnd = (*gvec_oprnds[i])[j];
 	      dr_chain[i] = vec_oprnd;
 	    }
+	  }
 	  if (mask)
 	    vec_mask = vec_masks[j];
 	  dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi,
@@ -8836,8 +8850,16 @@ vectorizable_store (vec_info *vinfo,
 	      continue;
 	    }
 
+	  tree vec_array;
+	  if (slp)
+	    {
+	      tree def = SLP_TREE_VEC_DEFS (SLP_TREE_CHILDREN (slp_node)[0])[j];
+	      vec_array = TREE_OPERAND (gimple_assign_lhs (SSA_NAME_DEF_STMT (def)), 0);
+	    }
+	  else
+	    {
 	  /* Get an array into which we can store the individual vectors.  */
-	  tree vec_array = create_vector_array (vectype, vec_num);
+	  vec_array = create_vector_array (vectype, vec_num);
 
 	  /* Invalidate the current contents of VEC_ARRAY.  This should become
 	     an RTL clobber too, which prevents the vector registers
@@ -8851,6 +8873,7 @@ vectorizable_store (vec_info *vinfo,
 	      write_vector_array (vinfo, stmt_info, gsi,
 				  vec_oprnd, vec_array, i);
 	    }
+	  }
 
 	  tree final_mask = NULL;
 	  tree final_len = NULL;
@@ -8917,9 +8940,10 @@ vectorizable_store (vec_info *vinfo,
 
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
-	  if (j == 0)
+	  if (j == 0 && !slp)
 	    *vec_stmt = new_stmt;
-	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+	  if (!slp)
+	    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
 	}
 
       if (costing_p)
@@ -10765,7 +10789,7 @@ vectorizable_load (vec_info *vinfo,
     {
       gcc_assert (alignment_support_scheme == dr_aligned
 		  || alignment_support_scheme == dr_unaligned_supported);
-      gcc_assert (grouped_load && !slp);
+      //gcc_assert (grouped_load && !slp);
 
       unsigned int inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector loads, we'd like to cost with
@@ -10884,6 +10908,15 @@ vectorizable_load (vec_info *vinfo,
       gimple_call_set_nothrow (call, true);
       vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 
+      if (slp)
+	{
+	  tree def = copy_ssa_name (gimple_vuse (call));
+	  gimple_set_vdef (call, def);
+	  SSA_NAME_DEF_STMT (def) = call;
+	  slp_node->push_vec_def (def);
+	}
+      else
+	{
       dr_chain.create (vec_num);
       /* Extract each vector into an SSA_NAME.  */
       for (i = 0; i < vec_num; i++)
@@ -10900,8 +10933,10 @@ vectorizable_load (vec_info *vinfo,
 
       vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
       dr_chain.release ();
+	}
 
-      *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+      if (!slp_node)
+	*vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
     }
 
   if (costing_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8eb3ec4df86..ac288541c51 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -222,6 +222,9 @@ struct _slp_tree {
   unsigned int lanes;
   /* The operation of this node.  */
   enum tree_code code;
+  /* Whether uses of this load or feeders of this store are suitable
+     for load/store-lanes.  */
+  bool ldst_lanes;
 
   int vertex;
 
-- 
2.43.0