On Tue, 21 May 2024, Richard Sandiford wrote: > Richard Biener <rguent...@suse.de> writes: > > The following avoids splitting store dataref groups during SLP > > discovery but instead forces (eventually single-lane) consecutive > > lane SLP discovery for all lanes of the group, creating VEC_PERM > > SLP nodes merging them so the store will always cover the whole group. > > > > With this for example > > > > int x[1024], y[1024], z[1024], w[1024]; > > void foo (void) > > { > > for (int i = 0; i < 256; i++) > > { > > x[4*i+0] = y[2*i+0]; > > x[4*i+1] = y[2*i+1]; > > x[4*i+2] = z[i]; > > x[4*i+3] = w[i]; > > } > > } > > > > which was previously using hybrid SLP can now be fully SLPed and > > Nice! > > > SSE code generated looks better (but of course you never know, > > I didn't actually benchmark). We of course need a VF of four here. > > > > .L2: > > movdqa z(%rax), %xmm0 > > movdqa w(%rax), %xmm4 > > movdqa y(%rax,%rax), %xmm2 > > movdqa y+16(%rax,%rax), %xmm1 > > movdqa %xmm0, %xmm3 > > punpckhdq %xmm4, %xmm0 > > punpckldq %xmm4, %xmm3 > > movdqa %xmm2, %xmm4 > > shufps $238, %xmm3, %xmm2 > > movaps %xmm2, x+16(,%rax,4) > > movdqa %xmm1, %xmm2 > > shufps $68, %xmm3, %xmm4 > > shufps $68, %xmm0, %xmm2 > > movaps %xmm4, x(,%rax,4) > > shufps $238, %xmm0, %xmm1 > > movaps %xmm2, x+32(,%rax,4) > > movaps %xmm1, x+48(,%rax,4) > > addq $16, %rax > > cmpq $1024, %rax > > jne .L2 > > > > The extra permute nodes merging distinct branches of the SLP > > tree might be unexpected for some code, esp. since > > SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we > > cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS > > consistently as we can have a mix of both. > > > > The patch keeps the sub-trees form consecutive lanes but that's > > in principle not necessary if we for example have an even/odd > > split which now would result in N single-lane sub-trees. That's > > left for future improvements. > > > > The interesting part is how VLA vector ISAs handle merging of > > two vectors that's not trivial even/odd merging. The strathegy > > of how to build the permute tree might need adjustments for that > > (in the end splitting each branch to single lanes and then doing > > even/odd merging would be the brute-force fallback). Not sure > > how much we can or should rely on the SLP optimize pass to handle > > this. > > Yeah, I think we'll have to play it by ear. It might involve tweaking > the order in which we "reduce" the VEC_PERM_EXPRs. E.g. in the above > example, my guess is that it would be better to reduce the z/w part > first and then permute that with y, whereas it looks like the patch > always goes left-to-right.
The patch reduces the two inputs with the least number of lanes recursively. And within that from left-to-right. That should keep us in the bound of two input vectors for one output vector. It should also resemble classical interleaving when we have N single lanes. > The patch LGTM FWIW. I've sent out a v2 for the CIs and pushed the bugfix parts of the series. I hope to see that riscv isn't left with 100s of FAILs beause of the change and if that looks green push and polish up what I have for the load side. > I suppose this does further hard-code the assumption that the vector > type is uniquely determined by the element type (and so we can safely > assume that everything has the same vector type as the first split node). > But that's pretty much pervasive, and not easy to solve until we're > serious about putting some infrastructre in place for it. It just > caught me out when reading vector code for the first time in a while :) > > (E.g. in the above example, the y vector could eventually be double the > z & w vectors.) Yeah, you might have noticed the RFC patch series I sent out last year where I tried to get rid of this constraint. I stopped implementing when I figured it should work but doing all-SLP first really is important. Richard. > Thanks, > Richard > > > * tree-vect-slp.cc (vect_build_slp_instance): Do not split > > store dataref groups on loop SLP discovery failure but create > > a single SLP instance for the stores but branch to SLP sub-trees > > and merge with a series of VEC_PERM nodes. > > --- > > gcc/tree-vect-slp.cc | 240 ++++++++++++++++++++++++++++++++++++++----- > > 1 file changed, 214 insertions(+), 26 deletions(-) > > > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > index 43f2c153bf0..873748b0a72 100644 > > --- a/gcc/tree-vect-slp.cc > > +++ b/gcc/tree-vect-slp.cc > > @@ -3468,12 +3468,7 @@ vect_build_slp_instance (vec_info *vinfo, > > return true; > > } > > } > > - else > > - { > > - /* Failed to SLP. */ > > - /* Free the allocated memory. */ > > - scalar_stmts.release (); > > - } > > + /* Failed to SLP. */ > > > > stmt_vec_info stmt_info = stmt_info_; > > /* Try to break the group up into pieces. */ > > @@ -3491,6 +3486,9 @@ vect_build_slp_instance (vec_info *vinfo, > > if (is_a <bb_vec_info> (vinfo) > > && (i > 1 && i < group_size)) > > { > > + /* Free the allocated memory. */ > > + scalar_stmts.release (); > > + > > tree scalar_type > > = TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (stmt_info))); > > tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type, > > @@ -3535,38 +3533,228 @@ vect_build_slp_instance (vec_info *vinfo, > > } > > } > > > > - /* For loop vectorization split into arbitrary pieces of size > 1. > > */ > > - if (is_a <loop_vec_info> (vinfo) > > - && (i > 1 && i < group_size) > > - && !vect_slp_prefer_store_lanes_p (vinfo, stmt_info, group_size, i)) > > + /* For loop vectorization split the RHS into arbitrary pieces of > > + size >= 1. */ > > + else if (is_a <loop_vec_info> (vinfo) > > + && (i > 0 && i < group_size) > > + && !vect_slp_prefer_store_lanes_p (vinfo, > > + stmt_info, group_size, i)) > > { > > - unsigned group1_size = i; > > - > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, > > "Splitting SLP group at stmt %u\n", i); > > > > - stmt_vec_info rest = vect_split_slp_store_group (stmt_info, > > - group1_size); > > - /* Loop vectorization cannot handle gaps in stores, make sure > > - the split group appears as strided. */ > > - STMT_VINFO_STRIDED_P (rest) = 1; > > - DR_GROUP_GAP (rest) = 0; > > - STMT_VINFO_STRIDED_P (stmt_info) = 1; > > - DR_GROUP_GAP (stmt_info) = 0; > > + /* Analyze the stored values and pinch them together with > > + a permute node so we can preserve the whole store group. */ > > + auto_vec<slp_tree> rhs_nodes; > > + > > + /* Calculate the unrolling factor based on the smallest type. */ > > + poly_uint64 unrolling_factor = 1; > > + > > + unsigned int start = 0, end = i; > > + while (start < group_size) > > + { > > + gcc_assert (end - start >= 1); > > + vec<stmt_vec_info> substmts; > > + substmts.create (end - start); > > + for (unsigned j = start; j < end; ++j) > > + substmts.quick_push (scalar_stmts[j]); > > + max_nunits = 1; > > + node = vect_build_slp_tree (vinfo, substmts, end - start, > > + &max_nunits, > > + matches, limit, &tree_size, bst_map); > > + if (node) > > + { > > + /* ??? Possibly not safe, but not sure how to check > > + and fail SLP build? */ > > + unrolling_factor > > + = force_common_multiple (unrolling_factor, > > + calculate_unrolling_factor > > + (max_nunits, end - start)); > > + rhs_nodes.safe_push (node); > > + start = end; > > + end = group_size; > > + } > > + else > > + { > > + substmts.release (); > > + if (end - start == 1) > > + { > > + /* Single-lane discovery failed. Free ressources. */ > > + for (auto node : rhs_nodes) > > + vect_free_slp_tree (node); > > + scalar_stmts.release (); > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "SLP discovery failed\n"); > > + return false; > > + } > > + > > + /* ??? It really happens that we soft-fail SLP > > + build at a mismatch but the matching part hard-fails > > + later. As we know we arrived here with a group > > + larger than one try a group of size one! */ > > + if (!matches[0]) > > + end = start + 1; > > + else > > + for (unsigned j = start; j < end; j++) > > + if (!matches[j - start]) > > + { > > + end = j; > > + break; > > + } > > + } > > + } > > + > > + /* Now we assume we can build the root SLP node from all > > + stores. */ > > + node = vect_create_new_slp_node (scalar_stmts, 1); > > + SLP_TREE_VECTYPE (node) = SLP_TREE_VECTYPE (rhs_nodes[0]); > > + /* And a permute merging all RHS SLP trees. */ > > + slp_tree perm = vect_create_new_slp_node (rhs_nodes.length (), > > + VEC_PERM_EXPR); > > + SLP_TREE_CHILDREN (node).quick_push (perm); > > + SLP_TREE_LANE_PERMUTATION (perm).create (group_size); > > + SLP_TREE_VECTYPE (perm) = SLP_TREE_VECTYPE (node); > > + SLP_TREE_LANES (perm) = group_size; > > + /* ??? We should set this NULL but that's not expected. */ > > + SLP_TREE_REPRESENTATIVE (perm) > > + = SLP_TREE_REPRESENTATIVE (SLP_TREE_CHILDREN (rhs_nodes[0])[0]); > > + for (unsigned j = 0; j < rhs_nodes.length (); ++j) > > + { > > + SLP_TREE_CHILDREN (perm) > > + .quick_push (SLP_TREE_CHILDREN (rhs_nodes[j])[0]); > > + for (unsigned k = 0; > > + k < SLP_TREE_SCALAR_STMTS (rhs_nodes[j]).length (); ++k) > > + { > > + /* ??? We should populate SLP_TREE_SCALAR_STMTS > > + or SLP_TREE_SCALAR_OPS but then we might have > > + a mix of both in our children. */ > > + SLP_TREE_LANE_PERMUTATION (perm) > > + .quick_push (std::make_pair (j, k)); > > + } > > + } > > + > > + /* Now we have a single permute node but we cannot code-generate > > + the case with more than two inputs. > > + Perform pairwise reduction, reducing the two inputs > > + with the least number of lanes to one and then repeat until > > + we end up with two inputs. That scheme makes sure we end > > + up with permutes satisfying the restriction of requiring > > + at most two vector inputs to produce a single vector output. */ > > + while (SLP_TREE_CHILDREN (perm).length () > 2) > > + { > > + /* Pick the two nodes with the least number of lanes, > > + prefer the earliest candidate and maintain ai < bi. */ > > + int ai = -1; > > + int bi = -1; > > + for (unsigned ci = 0; > > + ci < SLP_TREE_CHILDREN (perm).length (); ++ci) > > + { > > + if (ai == -1) > > + ai = ci; > > + else if (bi == -1) > > + bi = ci; > > + else if ((SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[ci]) > > + < SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[ai])) > > + || (SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[ci]) > > + < SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[bi]))) > > + { > > + if (SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[ai]) > > + <= SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[bi])) > > + bi = ci; > > + else > > + { > > + ai = bi; > > + bi = ci; > > + } > > + } > > + } > > + > > + /* Produce a merge of nodes ai and bi. */ > > + slp_tree a = SLP_TREE_CHILDREN (perm)[ai]; > > + slp_tree b = SLP_TREE_CHILDREN (perm)[bi]; > > + unsigned n = SLP_TREE_LANES (a) + SLP_TREE_LANES (b); > > + slp_tree permab = vect_create_new_slp_node (2, VEC_PERM_EXPR); > > + SLP_TREE_LANES (permab) = n; > > + SLP_TREE_LANE_PERMUTATION (permab).create (n); > > + SLP_TREE_VECTYPE (permab) = SLP_TREE_VECTYPE (perm); /* ??? */ > > + /* ??? We should set this NULL but that's not expected. */ > > + SLP_TREE_REPRESENTATIVE (permab) = SLP_TREE_REPRESENTATIVE (perm); > > + SLP_TREE_CHILDREN (permab).quick_push (a); > > + for (unsigned k = 0; k < SLP_TREE_LANES (a); ++k) > > + SLP_TREE_LANE_PERMUTATION (permab) > > + .quick_push (std::make_pair (0, k)); > > + SLP_TREE_CHILDREN (permab).quick_push (b); > > + for (unsigned k = 0; k < SLP_TREE_LANES (b); ++k) > > + SLP_TREE_LANE_PERMUTATION (permab) > > + .quick_push (std::make_pair (1, k)); > > + > > + /* Put the merged node into 'perm', in place of a */ > > + SLP_TREE_CHILDREN (perm)[ai] = permab; > > + /* Adjust the references to b in the permutation > > + of perm and to the later children which we'll > > + remove. */ > > + for (unsigned k = 0; k < SLP_TREE_LANES (perm); ++k) > > + { > > + std::pair<unsigned, unsigned> &p > > + = SLP_TREE_LANE_PERMUTATION (perm)[k]; > > + if (p.first == (unsigned) bi) > > + { > > + p.first = ai; > > + p.second += SLP_TREE_LANES (a); > > + } > > + else if (p.first > (unsigned) bi) > > + p.first--; > > + } > > + SLP_TREE_CHILDREN (perm).ordered_remove (bi); > > + } > > + > > + /* Create a new SLP instance. */ > > + slp_instance new_instance = XNEW (class _slp_instance); > > + SLP_INSTANCE_TREE (new_instance) = node; > > + SLP_INSTANCE_UNROLLING_FACTOR (new_instance) = unrolling_factor; > > + SLP_INSTANCE_LOADS (new_instance) = vNULL; > > + SLP_INSTANCE_ROOT_STMTS (new_instance) = root_stmt_infos; > > + SLP_INSTANCE_REMAIN_DEFS (new_instance) = remain; > > + SLP_INSTANCE_KIND (new_instance) = kind; > > + new_instance->reduc_phis = NULL; > > + new_instance->cost_vec = vNULL; > > + new_instance->subgraph_entries = vNULL; > > + > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "SLP size %u vs. limit %u.\n", > > + tree_size, max_tree_size); > > > > - bool res = vect_analyze_slp_instance (vinfo, bst_map, stmt_info, > > - kind, max_tree_size, limit); > > - if (i + 1 < group_size) > > - res |= vect_analyze_slp_instance (vinfo, bst_map, > > - rest, kind, max_tree_size, limit); > > + vinfo->slp_instances.safe_push (new_instance); > > + > > + /* ??? We've replaced the old SLP_INSTANCE_GROUP_SIZE with > > + the number of scalar stmts in the root in a few places. > > + Verify that assumption holds. */ > > + gcc_assert (SLP_TREE_SCALAR_STMTS (SLP_INSTANCE_TREE (new_instance)) > > + .length () == group_size); > > > > - return res; > > + if (dump_enabled_p ()) > > + { > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "Final SLP tree for instance %p:\n", > > + (void *) new_instance); > > + vect_print_slp_graph (MSG_NOTE, vect_location, > > + SLP_INSTANCE_TREE (new_instance)); > > + } > > + return true; > > } > > + else > > + /* Free the allocated memory. */ > > + scalar_stmts.release (); > > > > /* Even though the first vector did not all match, we might be able > > to SLP > > (some) of the remainder. FORNOW ignore this possibility. */ > > } > > + else > > + /* Free the allocated memory. */ > > + scalar_stmts.release (); > > > > /* Failed to SLP. */ > > if (dump_enabled_p ()) > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)