Re: [PATCH] Split load permutation
On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Richard Biener writes:
> >> Another thing we'd like for SVE in general is to allow vectors *with*
> >> gaps throughout the SLP graph, since with predication it's simple to
> >> ignore any inactive element where necessary.  This is often much cheaper
> >> than packing the active elements together to remove gaps.  For example:
> >>
> >>   a[0] += 1;
> >>   a[2] += 1;
> >>   a[3] += 1;
> >>
> >> should just be a predicated load, add and store, with no shuffling.
> >>
> >> Even on targets without predication, it might be better to leave
> >> the gaps in for part of the graph, e.g. if two loads feed a single
> >> permute.  So it would be good if the node representation allowed
> >> arbitrary permutations, including with “dead” elements at any
> >> position.  (I realise that isn't news and that keeping the gap
> >> removal with the load was just “for now” :-))
> >
> > Hmm, OK.  So I'm still experimenting with how to incrementally push
> > this kind of change to master.  Unfortunately it resists any
> > "serious" change because we've painted ourselves into a corner with
> > respect to load and store handling ;)  Well, it just means the change
> > will need to be bigger than anticipated.
> >
> > As for inactive lanes, for SVE this means a mask register for each
> > operation, correct?
>
> Certainly for potentially trapping ops and reductions, yeah.
> For others it really doesn't matter (and those wouldn't require
> a target that supports predication).
>
> So I don't think having gaps necessarily means we have a mask.
> Being able to attach masks to nodes (perhaps independently of gaps)
> would be useful though.
>
> > At the moment the SLP graph is a representation
> > of the scalar GIMPLE IL and we cannot really change that until we
> > commit to a vectorization factor and more.
> > So an inactive lane
> > is simply not there, and a "load" is simply another way of building
> > up a vector from scalar stmt results, much like those "built from scalars"
> > SLP nodes, but we of course make them special because we have those
> > DR groups that are used.
> >
> > So with the patch as posted the "gaps" are represented in the
> > load permutation of the "loads", which is where you could create
> > mask registers from and simply push them to SLP parents (up to
> > the point you get to a parent whose children have differing masks...).
>
> OK.  But I'd argue that's true of permutations too.  At the point
> that the SLP graph just represents scalar gimple IL, we can simply
> permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
> operations in the graph.

That's true when the only permutes are in leaves.  The SLP_TREE_SCALAR_STMTS
impose an order of lanes, so once different parents need a different order
we need explicit permute nodes swapping SLP_TREE_SCALAR_STMTS.  Of course
that's only because we have combined multiple scalar stmts into a single
SLP node, which really is ...

> Permutations and gaps only come into play once we add more
> vector-related information or restrictions.

... what those "vector-related" restrictions are right now.

> > I think we're going to have a set of post-processing steps on the
> > initial SLP graph for both "optimizing" where (and if) to do permutations
> > and whether to elide gap removal in favor of masks (and we'd then
> > annotate the SLP nodes with the active mask).
>
> So we'd start out without any permutations and gaps, and then add
> them as a post-processing step based on what the target can handle?
> If so, sounds good to me FWIW.

If you start with the notion of loads and stores being compositions/extracts
then yes - based on how we want to vectorize those operations we'd expose
any required permutation explicitly so we can then combine & optimize those.
For the existing patches I tried to relate loads involving the same
data-ref group by introducing a shared SLP node accessing the whole group,
so that's not starting with no permute, and it comes with some issues.
If we don't relate loads from the same dataref group early (during SLP
discovery, that is), then we have to try (again based on cost) doing that
during that post-processing, which then becomes more complicated.

As said - this is work in progress and I'm experimenting in multiple
directions ...

> > All of this requires pushing back some of the early decisions
> > (I've mostly identified vectorization factor determination) but also
> > doing load/store classification early.  For example if we end up
> > with strided loads or stores such operations can fuse with any
> > permutation at no cost.
> >
> > At the moment I'm continuing to poke the code for its least
> > resistance for introducing at least parts of the machinery.
> > I'm targeting post-processing for merging of identical
> > permutes across operations like it appears for
> >
> >   a[0] = b[1] + c[1];
> >   a[1] = b[0] + c[0];
> >
> >> I guess this to some extent feeds into a long-standing TODO to allow
> >> “don't
Re: [PATCH] Split load permutation
Richard Biener writes:
>> Another thing we'd like for SVE in general is to allow vectors *with*
>> gaps throughout the SLP graph, since with predication it's simple to
>> ignore any inactive element where necessary.  This is often much cheaper
>> than packing the active elements together to remove gaps.  For example:
>>
>>   a[0] += 1;
>>   a[2] += 1;
>>   a[3] += 1;
>>
>> should just be a predicated load, add and store, with no shuffling.
>>
>> Even on targets without predication, it might be better to leave
>> the gaps in for part of the graph, e.g. if two loads feed a single
>> permute.  So it would be good if the node representation allowed
>> arbitrary permutations, including with “dead” elements at any
>> position.  (I realise that isn't news and that keeping the gap
>> removal with the load was just “for now” :-))
>
> Hmm, OK.  So I'm still experimenting with how to incrementally push
> this kind of change to master.  Unfortunately it resists any
> "serious" change because we've painted ourselves into a corner with
> respect to load and store handling ;)  Well, it just means the change
> will need to be bigger than anticipated.
>
> As for inactive lanes, for SVE this means a mask register for each
> operation, correct?

Certainly for potentially trapping ops and reductions, yeah.
For others it really doesn't matter (and those wouldn't require
a target that supports predication).

So I don't think having gaps necessarily means we have a mask.
Being able to attach masks to nodes (perhaps independently of gaps)
would be useful though.

> At the moment the SLP graph is a representation
> of the scalar GIMPLE IL and we cannot really change that until we
> commit to a vectorization factor and more.  So an inactive lane
> is simply not there and a "load" is simply another way of building
> up a vector from scalar stmt results much like those "built from scalars"
> SLP nodes but we of course make them special because we have those
> DR groups that are used.
> So with the patch as posted the "gaps" are represented in the
> load permutation of the "loads", which is where you could create
> mask registers from and simply push them to SLP parents (up to
> the point you get to a parent whose children have differing masks...).

OK.  But I'd argue that's true of permutations too.  At the point
that the SLP graph just represents scalar gimple IL, we can simply
permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
operations in the graph.  Permutations and gaps only come into play
once we add more vector-related information or restrictions.

> I think we're going to have a set of post-processing steps on the
> initial SLP graph for both "optimizing" where (and if) to do permutations
> and whether to elide gap removal in favor of masks (and we'd then
> annotate the SLP nodes with the active mask).

So we'd start out without any permutations and gaps, and then add
them as a post-processing step based on what the target can handle?
If so, sounds good to me FWIW.

> All of this requires pushing back some of the early decisions
> (I've mostly identified vectorization factor determination) but also
> doing load/store classification early.  For example if we end up
> with strided loads or stores such operations can fuse with any
> permutation at no cost.
>
> At the moment I'm continuing to poke the code for its least
> resistance for introducing at least parts of the machinery.
> I'm targeting post-processing for merging of identical
> permutes across operations like it appears for
>
>   a[0] = b[1] + c[1];
>   a[1] = b[0] + c[0];
>
>> I guess this to some extent feeds into a long-standing TODO to allow
>> “don't care” elements in things like vec_perm_indices.
>
> Hmm, so when there's no masking and we don't want to remove gaps
> we'd replace the permutation removing the gaps with an operation
> that turns the inactive lanes to zero (bitwise and) or all ones
> (bitwise or).
> Tracking the number of "vector" lanes compared
> to the number of "real scalar" lanes in a SLP node might be enough
> to handle this more or less transparently.

I was also thinking of cases like:

  a[0] += b[0];
  a[1] += c[1];
  a[2] += b[2];
  a[3] += c[3];

(for all targets).  If we can somehow be smart about the handling of “c”
and load a vector at c[0] rather than c[1], the rhs of the addition could
be a blend of:

  b[0] b[2] c[1] c[3]

Or, even if we can't:

  b[0] b[2] c[1] c[3]

is a natural permute on some targets (e.g. Arm ones).  This of course
requires the usual proof that the other elements of b[] and c[] can be
safely loaded.

Thanks,
Richard
Re: [PATCH] Split load permutation
On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Sorry for the slow response…

No problem.

> Richard Biener writes:
> > This splits SLP load nodes with load permutation into a SLP
> > load node (with load "permutation" removing gaps) and a
> > lane permutation node.  The first and foremost goal of this
> > is to be able to have a single SLP node for each load group
> > so we can start making decisions about how to vectorize
> > them factoring in all loads of that group.  The second
> > goal is to eventually be able to optimize permutations by
> > pushing them down where they can be combined from multiple
> > children to the output.  We do have one very special case
> > handled by vect_attempt_slp_rearrange_stmts doing it all
> > the way down for reductions that are associative.
>
> Sounds great!
>
> > For example for
> >
> >   l1 = a[0]; l2 = a[1];
> >   b[0] = l1; b[1] = l2;
> >   c[0] = l2; c[1] = l1;
> >
> > we can avoid generating loads twice.  For
> >
> >   l1 = a[0]; l2 = a[1]; l3 = a[2];
> >   b[0] = l1; b[1] = l2;
> >   c[0] = l2; c[1] = l3;
> >
> > we will have a SLP load node with three lanes plus
> > two lane permutation nodes selecting two lanes each.  In
> > a loop context this will cause a VF of two and three
> > loads per vectorized loop iteration (plus two permutes)
> > while previously we used a VF of one with two loads
> > and no permutation per loop iteration.  In the new
> > scheme the number of loads is less but we pay with
> > permutes and a higher VF.
> >
> > There is (bad) interaction with determining the vectorization
> > factor, which for BB vectorization causes missed vectorizations
> > because the "load parts of a dataref group directly" optimization
> > is not (yet) reflected in the SLP graph.
> >
> > There is (bad) interaction with CTOR vectorization since we now
> > get confused about where to insert vectorized stmts when combining
> > two previously distinct SLP graphs.
> >
> > My immediate focus is on the SSA verification FAILs but the
> > other part points at a missing piece in this - a "pass"
> > that "optimizes" the SLP graph with respect to permutations
> > and loads, ultimately for example deciding between using
> > interleaving schemes, scalar loads, "SLP" + permutation,
> > load-lanes, etc.  This is also the thing that blocks
> > SLP only (apart from teaching the few pieces that cannot do SLP
> > to do SLP).
> >
> > I'm posting this mostly to make people think how it fits
> > their future development and architecture features.
>
> Yeah, the interleaving scheme is something we'd very much like for SVE2,
> where for some operations that produce multiple vectors, it would be
> better to organise them on an even/odd division instead of a low/high
> division.  There are instructions both to produce and to consume values
> in odd/even form, so in many cases no explicit permutation would be
> needed.
>
> I've also seen cases for downward loops where we reverse vectors after
> loading and reverse them again before storing, even though the loop
> could easily have operated on the reversed vectors directly.
>
> Another thing we'd like for SVE in general is to allow vectors *with*
> gaps throughout the SLP graph, since with predication it's simple to
> ignore any inactive element where necessary.  This is often much cheaper
> than packing the active elements together to remove gaps.  For example:
>
>   a[0] += 1;
>   a[2] += 1;
>   a[3] += 1;
>
> should just be a predicated load, add and store, with no shuffling.
>
> Even on targets without predication, it might be better to leave
> the gaps in for part of the graph, e.g. if two loads feed a single
> permute.  So it would be good if the node representation allowed
> arbitrary permutations, including with “dead” elements at any
> position.  (I realise that isn't news and that keeping the gap
> removal with the load was just “for now” :-))

Hmm, OK.
So I'm still experimenting with how to incrementally push this kind of
change to master.  Unfortunately it resists any "serious" change because
we've painted ourselves into a corner with respect to load and store
handling ;)  Well, it just means the change will need to be bigger than
anticipated.

As for inactive lanes, for SVE this means a mask register for each
operation, correct?

At the moment the SLP graph is a representation of the scalar GIMPLE IL
and we cannot really change that until we commit to a vectorization
factor and more.  So an inactive lane is simply not there and a "load"
is simply another way of building up a vector from scalar stmt results,
much like those "built from scalars" SLP nodes, but we of course make
them special because we have those DR groups that are used.

So with the patch as posted the "gaps" are represented in the load
permutation of the "loads", which is where you could create mask
registers from and simply push them to SLP parents (up to the point you
get to a parent whose children have differing masks...).

I think we're going to
Re: [PATCH] Split load permutation
Sorry for the slow response…

Richard Biener writes:
> This splits SLP load nodes with load permutation into a SLP
> load node (with load "permutation" removing gaps) and a
> lane permutation node.  The first and foremost goal of this
> is to be able to have a single SLP node for each load group
> so we can start making decisions about how to vectorize
> them factoring in all loads of that group.  The second
> goal is to eventually be able to optimize permutations by
> pushing them down where they can be combined from multiple
> children to the output.  We do have one very special case
> handled by vect_attempt_slp_rearrange_stmts doing it all
> the way down for reductions that are associative.

Sounds great!

> For example for
>
>   l1 = a[0]; l2 = a[1];
>   b[0] = l1; b[1] = l2;
>   c[0] = l2; c[1] = l1;
>
> we can avoid generating loads twice.  For
>
>   l1 = a[0]; l2 = a[1]; l3 = a[2];
>   b[0] = l1; b[1] = l2;
>   c[0] = l2; c[1] = l3;
>
> we will have a SLP load node with three lanes plus
> two lane permutation nodes selecting two lanes each.  In
> a loop context this will cause a VF of two and three
> loads per vectorized loop iteration (plus two permutes)
> while previously we used a VF of one with two loads
> and no permutation per loop iteration.  In the new
> scheme the number of loads is less but we pay with
> permutes and a higher VF.
>
> There is (bad) interaction with determining the vectorization
> factor, which for BB vectorization causes missed vectorizations
> because the "load parts of a dataref group directly" optimization
> is not (yet) reflected in the SLP graph.
>
> There is (bad) interaction with CTOR vectorization since we now
> get confused about where to insert vectorized stmts when combining
> two previously distinct SLP graphs.
>
> My immediate focus is on the SSA verification FAILs but the
> other part points at a missing piece in this - a "pass"
> that "optimizes" the SLP graph with respect to permutations
> and loads, ultimately for example deciding between using
> interleaving schemes, scalar loads, "SLP" + permutation,
> load-lanes, etc.  This is also the thing that blocks
> SLP only (apart from teaching the few pieces that cannot do SLP
> to do SLP).
>
> I'm posting this mostly to make people think how it fits
> their future development and architecture features.

Yeah, the interleaving scheme is something we'd very much like for SVE2,
where for some operations that produce multiple vectors, it would be
better to organise them on an even/odd division instead of a low/high
division.  There are instructions both to produce and to consume values
in odd/even form, so in many cases no explicit permutation would be needed.

I've also seen cases for downward loops where we reverse vectors after
loading and reverse them again before storing, even though the loop
could easily have operated on the reversed vectors directly.

Another thing we'd like for SVE in general is to allow vectors *with*
gaps throughout the SLP graph, since with predication it's simple to
ignore any inactive element where necessary.  This is often much cheaper
than packing the active elements together to remove gaps.  For example:

  a[0] += 1;
  a[2] += 1;
  a[3] += 1;

should just be a predicated load, add and store, with no shuffling.

Even on targets without predication, it might be better to leave
the gaps in for part of the graph, e.g. if two loads feed a single
permute.  So it would be good if the node representation allowed
arbitrary permutations, including with “dead” elements at any
position.  (I realise that isn't news and that keeping the gap
removal with the load was just “for now” :-))

I guess this to some extent feeds into a long-standing TODO to allow
“don't care” elements in things like vec_perm_indices.
Thanks,
Richard
[PATCH] Split load permutation
This splits SLP load nodes with load permutation into a SLP load node
(with load "permutation" removing gaps) and a lane permutation node.
The first and foremost goal of this is to be able to have a single SLP
node for each load group so we can start making decisions about how to
vectorize them factoring in all loads of that group.  The second goal
is to eventually be able to optimize permutations by pushing them down
where they can be combined from multiple children to the output.  We do
have one very special case handled by vect_attempt_slp_rearrange_stmts
doing it all the way down for reductions that are associative.

For example for

  l1 = a[0]; l2 = a[1];
  b[0] = l1; b[1] = l2;
  c[0] = l2; c[1] = l1;

we can avoid generating loads twice.  For

  l1 = a[0]; l2 = a[1]; l3 = a[2];
  b[0] = l1; b[1] = l2;
  c[0] = l2; c[1] = l3;

we will have a SLP load node with three lanes plus two lane permutation
nodes selecting two lanes each.  In a loop context this will cause a VF
of two and three loads per vectorized loop iteration (plus two permutes)
while previously we used a VF of one with two loads and no permutation
per loop iteration.  In the new scheme the number of loads is less but
we pay with permutes and a higher VF.

There is (bad) interaction with determining the vectorization factor,
which for BB vectorization causes missed vectorizations because the
"load parts of a dataref group directly" optimization is not (yet)
reflected in the SLP graph.

There is (bad) interaction with CTOR vectorization since we now get
confused about where to insert vectorized stmts when combining two
previously distinct SLP graphs.
My immediate focus is on the SSA verification FAILs but the other part
points at a missing piece in this - a "pass" that "optimizes" the SLP
graph with respect to permutations and loads, ultimately for example
deciding between using interleaving schemes, scalar loads, "SLP" +
permutation, load-lanes, etc.  This is also the thing that blocks SLP
only (apart from teaching the few pieces that cannot do SLP to do SLP).

I'm posting this mostly to make people think how it fits their future
development and architecture features.  And of course to see whether
there are already made-up ideas how to structure such "optimization".

Thanks,
Richard.

PS: the patch doesn't bootstrap; it ICEs during the libgfortran build.
C vect.exp FAILs are

FAIL: gcc.dg/vect/pr59354.c (internal compiler error)
FAIL: gcc.dg/vect/pr88903-2.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-10.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-11.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-12.c (internal compiler error)
FAIL: gcc.dg/vect/slp-23.c scan-tree-dump-times vect "vectorizing stmts using SLP" 2
FAIL: gcc.dg/vect/slp-43.c (internal compiler error)
FAIL: gcc.dg/vect/bb-slp-19.c scan-tree-dump-times slp2 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-29.c scan-tree-dump-times slp1 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
FAIL: gcc.dg/vect/bb-slp-subgroups-2.c (internal compiler error)
FAIL: gcc.dg/vect/bb-slp-subgroups-3.c (internal compiler error)

	* tree-vectorizer.h (vect_get_place_in_interleaving_chain): Add
	flag parameter defaulted to true.
	* tree-vect-slp.c (vect_get_place_in_interleaving_chain): Implement
	alternate mode not counting gaps.
	(vect_build_slp_tree_2): Split load nodes into load and lane
	permutation.
	(vect_attempt_slp_rearrange_stmts): Pass in the unroll factor
	to consider.
	(calculate_unrolling_factor): New function.
	(vect_analyze_slp_instance): Adjust.
	(vect_analyze_slp): Nail down the unroll factor before
	optimizing permutations in SLP reductions.
	(vect_make_slp_decision): Remove determining unroll factor here.
	(vect_schedule_slp_instance): Adjust where we emit vectorized
	stmts.
---
 gcc/tree-vect-slp.c   | 246 +-
 gcc/tree-vectorizer.h |   3 +-
 2 files changed, 199 insertions(+), 50 deletions(-)

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 5c169f37022..68b0750db44 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -244,7 +244,8 @@ vect_contains_pattern_stmt_p (vec<stmt_vec_info> stmts)
 int
 vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
-				      stmt_vec_info first_stmt_info)
+				      stmt_vec_info first_stmt_info,
+				      bool add_gaps)
 {
   stmt_vec_info next_stmt_info = first_stmt_info;
   int result = 0;
@@ -258,7 +259,7 @@ vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
 	return result;
       next_stmt_info =