Re: [PATCH] Split load permutation

2020-07-01 Thread Richard Biener
On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Richard Biener  writes:
> >> Another thing we'd like for SVE in general is to allow vectors *with*
> >> gaps throughout the SLP graph, since with predication it's simple to
> >> ignore any inactive element where necessary.  This is often much cheaper
> >> than packing the active elements together to remove gaps.  For example:
> >> 
> >>   a[0] += 1;
> >>   a[2] += 1;
> >>   a[3] += 1;
> >> 
> >> should just be a predicated load, add and store, with no shuffling.
> >> 
> >> Even on targets without predication, it might be better to leave
> >> the gaps in for part of the graph, e.g. if two loads feed a single
> >> permute.  So it would be good if the node representation allowed
> >> arbitrary permutations, including with “dead” elements at any
> >> position.  (I realise that isn't news and that keeping the gap
> >> removal with the load was just “for now” :-))
> >
> > Hmm, OK.  So I'm still experimenting with how to incrementally push
> > this kind of change to master.  Unfortunately it resists any
> > "serious" change because we've painted ourselves into a corner with
> > respect to load and store handling ;)  Well, it just means the change
> > will need to be bigger than anticipated.
> >
> > As for inactive lanes, for SVE this means a mask register for each
> > operation, correct?
> 
> Certainly for potentially trapping ops and reductions, yeah.
> For others it really doesn't matter (and those wouldn't require
> a target that supports predication).
> 
> So I don't think having gaps necessarily means we have a mask.
> Being able to attach masks to nodes (perhaps independently of gaps)
> would be useful though.
> 
> > At the moment the SLP graph is a representation
> > of the scalar GIMPLE IL and we cannot really change that until we
> > commit to a vectorization factor and more.  So an inactive lane
> > is simply not there and a "load" is simply another way of building
> > up a vector from scalar stmt results much like those "built from scalars"
> > SLP nodes but we of course make them special because we have those
> > DR groups that are used.
> >
> > So with the patch as posted the "gaps" are represented in the
> > load permutation of the "loads" which is where you could create
> > mask registers from and simply push them to SLP parents (up to
> > the point you get to a parent whose children have differing masks...).
> 
> OK.  But I'd argue that's true of permutations too.  At the point
> that the SLP graph just represents scalar gimple IL, we can simply
> permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
> operations in the graph.

That's true when the only permutes are in leaves.  The
SLP_TREE_SCALAR_STMTS impose an order of lanes, so once
different parents need a different order we need explicit permute
nodes swapping SLP_TREE_SCALAR_STMTS.  Of course that's only
because we have combined multiple scalar stmts into a single
SLP node which really is ...

> Permutations and gaps only come into play once we add more
> vector-related information or restrictions.

... what those "vector-related" restrictions are right now.

> > I think we're going to have a set of post-processing steps on the
> > initial SLP graph for both "optimizing" where (and if) to do permutations
> > and whether to elide gap removal in favor of masks (and we'd then
> > annotate the SLP nodes with the active mask).
> 
> So we'd start out without any permutations and gaps, and then add
> them as post-processing step based on what the target can handle?
> If so, sounds good to me FWIW.

If you start with the notion of loads and stores being
compositions/extracts then yes - based on how we want to
vectorize those operations we'd expose any required permutation
explicitly so we can then combine & optimize those.

For the existing patches I tried to relate loads involving the
same data-ref group by introducing a shared SLP node accessing the
whole group, so that's not starting with no permutes, and it comes
with some issues.  If we don't relate loads from the same dataref
group early (during SLP discovery, that is), then we have to
try (again based on cost) doing that during the post-processing,
which then becomes more complicated.

As said - this is work in progress and I'm experimenting in
multiple directions ...

> > All of this requires pushing back some of the early decisions
> > (I've mostly identified determining the vectorization factor) but also
> > doing load/store classification early.  For example if we end up
> > with strided loads or stores such operations can fuse with any
> > permutation at no cost.
> >
> > At the moment I'm continuing to poke the code for its least
> > resistance for introducing at least parts of the machinery.
> > I'm targeting post-processing for merging of identical
> > permutes across operations like it appears for
> >
> >   a[0] = b[1] + c[1];
> >   a[1] = b[0] + c[0];
> >
> >> I guess this to some extent feeds into a long-standing TODO to allow
> >> “don't care” elements in things like vec_perm_indices.

Re: [PATCH] Split load permutation

2020-06-30 Thread Richard Sandiford
Richard Biener  writes:
>> Another thing we'd like for SVE in general is to allow vectors *with*
>> gaps throughout the SLP graph, since with predication it's simple to
>> ignore any inactive element where necessary.  This is often much cheaper
>> than packing the active elements together to remove gaps.  For example:
>> 
>>   a[0] += 1;
>>   a[2] += 1;
>>   a[3] += 1;
>> 
>> should just be a predicated load, add and store, with no shuffling.
>> 
>> Even on targets without predication, it might be better to leave
>> the gaps in for part of the graph, e.g. if two loads feed a single
>> permute.  So it would be good if the node representation allowed
>> arbitrary permutations, including with “dead” elements at any
>> position.  (I realise that isn't news and that keeping the gap
>> removal with the load was just “for now” :-))
>
> Hmm, OK.  So I'm still experimenting with how to incrementally push
> this kind of change to master.  Unfortunately it resists any
> "serious" change because we've painted ourselves into a corner with
> respect to load and store handling ;)  Well, it just means the change
> will need to be bigger than anticipated.
>
> As for inactive lanes, for SVE this means a mask register for each
> operation, correct?

Certainly for potentially trapping ops and reductions, yeah.
For others it really doesn't matter (and those wouldn't require
a target that supports predication).

So I don't think having gaps necessarily means we have a mask.
Being able to attach masks to nodes (perhaps independently of gaps)
would be useful though.

> At the moment the SLP graph is a representation
> of the scalar GIMPLE IL and we cannot really change that until we
> commit to a vectorization factor and more.  So an inactive lane
> is simply not there and a "load" is simply another way of building
> up a vector from scalar stmt results much like those "built from scalars"
> SLP nodes but we of course make them special because we have those
> DR groups that are used.
>
> So with the patch as posted the "gaps" are represented in the
> load permutation of the "loads" which is where you could create
> mask registers from and simply push them to SLP parents (up to
> the point you get to a parent whose children have differing masks...).

OK.  But I'd argue that's true of permutations too.  At the point
that the SLP graph just represents scalar gimple IL, we can simply
permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
operations in the graph.

Permutations and gaps only come into play once we add more
vector-related information or restrictions.

> I think we're going to have a set of post-processing steps on the
> initial SLP graph for both "optimizing" where (and if) to do permutations
> and whether to elide gap removal in favor of masks (and we'd then
> annotate the SLP nodes with the active mask).

So we'd start out without any permutations and gaps, and then add
them as post-processing step based on what the target can handle?
If so, sounds good to me FWIW.

> All of this requires pushing back some of the early decisions
> (I've mostly identified determining the vectorization factor) but also
> doing load/store classification early.  For example if we end up
> with strided loads or stores such operations can fuse with any
> permutation at no cost.
>
> At the moment I'm continuing to poke the code for its least
> resistance for introducing at least parts of the machinery.
> I'm targeting post-processing for merging of identical
> permutes across operations like it appears for
>
>   a[0] = b[1] + c[1];
>   a[1] = b[0] + c[0];
>
>> I guess this to some extent feeds into a long-standing TODO to allow
>> “don't care” elements in things like vec_perm_indices.
>
> Hmm, so when there's no masking and we don't want to remove gaps
> we'd replace the permutation removing the gaps with an operation
> that turns the inactive lanes to zero (bitwise and) or all ones
> (bitwise or).  Tracking the number of "vector" lanes compared
> to the number of "real scalar" lanes in a SLP node might be enough
> to handle this more or less transparently.

I was also thinking of cases like:

  a[0] += b[0];
  a[1] += c[1];
  a[2] += b[2];
  a[3] += c[3];

(for all targets).

If we can somehow be smart about the handling of “c” and load a vector
at c[0] rather than c[1], the rhs of the addition could be a blend of:

  b[0] ---- b[2] ----
  ---- c[1] ---- c[3]

Or, even if we can't:

  b[0] ---- b[2] ----
  c[1] ---- c[3] ----

is a natural permute on some targets (e.g. Arm ones).

This of course requires the usual proof that the other elements of b[]
and c[] can be safely loaded.

Thanks,
Richard


Re: [PATCH] Split load permutation

2020-06-30 Thread Richard Biener
On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Sorry for the slow response…

No problem.

> Richard Biener  writes:
> > This splits SLP load nodes with load permutation into a SLP
> > load node (with load "permutation" removing gaps) and a
> > lane permutation node.  The first and foremost goal of this
> > is to be able to have a single SLP node for each load group
> > so we can start making decisions about how to vectorize
> > them factoring in all loads of that group.  The second
> > goal is to eventually be able to optimize permutations by
> > pushing them down where they can be combined from multiple
> > children to the output.  We do have one very special case
> > handled by vect_attempt_slp_rearrange_stmts doing it all
> > the way down for reductions that are associative.
> 
> Sounds great!
> 
> > For example for
> >
> >   l1 = a[0]; l2 = a[1];
> >   b[0] = l1; b[1] = l2;
> >   c[0] = l2; c[1] = l1;
> >
> > we can avoid generating loads twice.  For
> >
> >   l1 = a[0]; l2 = a[1]; l3 = a[2];
> >   b[0] = l1; b[1] = l2;
> >   c[0] = l2; c[1] = l3;
> >
> > we will have a SLP load node with three lanes plus
> > two lane permutation nodes selecting two lanes each.  In
> > a loop context this will cause a VF of two and three
> > loads per vectorized loop iteration (plus two permutes)
> > while previously we used a VF of one with two loads
> > and no permutation per loop iteration.  In the new
> > scheme the number of loads is less but we pay with
> > permutes and a higher VF.
> >
> > There is (bad) interaction with determining the vectorization
> > factor, which for BB vectorization causes missed vectorizations
> > because the "load parts of a dataref group directly" optimization
> > is not (yet) reflected in the SLP graph.
> >
> > There is (bad) interaction with CTOR vectorization since we now
> > get confused about where to insert vectorized stmts when combining
> > two previously distinct SLP graphs.
> >
> > My immediate focus is on the SSA verification FAILs but the
> > other part points at a missing piece in this - a "pass"
> > that "optimizes" the SLP graph with respect to permutations
> > and loads, ultimately for example deciding between using
> > interleaving schemes, scalar loads, "SLP" + permutation,
> > load-lanes, etc.  This is also the thing that blocks going
> > SLP-only (apart from teaching the few pieces that cannot do SLP
> > to do SLP).
> >
> > I'm posting this mostly to make people think about how it fits
> > their future development and architecture features.
> 
> Yeah, the interleaving scheme is something we'd very much like for SVE2,
> where for some operations that produce multiple vectors, it would be better
> to organise them on an even/odd division instead of a low/high division.
> There are instructions both to produce and to consume values in odd/even
> form, so in many cases no explicit permutation would be needed.
> 
> I've also seen cases for downward loops where we reverse vectors after
> loading and reverse them again before storing, even though the loop
> could easily have operated on the reversed vectors directly.
>
> Another thing we'd like for SVE in general is to allow vectors *with*
> gaps throughout the SLP graph, since with predication it's simple to
> ignore any inactive element where necessary.  This is often much cheaper
> than packing the active elements together to remove gaps.  For example:
> 
>   a[0] += 1;
>   a[2] += 1;
>   a[3] += 1;
> 
> should just be a predicated load, add and store, with no shuffling.
> 
> Even on targets without predication, it might be better to leave
> the gaps in for part of the graph, e.g. if two loads feed a single
> permute.  So it would be good if the node representation allowed
> arbitrary permutations, including with “dead” elements at any
> position.  (I realise that isn't news and that keeping the gap
> removal with the load was just “for now” :-))

Hmm, OK.  So I'm still experimenting with how to incrementally push
this kind of change to master.  Unfortunately it resists any
"serious" change because we've painted ourselves into a corner with
respect to load and store handling ;)  Well, it just means the change
will need to be bigger than anticipated.

As for inactive lanes, for SVE this means a mask register for each
operation, correct?  At the moment the SLP graph is a representation
of the scalar GIMPLE IL and we cannot really change that until we
commit to a vectorization factor and more.  So an inactive lane
is simply not there and a "load" is simply another way of building
up a vector from scalar stmt results much like those "built from scalars"
SLP nodes but we of course make them special because we have those
DR groups that are used.

So with the patch as posted the "gaps" are represented in the
load permutation of the "loads" which is where you could create
mask registers from and simply push them to SLP parents (up to
the point you get to a parent whose children have differing masks...).

I think we're going to have a set of post-processing steps on the
initial SLP graph for both "optimizing" where (and if) to do permutations
and whether to elide gap removal in favor of masks (and we'd then
annotate the SLP nodes with the active mask).

Re: [PATCH] Split load permutation

2020-06-30 Thread Richard Sandiford
Sorry for the slow response…

Richard Biener  writes:
> This splits SLP load nodes with load permutation into a SLP
> load node (with load "permutation" removing gaps) and a
> lane permutation node.  The first and foremost goal of this
> is to be able to have a single SLP node for each load group
> so we can start making decisions about how to vectorize
> them factoring in all loads of that group.  The second
> goal is to eventually be able to optimize permutations by
> pushing them down where they can be combined from multiple
> children to the output.  We do have one very special case
> handled by vect_attempt_slp_rearrange_stmts doing it all
> the way down for reductions that are associative.

Sounds great!

> For example for
>
>   l1 = a[0]; l2 = a[1];
>   b[0] = l1; b[1] = l2;
>   c[0] = l2; c[1] = l1;
>
> we can avoid generating loads twice.  For
>
>   l1 = a[0]; l2 = a[1]; l3 = a[2];
>   b[0] = l1; b[1] = l2;
>   c[0] = l2; c[1] = l3;
>
> we will have a SLP load node with three lanes plus
> two lane permutation nodes selecting two lanes each.  In
> a loop context this will cause a VF of two and three
> loads per vectorized loop iteration (plus two permutes)
> while previously we used a VF of one with two loads
> and no permutation per loop iteration.  In the new
> scheme the number of loads is less but we pay with
> permutes and a higher VF.
>
> There is (bad) interaction with determining the vectorization
> factor, which for BB vectorization causes missed vectorizations
> because the "load parts of a dataref group directly" optimization
> is not (yet) reflected in the SLP graph.
>
> There is (bad) interaction with CTOR vectorization since we now
> get confused about where to insert vectorized stmts when combining
> two previously distinct SLP graphs.
>
> My immediate focus is on the SSA verification FAILs but the
> other part points at a missing piece in this - a "pass"
> that "optimizes" the SLP graph with respect to permutations
> and loads, ultimately for example deciding between using
> interleaving schemes, scalar loads, "SLP" + permutation,
> load-lanes, etc.  This is also the thing that blocks going
> SLP-only (apart from teaching the few pieces that cannot do SLP
> to do SLP).
>
> I'm posting this mostly to make people think about how it fits
> their future development and architecture features.

Yeah, the interleaving scheme is something we'd very much like for SVE2,
where for some operations that produce multiple vectors, it would be better
to organise them on an even/odd division instead of a low/high division.
There are instructions both to produce and to consume values in odd/even
form, so in many cases no explicit permutation would be needed.

I've also seen cases for downward loops where we reverse vectors after
loading and reverse them again before storing, even though the loop
could easily have operated on the reversed vectors directly.

Another thing we'd like for SVE in general is to allow vectors *with*
gaps throughout the SLP graph, since with predication it's simple to
ignore any inactive element where necessary.  This is often much cheaper
than packing the active elements together to remove gaps.  For example:

  a[0] += 1;
  a[2] += 1;
  a[3] += 1;

should just be a predicated load, add and store, with no shuffling.

Even on targets without predication, it might be better to leave
the gaps in for part of the graph, e.g. if two loads feed a single
permute.  So it would be good if the node representation allowed
arbitrary permutations, including with “dead” elements at any
position.  (I realise that isn't news and that keeping the gap
removal with the load was just “for now” :-))

I guess this to some extent feeds into a long-standing TODO to allow
“don't care” elements in things like vec_perm_indices.

Thanks,
Richard


[PATCH] Split load permutation

2020-06-19 Thread Richard Biener


This splits SLP load nodes with load permutation into a SLP
load node (with load "permutation" removing gaps) and a
lane permutation node.  The first and foremost goal of this
is to be able to have a single SLP node for each load group
so we can start making decisions about how to vectorize
them factoring in all loads of that group.  The second
goal is to eventually be able to optimize permutations by
pushing them down where they can be combined from multiple
children to the output.  We do have one very special case
handled by vect_attempt_slp_rearrange_stmts doing it all
the way down for reductions that are associative.

For example for

  l1 = a[0]; l2 = a[1];
  b[0] = l1; b[1] = l2;
  c[0] = l2; c[1] = l1;

we can avoid generating loads twice.  For

  l1 = a[0]; l2 = a[1]; l3 = a[2];
  b[0] = l1; b[1] = l2;
  c[0] = l2; c[1] = l3;

we will have a SLP load node with three lanes plus
two lane permutation nodes selecting two lanes each.  In
a loop context this will cause a VF of two and three
loads per vectorized loop iteration (plus two permutes)
while previously we used a VF of one with two loads
and no permutation per loop iteration.  In the new
scheme the number of loads is less but we pay with
permutes and a higher VF.

There is (bad) interaction with determining the vectorization
factor, which for BB vectorization causes missed vectorizations
because the "load parts of a dataref group directly" optimization
is not (yet) reflected in the SLP graph.

There is (bad) interaction with CTOR vectorization since we now
get confused about where to insert vectorized stmts when combining
two previously distinct SLP graphs.

My immediate focus is on the SSA verification FAILs but the
other part points at a missing piece in this - a "pass"
that "optimizes" the SLP graph with respect to permutations
and loads, ultimately for example deciding between using
interleaving schemes, scalar loads, "SLP" + permutation,
load-lanes, etc.  This is also the thing that blocks going
SLP-only (apart from teaching the few pieces that cannot do SLP
to do SLP).

I'm posting this mostly to make people think about how it fits
their future development and architecture features.

And of course to see whether there are already made up ideas
how to structure such "optimization".

Thanks,
Richard.

PS: patch doesn't bootstrap, it ICEs during libgfortran build.
The C vect.exp FAILs are:

FAIL: gcc.dg/vect/pr59354.c (internal compiler error)
FAIL: gcc.dg/vect/pr88903-2.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-10.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-11.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-12.c (internal compiler error)
FAIL: gcc.dg/vect/slp-23.c scan-tree-dump-times vect "vectorizing stmts using SLP" 2
FAIL: gcc.dg/vect/slp-43.c (internal compiler error)
FAIL: gcc.dg/vect/bb-slp-19.c scan-tree-dump-times slp2 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-29.c scan-tree-dump-times slp1 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
FAIL: gcc.dg/vect/bb-slp-subgroups-2.c (internal compiler error)
FAIL: gcc.dg/vect/bb-slp-subgroups-3.c (internal compiler error)

* tree-vectorizer.h (vect_get_place_in_interleaving_chain):
Add flag parameter defaulted to true.
* tree-vect-slp.c (vect_get_place_in_interleaving_chain):
Implement alternate mode not counting gaps.
(vect_build_slp_tree_2): Split load nodes into load and
lane permutation.
(vect_attempt_slp_rearrange_stmts): Pass in the unroll
factor to consider.
(calculate_unrolling_factor): New function.
(vect_analyze_slp_instance): Adjust.
(vect_analyze_slp): Nail down the unroll factor before
optimizing permutations in SLP reductions.
(vect_make_slp_decision): Remove determining unroll
factor here.
(vect_schedule_slp_instance): Adjust where we emit vectorized
stmts.
---
 gcc/tree-vect-slp.c   | 246 +-
 gcc/tree-vectorizer.h |   3 +-
 2 files changed, 199 insertions(+), 50 deletions(-)

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 5c169f37022..68b0750db44 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -244,7 +244,8 @@ vect_contains_pattern_stmt_p (vec<stmt_vec_info> stmts)
 
 int
 vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
- stmt_vec_info first_stmt_info)
+ stmt_vec_info first_stmt_info,
+ bool add_gaps)
 {
   stmt_vec_info next_stmt_info = first_stmt_info;
   int result = 0;
@@ -258,7 +259,7 @@ vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
return result;
   next_stmt_info =