On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Sorry for the slow response…

No problem.

> Richard Biener <rguent...@suse.de> writes:
> > This splits SLP load nodes with load permutation into a SLP
> > load node (with load "permutation" removing gaps) and a
> > lane permutation node.  The first and foremost goal of this
> > is to be able to have a single SLP node for each load group
> > so we can start making decisions about how to vectorize
> > them factoring in all loads of that group.  The second
> > goal is to eventually be able to optimize permutations by
> > pushing them down where they can be combined from multiple
> > children to the output.  We do have one very special case
> > handled by vect_attempt_slp_rearrange_stmts doing it all
> > the way down for reductions that are associative.
> 
> Sounds great!
> 
> > For example for
> >
> >   l1 = a[0]; l2 = a[1];
> >   b[0] = l1; b[1] = l2;
> >   c[0] = l2; c[1] = l1;
> >
> > we can avoid generating loads twice.  For
> >
> >   l1 = a[0]; l2 = a[1]; l3 = a[2];
> >   b[0] = l1; b[1] = l2;
> >              c[0] = l2; c[1] = l3;
> >
> > we will have a SLP load node with three lanes plus
> > two lane permutation nodes selecting two lanes each.  In
> > a loop context this will cause a VF of two and three
> > loads per vectorized loop iteration (plus two permutes)
> > while previously we used a VF of one with two loads
> > and no permutation per loop iteration.  In the new
> > scheme the number of loads is lower, but we pay with
> > permutes and a higher VF.
> >
> > There is a (bad) interaction with determining the vectorization
> > factor, which for BB vectorization causes missed vectorizations
> > because the "load parts of a dataref group directly" optimization
> > is not (yet) reflected in the SLP graph.
> >
> > There is (bad) interaction with CTOR vectorization since we now
> > get confused about where to insert vectorized stmts when combining
> > two previously distinct SLP graphs.
> >
> > My immediate focus is on the SSA verification FAILs, but the
> > other part points at a missing piece here: a "pass"
> > that "optimizes" the SLP graph with respect to permutations
> > and loads, ultimately for example deciding between using
> > interleaving schemes, scalar loads, "SLP" + permutation,
> > load-lanes, etc.  This is also the piece that blocks a move to
> > SLP-only vectorization (apart from teaching the few parts that
> > cannot do SLP to do SLP).
> >
> > I'm posting this mostly to make people think how it fits
> > their future development and architecture features.
> 
> Yeah, the interleaving scheme is something we'd very much like for SVE2,
> where for some operations that produce multiple vectors, it would be better
> to organise them on an even/odd division instead of a low/high division.
> There are instructions both to produce and to consume values in odd/even
> form, so in many cases no explicit permutation would be needed.
> 
> I've also seen cases for downward loops where we reverse vectors after
> loading and reverse them again before storing, even though the loop
> could easily have operated on the reversed vectors directly.
>
> Another thing we'd like for SVE in general is to allow vectors *with*
> gaps throughout the SLP graph, since with predication it's simple to
> ignore any inactive element where necessary.  This is often much cheaper
> than packing the active elements together to remove gaps.  For example:
> 
>   a[0] += 1;
>   a[2] += 1;
>   a[3] += 1;
> 
> should just be a predicated load, add and store, with no shuffling.
> 
> Even on targets without predication, it might be better to leave
> the gaps in for part of the graph, e.g. if two loads feed a single
> permute.  So it would be good if the node representation allowed
> arbitrary permutations, including with “dead” elements at any
> position.  (I realise that isn't news and that keeping the gap
> removal with the load was just “for now” :-))

Hmm, OK.  So I'm still experimenting with how to push these kinds
of changes to master incrementally.  Unfortunately the code resists
any "serious" change because we've painted ourselves into a corner
with respect to load and store handling ;)  Well, it just means the
change will need to be bigger than anticipated.

As for inactive lanes: for SVE this means a mask register for each
operation, correct?  At the moment the SLP graph is a representation
of the scalar GIMPLE IL and we cannot really change that until we
commit to a vectorization factor and more.  So an inactive lane
is simply not there, and a "load" is simply another way of building
up a vector from scalar stmt results, much like those "built from
scalars" SLP nodes; of course we make loads special because of the
DR groups involved.

So with the patch as posted the "gaps" are represented in the
load permutation of the "loads", which is where you could create
mask registers from and simply push them up to the SLP parents (up
to the point where you reach a parent whose children have differing
masks...).

I think we're going to need a set of post-processing steps on the
initial SLP graph, both for "optimizing" where (and whether) to do
permutations and for deciding whether to elide gap removal in favor
of masks (we'd then annotate the SLP nodes with the active mask).

All of this requires deferring some of the early decisions (mostly
the determination of the vectorization factor, as far as I have
identified) but also doing load/store classification early.  For
example, if we end up with strided loads or stores, such operations
can fuse with any permutation at no cost.

At the moment I'm continuing to probe the code for the path of
least resistance for introducing at least parts of the machinery.
I'm targeting post-processing that merges identical permutes
across operations, as they appear for

  a[0] = b[1] + c[1];
  a[1] = b[0] + c[0];

> I guess this to some extent feeds into a long-standing TODO to allow
> “don't care” elements in things like vec_perm_indices.

Hmm, so when there's no masking and we don't want to remove gaps,
we'd replace the permutation removing the gaps with an operation
that turns the inactive lanes into zeros (bitwise and) or all ones
(bitwise or).  Tracking the number of "vector" lanes compared
to the number of "real scalar" lanes in an SLP node might be enough
to handle this more or less transparently.

Richard.
