Richard Biener <richard.guent...@gmail.com> writes:
> On Tue, Nov 21, 2017 at 11:47 PM, Richard Sandiford
> <richard.sandif...@linaro.org> wrote:
>> Richard Biener <richard.guent...@gmail.com> writes:
>>> On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
>>> <richard.sandif...@linaro.org> wrote:
>>>> Richard Biener <richard.guent...@gmail.com> writes:
>>>>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <l...@redhat.com> wrote:
>>>>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>>>>> ...so that we can use them for variable-length vectors.  For now
>>>>>>> constant-length vectors continue to use VEC_PERM_EXPR and the
>>>>>>> vec_perm_const optab even for cases that the new optabs could
>>>>>>> handle.
>>>>>>>
>>>>>>> The vector optabs are inconsistent about whether there should be
>>>>>>> an underscore before the mode part of the name, but the other lo/hi
>>>>>>> optabs have one.
>>>>>>>
>>>>>>> Doing this means that we're able to optimise some SLP tests using
>>>>>>> non-SLP (for now) on targets with variable-length vectors, so the
>>>>>>> patch needs to add a few XFAILs.  Most of these go away with later
>>>>>>> patches.
>>>>>>>
>>>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>>>> and powerpc64le-linux-gnu.  OK to install?
>>>>>>>
>>>>>>> Richard
>>>>>>>
>>>>>>>
>>>>>>> 2017-11-09  Richard Sandiford  <richard.sandif...@linaro.org>
>>>>>>>           Alan Hayward  <alan.hayw...@arm.com>
>>>>>>>           David Sherwood  <david.sherw...@arm.com>
>>>>>>>
>>>>>>> gcc/
>>>>>>>       * doc/md.texi (vec_reverse, vec_interleave_lo, vec_interleave_hi)
>>>>>>>       (vec_extract_even, vec_extract_odd): Document new optabs.
>>>>>>>       * internal-fn.def (VEC_INTERLEAVE_LO, VEC_INTERLEAVE_HI)
>>>>>>>       (VEC_EXTRACT_EVEN, VEC_EXTRACT_ODD, VEC_REVERSE): New internal
>>>>>>>       functions.
>>>>>>>       * optabs.def (vec_interleave_lo_optab, vec_interleave_hi_optab)
>>>>>>>       (vec_extract_even_optab, vec_extract_odd_optab, vec_reverse_optab):
>>>>>>>       New optabs.
>>>>>>>       * tree-vect-data-refs.c: Include internal-fn.h.
>>>>>>>       (vect_grouped_store_supported): Try using IFN_VEC_INTERLEAVE_{LO,HI}.
>>>>>>>       (vect_permute_store_chain): Use them here too.
>>>>>>>       (vect_grouped_load_supported): Try using IFN_VEC_EXTRACT_{EVEN,ODD}.
>>>>>>>       (vect_permute_load_chain): Use them here too.
>>>>>>>       * tree-vect-stmts.c (can_reverse_vector_p): New function.
>>>>>>>       (get_negative_load_store_type): Use it.
>>>>>>>       (reverse_vector): New function.
>>>>>>>       (vectorizable_store, vectorizable_load): Use it.
>>>>>>>       * config/aarch64/iterators.md (perm_optab): New iterator.
>>>>>>>       * config/aarch64/aarch64-sve.md (<perm_optab>_<mode>): New expander.
>>>>>>>       (vec_reverse_<mode>): Likewise.
>>>>>>>
>>>>>>> gcc/testsuite/
>>>>>>>       * gcc.dg/vect/no-vfa-vect-depend-2.c: Remove XFAIL.
>>>>>>>       * gcc.dg/vect/no-vfa-vect-depend-3.c: Likewise.
>>>>>>>       * gcc.dg/vect/pr33953.c: XFAIL for vect_variable_length.
>>>>>>>       * gcc.dg/vect/pr68445.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-12a.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-13-big-array.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-13.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-14.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-15.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-42.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-multitypes-2.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-multitypes-4.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-multitypes-5.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-reduc-4.c: Likewise.
>>>>>>>       * gcc.dg/vect/slp-reduc-7.c: Likewise.
>>>>>>>       * gcc.target/aarch64/sve_vec_perm_2.c: New test.
>>>>>>>       * gcc.target/aarch64/sve_vec_perm_2_run.c: Likewise.
>>>>>>>       * gcc.target/aarch64/sve_vec_perm_3.c: New test.
>>>>>>>       * gcc.target/aarch64/sve_vec_perm_3_run.c: Likewise.
>>>>>>>       * gcc.target/aarch64/sve_vec_perm_4.c: New test.
>>>>>>>       * gcc.target/aarch64/sve_vec_perm_4_run.c: Likewise.
>>>>>> OK.
>>>>>
>>>>> It's really a step backwards - we had those optabs and a tree code in
>>>>> the past, and canonicalizing things to VEC_PERM_EXPR made things simpler.
>>>>>
>>>>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> work?
>>>>
>>>> The problems with that are:
>>>>
>>>> - It doesn't work for vectors with 256 or more byte-sized elements
>>>>   (e.g. v256qi), because the indices wrap round.
>>>
>>> That's a general issue that would need to be addressed for larger
>>> vectors (GCN?).
>>> I presume the requirement that the permutation vector have the same size
>>> needs to be relaxed.
>>>
>>>> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>>>>   special cases would be hard, especially since v256hi isn't a normal
>>>>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>>>>   then have to worry about that special case.
>>>
>>> I think it's not really a special case - any code here should just expect
>>> the same number of vector elements and not a particular size.  You already
>>> dealt with using a char[] vector for permutations I think.
>>
>> It sounds like you're talking about the case in which the permutation
>> vector is a VECTOR_CST.  We still use VEC_PERM_EXPRs for constant-length
>> vectors, so that doesn't change.  (And yes, that probably means that it
>> does break for *fixed-length* 2048-bit vectors.)
>>
>> But this patch is about the variable-length case, in which the
>> permutation vector is never a VECTOR_CST, and couldn't get converted
>> to a vec_perm_indices array.  As far as existing code is concerned,
>> it's no different from a VEC_PERM_EXPR with a variable permutation
>> vector.
>
> But the permutation vector is constant as well - this is what you added
> that VEC_SERIES_CST stuff and whatnot for.
>
> I don't want variable-size vector special-casing everywhere.  I want it to be
> somehow naturally integrating with existing stuff.

It's going to be a special case whatever happens though.  If it's a
VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR.  The advantage
of the internal functions and optabs is that they map to a concept that
already exists.  The code that generates the permutation already knows that
it's generating an interleave lo/hi, and like you say, it used to do that
directly via special tree codes.  I agree that having a VEC_PERM_EXPR makes
more sense for the constant-length case, but the concept is still there.

And although using VEC_PERM_EXPR in gimple makes sense, I think not
having the optabs is a step backwards, because it means that every
target with interleave lo/hi has to duplicate the detection logic.
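
For example, with only the permute-constant route, each backend hook ends
up re-recognising the mask along these lines (a minimal sketch of the kind
of check involved, not code from the patch; the helper name is made up):

  /* Sketch only: return true if SEL, with NELT entries selecting from
     two NELT-element input vectors, is an interleave of the two low
     halves, i.e. { 0, NELT, 1, NELT + 1, ... }.  */
  static bool
  is_interleave_lo_p (const unsigned char *sel, unsigned int nelt)
  {
    for (unsigned int i = 0; i < nelt / 2; i++)
      if (sel[2 * i] != i || sel[2 * i + 1] != i + nelt)
        return false;
    return true;
  }

With the optabs, that knowledge stays in one place and each target just
provides an expander.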

>> So by taking this approach, we'd effectively be committing to supporting
>> VEC_PERM_EXPRs with variable permutation vectors that are wider than
>> the vectors being permuted.  Those permutation vectors will usually not
>> have a vector_mode_supported_p mode and will have to be synthesised
>> somehow.  Trying to support the general case like this could be incredibly
>> expensive.  Only certain special cases like interleave hi/lo could be
>> handled cheaply.
>
> As far as I understand SVE only supports interleave / extract even/odd anyway.
>
>>>> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>>>>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>>>>   I guess it's possible to represent that using a combination of
>>>>   shifts, masks, and additions, but then:
>>>>
>>>>   1) when generating them, we'd need to make sure that we cost the
>>>>      operation as a single permute, rather than costing all the shifts,
>>>>      masks and additions
>>>>
>>>>   2) we'd need to make sure that all gimple optimisations that run
>>>>      afterwards don't perturb the sequence, otherwise we'll end up
>>>>      with something that's very expensive.
>>>>
>>>>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>>>>      optimisations, and it wouldn't be trivial to add it, since we'd
>>>>      need to re-recognise the sequence first.
>>>>
>>>>   4) expand would need to re-recognise the sequence and use the
>>>>      optab anyway.
>>>
>>> Well, the answer is of course that you just need a more powerful
>>> VEC_SERIES_CST
>>> that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
>>> such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
>>> an INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.
>>
>> It can do lots of other things too :-)  But in all cases as separate
>> statements.  It seems better to expose them as separate statements in
>> gimple so that they get optimised by the more powerful gimple optimisers,
>> rather than waiting until rtl.
>>
>> I think if we go down the route of building more and more operations into
>> the constant, we'll end up inventing a gimple version of rtx CONST.
>>
>> I also don't see why it's OK to expose the concept of interleave hi/lo
>> as an operation on constants but not as an operation on general vectors.
>
> I'm not suggesting to expose it as an operation.  I'm suggesting that if the
> target can vec_perm_const_ok () with an "interleave/extract" permutation
> then we should be able to represent that with VEC_PERM_EXPR and thus
> also represent the permutation vector.

But vec_perm_const_ok () takes a fixed-length mask, so it can't be
used here.  It would need to be a new hook (and thus a new special
case for variable-length vectors).
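
For reference, the hook is declared along these lines (quoted from memory
rather than from the patch):

  /* True if a vector of mode VMODE can be permuted by selector SEL,
     which has one unsigned char entry per output element.  */
  bool vec_perm_const_ok (machine_mode vmode, const unsigned char *sel);

Since the selector is a fixed array indexed per element, there's nothing
it could contain for a variable-length mode.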

> I wasn't too happy with VEC_SERIES_CST either you know.
>
> As said, having something as disruptive as poly_int everywhere, but then
> still needing all that special-casing for variable-length vectors in the
> vectorizer, just looks wrong.
>
> How are you going to handle __builtin_shuffle () with SVE & intrinsics?

The variable case should work with the current constraints, i.e. with
the permutation vector having the same element width as the vectors
being permuted, once there's a way of writing __builtin_shuffle with
variable-length vectors.  That means that 256-element shuffles can't
refer to the second vector, but that's correct.

(The problem here is that the interleaves *do* need to refer to
the second vector, but using wider vectors for the permutation vector
wouldn't be a native operation in general for SVE or for any existing
target.)
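
The fixed-length equivalent already behaves that way.  As an illustration
(my example, not one from the testsuite):

  typedef unsigned char v16qi __attribute__ ((vector_size (16)));

  /* Each mask element is read modulo 32 here (two 16-element inputs).
     For vectors of 256 byte elements, indices 256-511 -- the whole of
     B -- would not be representable in an unsigned char mask.  */
  v16qi
  shuffle2 (v16qi a, v16qi b, v16qi mask)
  {
    return __builtin_shuffle (a, b, mask);
  }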

> How are you going to handle generic vector "lowering"?

We shouldn't generate variable-length vectors that don't exist on the
target (and there's no syntax for doing that in C).

> All the predication stuff is hidden from the middle-end as well.  It would
> have been nice to finally have a nice way to express these things in GIMPLE.

Not sure what you mean here.  The only time predication is hidden is
when SVE requires an all-true predicate for a full-vector operation,
which seems like a target-specific detail.  All "real" predication
is exposed in GIMPLE.  It builds on the existing support for vector
boolean types.
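
For example, a loop like this (an illustrative example, not one of the
new tests) vectorises for SVE with the predicate visible as an ordinary
vector-boolean SSA name feeding .MASK_LOAD and .MASK_STORE:

  void
  f (int *restrict a, int *restrict b, int n)
  {
    for (int i = 0; i < n; ++i)
      if (b[i] > 0)    /* becomes a vector-boolean comparison */
        a[i] += 1;     /* masked load/add/store under that predicate */
  }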

> Bah.
>
>>> Suggested fix: add an interleaved bit to VEC_SERIES_CST.
>>
>> That only handles this one case though.  We'd have to keep making it
>> more and more complicated as more cases come up.  E.g. the extra bit
>> couldn't represent { 0, 1, 2, ..., n/2-1, n, n+1, ... }.
>
> But is there an instruction for this in SVE?  I understand there's a single
> instruction doing interleave low/high and extract even/odd?

This specific example, no.  But my point is that...

> But is there more?  Possibly a generic permute but for it you'd have
> to explicitly construct a permutation vector using some primitives
> like that "series" instruction?  So for that case it's reasonable to
> have GIMPLE like
>
>  perm_vector_1 = VEC_SERIES_EXPR <...>
>  ...
>  v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>;
>
> that is, it's not required to pretend the VEC_PERM_EXPR is a single
> instruction or the permutation vector is "constant"?

...by taking this approach, we're saying that we need to ensure that
there is always a way of representing every directly-supported variable-
length permutation mask as a constant, so that it doesn't get split from
VEC_PERM_EXPR.  I don't see why that's better than having internal
functions.  You said that you don't like the extra constants, and each
time we make the constants more complicated, we have to support those
more complicated constants everywhere that handles constants (rather
than only in the places that handle VEC_PERM_EXPRs).
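
To make that concrete with the interleave-lo mask (using the same
numbering as above, with the two inputs viewed as a single N-element
concatenation), the selector would have to be built like this sketch:

  /* Fill SEL with the interleave-lo selector { 0, N/2, 1, N/2+1, ... }
     for an N-element concatenation of the two input vectors.  */
  static void
  build_interleave_lo (unsigned char *sel, unsigned int n)
  {
    for (unsigned int i = 0; i < n / 2; i++)
      {
        sel[2 * i] = i;              /* i'th element of the first half  */
        sel[2 * i + 1] = n / 2 + i;  /* i'th element of the second half */
      }
  }

The step between consecutive selector elements alternates between +N/2
and 1-N/2, so no single base/step series can describe it; each new
directly-supported permutation needs its own extension to the constant.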

>>> At least I'd like to see it used for the cases it can already handle.
>>>
>>> VEC_PERM_EXPR is supposed to be the only permutation operation; if it cannot
>>> handle some cases it needs to be fixed / its constraints relaxed (like
>>> the v256qi case).
>>
>> I don't think we gain anything by shoehorning everything into one code
>> for the variable-length case though.  None of the existing vec_perm_const
>> code can (or should) be used, and none of the existing VECTOR_CST-based
>> VEC_PERM_EXPR handling will do anything.  That accounts for the majority
>> of the VEC_PERM_EXPR support.  Also, like with VEC_DUPLICATE_CST vs.
>> VECTOR_CST, there won't be any vector types that use a mixture of
>> current VEC_PERM_EXPRs and new VEC_PERM_EXPRs.
>
> I dislike having those as you know.
>
>> So if we used VEC_PERM_EXPRs with VEC_SERIES_CSTs instead of internal
>> permute functions, we wouldn't get any new optimisations for free: we'd
>> have to write new code to match the new constants.
>
> Yeah, too bad we have those new constants ;)
>
>>  So it becomes a
>> question of whether it's easier to do that on VEC_SERIES_CSTs or
>> internal functions.  I think match.pd makes it much easier to optimise
>> internal functions, and you get the added benefit that the result is
>> automatically checked against what the target supports.  And the
>> direct mapping to optabs means that no re-recognition is necessary.
>>
>> It also seems inconsistent to allow things like TARGET_MEM_REF vs.
>> MEM_REF but still require a single gimple permute operation, regardless
>> of circumstances.  I thought we were trying to move in the other direction,
>> i.e. trying to get the power of the gimple optimisers for things that
>> were previously handled by rtl.
>
> Indeed I don't like TARGET_MEM_REF too much either.

Hmm, ok.  Maybe that's the difference here.  It seems like a really
nice feature to me :-)

>>>>   Using an internal function seems much simpler :-)
>>>>
>>>> I think VEC_PERM_EXPR is useful because it represents the same
>>>> operation as __builtin_shuffle, and we want to optimise that as best
>>>> we can.  But these internal functions are only used by the vectoriser,
>>>> which should always see what the final form of the permute should be.
>>>
>>> You hope so.  We have several cases where later unrolling and CSE/forwprop
>>> optimize permutations away.
>>
>> Unrolling doesn't usually expose anything useful for variable-length
>> vectors though, since the iv step is also variable.  I guess it could
>> still happen, but TBH I'd rather take the hit of that than the risk
>> that optimisers could create expensive non-native permutes.
>
> optimizers / folders have to (and do) check vec_perm_const_ok if they
> change a constant permute vector.
>
> Note there's always an advantage of exposing target capabilities directly,
> like on x86 those vec_perm_const_ok VEC_PERM_EXPRs could be
> expanded to the (series of!) native "permute" instructions of x86 by adding
> (target specific!) IFNs.  But then those would be black boxes to all followup
> optimizers which means we could as well have none of those.  But fact is
> the vectorizer isn't perfect and we rely on useless permutes being
> a) CSEd, b) combined, c) eliminated against extracts, etc.  You'd have to
> replicate all VEC_PERM/BIT_FIELD_REF/etc. patterns we have in match.pd
> for all of the target IFNs.  Yes, if we were right before RTL expansion we
> could have those "target IFNs" immediately, but the fact is we do a _lot_
> of optimizations after vectorization.
>
> Oh, and there I mentioned "target IFNs" (and related, "target match.pd").
> You are adding IFNs that exist for each target (because that's how optabs
> work(?)) but in reality you are generating ones that match SVE.  Not
> very nice either.

But the concepts are general, even if they're implemented by only
one target at the moment.  One architecture always has to come first.

E.g. when IFN_MASK_LOAD went in, it was only supported for x86_64.
Adding it as a generic function was still the right thing to do and
meant that all SVE had to do was define the optab.
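
All the gimple optimisers then see is a call like:

  vect__1.6_20 = .MASK_LOAD (addr_18, 32B, loop_mask_19);

(a from-memory sketch of the dump format rather than a real dump), and
each target only has to provide the optab for the IL to be usable.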

I think target IFNs would only make sense if we have some sort
of pre-expand target-specific lowering pass.  (Which might be
a good thing.)  Here we're adding internal functions for things
that the vectoriser has to be aware of.

> That said, all this feels a bit like a hack throughout of GCC rather
> than designing variable-length vectors into GIMPLE and then providing
> some implementation meat in the arm backend(s).  I know you're time
> constrained but I think we're carrying quite a big maintenance burden
> that will be very difficult to "fix" afterwards (because of lack of
> motivation once this is in).  And we've not even seen SVE silicon...
> (happened with SSE5 for example but that at least was x86-only).

I don't think it's a hack, and it didn't end up this way because
of time constraints.  IMO designing variable-length vectors into
gimple means that (a) it needs to be possible to create variable-
length vector constants in gimple (hence the new constants) and
(b) gimple optimisers need to be aware of the fact that vectors
have a variable length, element offsets can be variable, etc.
(hence the poly_int stuff, which is also needed for rtl).
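
To make (b) concrete (illustrative only, using the poly_int interface
from the related patch series): the element count of an SVE vector of
bytes is carried around as the two-coefficient polynomial 16 + 16*x in
the runtime vector-length parameter x, e.g.:

  poly_uint64 nunits (16, 16);   /* 16 + 16*x elements */

so sizes and offsets stay symbolic instead of becoming unknowable.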

Thanks,
Richard
