On Wed, Nov 14, 2018 at 4:21 PM Joern Wolfgang Rennecke
<joern.renne...@riscy-ip.com> wrote:
>
> On 14/11/18 09:53, Richard Biener wrote:
> >> WIDEN_MULT_PLUS is special on our target in that it creates
> >> double-sized vectors.
> > Are there really double-size vectors or does the target simply produce
> > the output in two vectors?  Usually targets have WIDEN_MULT_PLUS_LOW/HIGH
> > or _EVEN/ODD split operations.  Or, like - what I now remember - for the
> > DOT_PROD_EXPR optab, the target already reduces element pairs of the
> > result vector (unspecified which ones) so the result vector is of the
> > same size as the inputs.
> The output of widening multiply and widening multiply-add is stored
> in two consecutive registers.  So, they can be used as separate
> vectors, but you can't choose the register numbers independently.
> OTOH, you can treat them together as a double-sized vector, but
> without any extra alignment requirements over a single-sized vector.
>
> > That is, if your target produces two vectors you may instead want to
> > hide that fact by claiming you support DOT_PROD_EXPR and expanding
> > it to the widen-mult-plus plus reducing (adding) the two result vectors
> > to get a single one.
>
> Doing a part of the reduction in the loop is a bit pointless.
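(A source-level model of that suggested DOT_PROD_EXPR expansion, written
with GCC's generic vector extensions; the vector widths, types and the
function name are illustrative assumptions, not from any proposed patch:)

    typedef short v4hi __attribute__ ((vector_size (8)));
    typedef int   v2si __attribute__ ((vector_size (8)));

    /* One DOT_PROD_EXPR step: the widening multiply conceptually yields
       two v2si result vectors (low and high halves); a single extra
       in-loop vector add folds them back into an accumulator of the
       input vector size.  Which element pairs end up summed together
       is unspecified, as noted above.  */
    static v2si
    dot_prod_step (v4hi a, v4hi b, v2si acc)
    {
      v2si lo = { (int) a[0] * b[0], (int) a[1] * b[1] };
      v2si hi = { (int) a[2] * b[2], (int) a[3] * b[3] };
      return acc + lo + hi;
    }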
The advantage is you can work with a lower vectorization factor and thus
less unrolling (smaller code).

> I have tried another approach, to advertise the WIDEN_MULT_PLUS
> and WIDEN_MULT operations as LO/HI part operations of the double
> vector size, and also add fake double-vector patterns for move,
> widening and add (they get expanded or split to single-vector
> patterns).  That seems to work for the dot product, it's like the
> code is unrolled by a factor of two.

Yeah, that would work as well.  As said, the vectorizer cannot really
handle the doubled vector size (means: you sooner or later will run
into issues).

> There are a few drawbacks though:
> - the tree optimizer creates separate WIDEN_MULT and PLUS expressions,
> and it is left to the combiner to clean that up.  That combination and
> register allocation might be a bit fragile.
> - if the input isn't known to be aligned to the doubled vector size, a
> run-time check is inserted to use an unvectorized loop if there is no
> excess alignment.
> - auto-increment for the loads is lost.  I can probably fix this by
> keeping double-sized loads around for longer or with some
> special-purpose pass, but both might have some other drawbacks.  But
> there's actually a configuration option for an instruction to load
> multiple vector registers with register-indirect or auto-increment,
> so there is some merit to have a pattern for it.
> - the code size is larger.
> - vectorization will fail if any other code is mixed in for which no
> double-vector patterns are provided.
> - this approach uses SUBREGs in ways that are not safe according to
> the documentation.  But then, other ports like i386 and aarch64
> little-endian do that too.

I think they do it if they really have such instructions in the ISA
(they also have those that do the in-loop reduction to half of the
result vector size -- DOT_PROD_EXPR).

> I think it is now (since we have SUBREG_BYTE) safe to have subregs of
> registers with hard reg sizes larger than UNITS_PER_WORD, as long as
> you refer to entire hard registers.  Maybe we could change the
> documentation?
> AFAICT, there are also only four places that need to be patched to
> make a lowpart access with a SUBREG of such a hard register safe.
> I'm trying this at the moment, it was just a few hours late for the
> phase 1->3 deadline.
>
> I suppose for WIDEN_SUM_EXPR, I'd have to have one double-vector-sized
> pattern that adds the products of the two input vectors into the
> double output vector, and leave the RTL loop optimizer to get the
> constant pool load of the all-ones vector out of the loop.  But again,
> there'll be issues with excess alignment requirements and code size.

I think going the DOT_PROD_EXPR way is a lot easier.  You simply expand
the additional (in-loop) sum.  The only drawback I see is that this
might be slower code.  So yes, the _LO/_HI way maps better to hardware,
but you rely on CSE to remove the redundant instruction if you implement
_LO/_HI as doing the full operation and just taking one of the result
vectors.

> > The vectorizer cannot really deal with multiple sizes, thus for example
> > a V4SI * V4SI + V4DI operation and that all those tree codes are exposed
> > as "scalar" is sth that continues to confuse me but is mainly done
> > because at pattern recognition time there's only the scalars.
> Well, the vectorizer makes an exception for reductions as it'll allow
> to maintain either a vector or a scalar during the loop, so why not
> allow other sizes for that value as well?
It's not implemented ;)

> It's all hidden in the final reduction emitted by the epilogue.
>
> > For vectorization I would advise to provide expansion patterns for
> > codes that are already supported, in your case DOT_PROD_EXPR.
> With vector size doubling, it seems to work better with LO/HI multiply
> and PLUS (and let the combiner take the strain).
> Without... for a straight expansion, there is little point.  The
> previous sum is in one register, the multiply results are spread over
> two registers, and DOT_PROD_EXPR is supposed to yield a scalar.  Even
> with a reduction instruction to sum up two registers, you need another
> instruction to add up all three, so a minimum of three instructions.

No, DOT_PROD_EXPR yields a vector of the same size as the inputs.  That
means it has to reduce the N element result vector to an M element one
to match that constraint.  For example on x86 pmaddwd is an instruction
that does this.  That is, the overhead for you is doing a single vector
add to combine the two vector results to one.

> LO/HI multiply can be fudged by doing a full multiply and picking half
> the result, and CSE should reduce that to one multiply.  Again, two
> adds are needed, because the reduction variable is too narrow to use
> widening multiply-add.
> There may be some merit to DOT_PROD_EXPR if I make it do something
> strange.  But there's no easy way to use a special-purpose mode, since
> there's no matching reduction pattern for a DOT_PROD_EXPR, and the
> reduction for a WIDEN_SUM_EXPR is not readily distinguishable from the
> one for a non-widening summation with the same output vector mode.
> I could use a special kind of hard register that's really another view
> of a group of vector registers and which are reserved for this purpose
> unless eliminated, and the elimination is blocked when there is a
> statement that uses these registers because the expander for the
> DOT_PROD_EXPR / WIDEN_SUM_EXPR sticks the actually used hard registers
> somewhere, and if the special 'hard reg' can't be obtained, another,
> more expensive pattern (suitably indicated in the constraints) is
> used... but that's a lot of hair.
> It's probably easier to write a special-purpose SSA pass to patch up
> the type of the reduction variable, and insert that pass to run after
> the vectorizer: widen the variable when entering the loop, reduce it
> when exiting.  If the loop is not understood, a more expensive pattern
> with standard reduction variable width is used.
> In which case, the value of DOT_PROD_EXPR / WIDEN_SUM_EXPR is that
> they are somewhat special and thus stick out (or in other words, you
> can take a bit of time to verify you got something interesting when
> you find them).
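(To make the pmaddwd example above concrete -- a minimal sketch using the
SSE2 intrinsics; the function name is an illustrative assumption:)

    #include <emmintrin.h>

    /* _mm_madd_epi16 is pmaddwd: eight 16-bit products are reduced
       pairwise to four 32-bit sums, so the result vector has the same
       size as the input vectors and the wide accumulator needs only
       one additional full-width vector add per iteration.  */
    static __m128i
    sse2_dot_prod_step (__m128i a, __m128i b, __m128i acc)
    {
      return _mm_add_epi32 (acc, _mm_madd_epi16 (a, b));
    }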