On Wed, Nov 14, 2018 at 4:21 PM Joern Wolfgang Rennecke
<joern.renne...@riscy-ip.com> wrote:
>
> On 14/11/18 09:53, Richard Biener wrote:
> >> WIDEN_MULT_PLUS is special on our target in that it creates
> >> double-sized vectors.
> > Are there really double-size vectors or does the target simply produce
> > the output in two vectors?  Usually targets have WIDEN_MULT_PLUS_LOW/HIGH
> > or _EVEN/ODD split operations.  Or, like - what I now remember - for the
> > DOT_PROD_EXPR optab, the target already reduces element pairs of the
> > result vector (unspecified which ones) so the result vector is of the
> > same size as the inputs.
> The output of widening multiply and widening multiply-add is stored
> in two consecutive registers.  So, they can be used as separate
> vectors, but you can't choose the register numbers independently.
> OTOH, you can treat them together as a double-sized vector, but
> without any extra alignment requirements over a single-sized vector.
>
> > That is, if your target produces two vectors you may instead want to
> > hide that fact by claiming you support DOT_PROD_EXPR and expanding
> > it to the widen-mult-plus plus reducing (adding) the two result vectors
> > to get a single one.
>
> Doing a part of the reduction in the loop is a bit pointless.
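(A source-level model of that suggested DOT_PROD_EXPR expansion, written
with GCC's generic vector extensions; the vector widths, types and the
function name are illustrative assumptions, not from any proposed patch:)

    typedef short v4hi __attribute__ ((vector_size (8)));
    typedef int   v2si __attribute__ ((vector_size (8)));

    /* One DOT_PROD_EXPR step: the widening multiply conceptually yields
       two v2si result vectors (low and high halves); a single extra
       in-loop vector add folds them back into an accumulator of the
       input vector size.  Which element pairs end up summed together
       is unspecified, as noted above.  */
    static v2si
    dot_prod_step (v4hi a, v4hi b, v2si acc)
    {
      v2si lo = { (int) a[0] * b[0], (int) a[1] * b[1] };
      v2si hi = { (int) a[2] * b[2], (int) a[3] * b[3] };
      return acc + lo + hi;
    }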
The advantage is you can work with a lower vectorization factor and thus
less unrolling (smaller code).

> I have tried another approach, to advertise the WIDEN_MULT_PLUS
> and WIDEN_MULT operations as LO/HI part operations of the double
> vector size, and also add fake double-vector patterns for move,
> widening and add (they get expanded or split to single-vector
> patterns).  That seems to work for the dot product, it's like the
> code is unrolled by a factor of two.

Yeah, that would work as well.  As said, the vectorizer cannot really
handle the doubled vector size (means: you sooner or later will run
into issues).

> There are a few drawbacks though:
> - the tree optimizer creates separate WIDEN_MULT and PLUS expressions,
> and it is left to the combiner to clean that up.  That combination and
> register allocation might be a bit fragile.
> - if the input isn't known to be aligned to the doubled vector size, a
> run-time check is inserted to use an unvectorized loop if there is no
> excess alignment.
> - auto-increment for the loads is lost.  I can probably fix this by
> keeping double-sized loads around for longer or with some
> special-purpose pass, but both might have some other drawbacks.  But
> there's actually a configuration option for an instruction to load
> multiple vector registers with register-indirect or auto-increment,
> so there is some merit to have a pattern for it.
> - the code size is larger.
> - vectorization will fail if any other code is mixed in for which no
> double-vector patterns are provided.
> - this approach uses SUBREGs in ways that are not safe according to
> the documentation.  But then, other ports like i386 and aarch64
> little-endian do that too.

I think they do it if they really have such instructions in the ISA
(they also have those that do the in-loop reduction to half of the
result vector size -- DOT_PROD_EXPR).

> I think it is now (since we have SUBREG_BYTE) safe to have subregs of
> registers with hard reg sizes larger than UNITS_PER_WORD, as long as
> you refer to entire hard registers.  Maybe we could change the
> documentation?
> AFAICT, there are also only four places that need to be patched to
> make a lowpart access with a SUBREG of such a hard register safe.
> I'm trying this at the moment, it was just a few hours late for the
> phase 1->3 deadline.
>
> I suppose for WIDEN_SUM_EXPR, I'd have to have one double-vector-sized
> pattern that adds the products of the two input vectors into the
> double output vector, and leave the RTL loop optimizer to get the
> constant pool load of the all-ones vector out of the loop.  But again,
> there'll be issues with excess alignment requirements and code size.

I think going the DOT_PROD_EXPR way is a lot easier.  You simply expand
the additional (in-loop) sum.  The only drawback I see is that this
might be slower code.  So yes, the _LO/_HI way maps better to hardware,
but you rely on CSE to remove the redundant instruction if you implement
_LO/_HI as doing the full operation and just taking one of the result
vectors.

> > The vectorizer cannot really deal with multiple sizes, thus for example
> > a V4SI * V4SI + V4DI operation and that all those tree codes are exposed
> > as "scalar" is sth that continues to confuse me but is mainly done
> > because at pattern recognition time there's only the scalars.
> Well, the vectorizer makes an exception for reductions as it'll allow
> to maintain either a vector or a scalar during the loop, so why not
> allow other sizes for that value as well?
It's not implemented ;)

> It's all hidden in the final reduction emitted by the epilogue.
>
> > For vectorization I would advise to provide expansion patterns for
> > codes that are already supported, in your case DOT_PROD_EXPR.
> With vector size doubling, it seems to work better with LO/HI multiply
> and PLUS (and let the combiner take the strain).
> Without... for a straight expansion, there is little point.  The
> previous sum is in one register, the multiply results are spread over
> two registers, and DOT_PROD_EXPR is supposed to yield a scalar.  Even
> with a reduction instruction to sum up two registers, you need another
> instruction to add up all three, so a minimum of three instructions.

No, DOT_PROD_EXPR yields a vector of the same size as the inputs.  That
means it has to reduce the N element result vector to an M element one
to match that constraint.  For example on x86 pmaddwd is an instruction
that does this.  That is, the overhead for you is doing a single vector
add to combine the two vector results to one.

> LO/HI multiply can be fudged by doing a full multiply and picking half
> the result, and CSE should reduce that to one multiply.  Again, two
> adds are needed, because the reduction variable is too narrow to use
> widening multiply-add.
> There may be some merit to DOT_PROD_EXPR if I make it do something
> strange.  But there's no easy way to use a special-purpose mode, since
> there's no matching reduction pattern for a DOT_PROD_EXPR, and the
> reduction for a WIDEN_SUM_EXPR is not readily distinguishable from the
> one for a non-widening summation with the same output vector mode.
> I could use a special kind of hard register that's really another view
> of a group of vector registers and which are reserved for this purpose
> unless eliminated, and the elimination is blocked when there is a
> statement that uses these registers because the expander for the
> DOT_PROD_EXPR / WIDEN_SUM_EXPR sticks the actually used hard registers
> somewhere, and if the special 'hard reg' can't be obtained, another,
> more expensive pattern (suitably indicated in the constraints) is
> used... but that's a lot of hair.
> It's probably easier to write a special-purpose SSA pass to patch up
> the type of the reduction variable, and insert that pass to run after
> the vectorizer: widen the variable when entering the loop, reduce it
> when exiting.  If the loop is not understood, a more expensive pattern
> with standard reduction variable width is used.
> In which case, the value of DOT_PROD_EXPR / WIDEN_SUM_EXPR is that
> they are somewhat special and thus stick out (or in other words, you
> can take a bit of time to verify you got something interesting when
> you find them).
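(To make the pmaddwd example above concrete -- a minimal sketch using the
SSE2 intrinsics; the function name is an illustrative assumption:)

    #include <emmintrin.h>

    /* _mm_madd_epi16 is pmaddwd: eight 16-bit products are reduced
       pairwise to four 32-bit sums, so the result vector has the same
       size as the input vectors and the wide accumulator needs only
       one additional full-width vector add per iteration.  */
    static __m128i
    sse2_dot_prod_step (__m128i a, __m128i b, __m128i acc)
    {
      return _mm_add_epi32 (acc, _mm_madd_epi16 (a, b));
    }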