On Fri, Sep 11, 2015 at 2:19 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:

> Hi Alan,
>
> I probably wasn't clear enough. The implementation in the vectorizer is
> fine, and I'm not asking that to change per target. What I'm objecting
> to is the equivalence between a REDUC_MAX_EXPR and a cost associated
> with vec_to_scalar. This assumes that the back end will implement a
> REDUC_MAX_EXPR in a specific way that at least some back ends cannot.
> But those back ends should be free to model the cost of the
> REDUC_MAX_EXPR appropriately. Therefore I am asking for a new
> vect_cost_for_stmt type to represent the cost of a REDUC_MAX_EXPR. For
> ARM, this cost will be the same as a vec_to_scalar. For others, it may
> not be; for powerpc, it certainly will not be.
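[Editor's note: the point about per-target lowering can be illustrated with a scalar sketch. The function below is hypothetical (not GCC code): it models how a target lacking a single max-reduction instruction might expand a REDUC_MAX_EXPR as a pairwise sequence, which takes on the order of log2(n) max steps for an n-lane vector rather than one vec_to_scalar-equivalent operation.]

```c
/* Hypothetical scalar model of a pairwise max reduction.  A target
   without a single-instruction REDUC_MAX_EXPR might expand it this
   way: each step maxes the lower half of the lanes against the upper
   half, halving the number of live lanes, so an n-lane vector needs
   about log2(n) max steps (plus lane shuffles) instead of one
   instruction.  */
static int
pairwise_reduc_max (int *lanes, int n)
{
  for (int width = n; width > 1; width /= 2)
    for (int i = 0; i < width / 2; i++)
      lanes[i] = lanes[i] > lanes[i + width / 2]
                 ? lanes[i] : lanes[i + width / 2];
  return lanes[0];
}
```

For an 8-lane vector this is 3 max steps; this is why the cost is mode-dependent (logarithmic in the element count) on such targets, while a target with a hardware reduction can charge a flat vec_to_scalar-like cost.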
> We can produce a perfectly fine sequence for a REDUC_MAX_EXPR during RTL
> expansion, and therefore it is not correct for us to explode this in
> tree-vect-generic. This would expand the code size without providing
> any significant optimization opportunity, and could reduce the ability
> to, for instance, common REDUC_MAX_EXPRs. It would also slow down the
> gimple vectorizers.
>
> I apologize if my loose use of language confused the issue. It isn't
> the whole COND_REDUCTION I'm concerned with, but the REDUC_MAX_EXPRs
> that are used by it.
>
> (The costs in powerpc won't be enormous, but they are definitely
> mode-dependent in a way that vec_to_scalar is not. We'll need 2*log(n)
> instructions, where n is the number of elements in the mode being
> vectorized.)

IIUC, on AArch64 a reduc_max_expr matches a single reduction operation,
but on AArch32 Neon a reduc_smax gets implemented as a sequence of vpmax
instructions, which sounds similar to the PowerPC example as well. Thus
mapping a reduc_smax expression to the cost of a vec_to_scalar is
probably not right in this particular situation.

regards
Ramana

> A secondary concern for powerpc is that REDUC_MAX_EXPR produces a scalar
> that has to be broadcast back to a vector, and the best way to implement
> it for us already has the max value in all positions of a vector. But
> that is something we should be able to fix with simplify-rtx in the back
> end.
>
> Thanks,
> Bill
>
> On Fri, 2015-09-11 at 10:15 +0100, Alan Hayward wrote:
>> Hi Bill,
>>
>> I’d be a bit worried about asking the backend for the cost of a
>> COND_REDUCTION, as that will rely on the backend understanding the
>> implementation the vectorizer is using - every time the vectorizer
>> changed, the backends would need to be updated too. I’m hoping soon to get
>> together a patch to reduce the stmts produced for the simpler cases, which
>> would require a different set of costings.
>> I can also imagine further
>> improvements being added for other special cases over time. Having the
>> backends understand every variation would be a little cumbersome.
>>
>> As it stands today, we correctly exit the optimisation if max reduction
>> isn’t supported in hardware, which is what the cost model is expecting.
>>
>> If power wanted to use this implementation, then I think it’d probably
>> need some code in tree-vect-generic.c to implement an emulated max
>> reduction, which would then require updates to the cost modelling of
>> anything that uses max reduction (not just cond reduction). All of that is
>> outside the scope of this patch.
>>
>> Thanks,
>> Alan.
>>
>> On 10/09/2015 23:14, "Bill Schmidt" <wschm...@linux.vnet.ibm.com> wrote:
>>
>> > Hi Alan,
>> >
>> > The cost modeling of the epilogue code seems pretty target-specific ("An
>> > EQ stmt and an AND stmt, reduction of the max index and a reduction of
>> > the found values, a broadcast of the max value," resulting in two
>> > vector_stmts, one vec_to_scalar, and two scalar_to_vecs). On powerpc,
>> > this will not represent the cost accurately, and the cost will indeed be
>> > quite different depending on the mode (logarithmic in the number of
>> > elements). I think that you need to create a new entry in
>> > vect_cost_for_stmt to represent the cost of a COND_REDUCTION, and allow
>> > each target to calculate the cost appropriately.
>> >
>> > (Powerpc doesn't have a max-reduction hardware instruction, but because
>> > the reduction will be only in the epilogue code, it may still be
>> > profitable for us to generate the somewhat expensive reduction sequence
>> > in order to vectorize the loop. But we definitely want to model it as
>> > costly in and of itself. Also, the sequence will produce the maximum
>> > value in all positions without a separate broadcast.)
>> >
>> > Thanks,
>> > Bill
>> >
>> > On Thu, 2015-09-10 at 15:51 +0100, Alan Hayward wrote:
>> >> Hi,
>> >>
>> >> This patch (attached) adds support for vectorizing conditional
>> >> expressions (PR 65947), for example:
>> >>
>> >> int condition_reduction (int *a, int min_v)
>> >> {
>> >>   int last = 0;
>> >>   for (int i = 0; i < N; i++)
>> >>     if (a[i] < min_v)
>> >>       last = a[i];
>> >>   return last;
>> >> }
>> >>
>> >> To do this the loop is vectorised to create a vector of data results
>> >> (i.e. of matching a[i] values). Using an induction variable, an
>> >> additional vector is added containing the indexes where the matches
>> >> occurred. In the function epilogue this is reduced to a single max value
>> >> and then used to index into the vector of data results. When no values
>> >> are matched in the loop, the indexes vector will contain all zeroes,
>> >> eventually matching the first entry in the data results vector.
>> >>
>> >> To vectorize successfully, support is required for REDUC_MAX_EXPR. This
>> >> is supported by aarch64 and arm. On x86 and powerpc, gcc will complain
>> >> that REDUC_MAX_EXPR is not supported for the required modes, failing the
>> >> vectorization. On mips it complains that the required vcond expression
>> >> is not supported. It is suggested the relevant backend experts add the
>> >> required backend support.
>> >>
>> >> Using a simple testcase based around a large value of N and run on an
>> >> aarch64 Juno board, with the patch in use the runtime reduced to 0.8 of
>> >> its original time.
>> >>
>> >> This patch caused binary differences in three spec2006 binaries on
>> >> aarch64 - 416.gamess, 435.gromacs and 456.hmmer. Running them on a Juno
>> >> board showed no improvement or degradation in runtime.
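[Editor's note: the data-vector/index-vector scheme described in the patch summary can be sketched in scalar C. This is an illustration of the technique only, not the GIMPLE the patch actually emits; the function name, the VF constant, and the lane loops are all invented for the example.]

```c
#define VF 4  /* illustrative vectorization factor */

/* Scalar model of the vectorized conditional reduction: each "lane"
   tracks the last matching value and its 1-based position (from the
   induction variable); the epilogue reduces the index lanes with max
   (the REDUC_MAX_EXPR step) and selects that lane's data value.  If
   nothing matched, every index lane is 0 and the zero-initialized
   first data lane is used, matching the scalar "last = 0" default.
   n is assumed to be a multiple of VF here.  */
static int
cond_reduction_model (const int *a, int n, int min_v)
{
  int data[VF] = { 0 }, idx[VF] = { 0 };

  /* Vector body: all VF lanes update in lockstep.  */
  for (int i = 0; i < n; i += VF)
    for (int j = 0; j < VF; j++)
      if (a[i + j] < min_v)
        {
          data[j] = a[i + j];
          idx[j] = i + j + 1;   /* induction-based match index */
        }

  /* Epilogue: max-reduce the indexes, then pick that lane's data.  */
  int best = 0;
  for (int j = 1; j < VF; j++)
    if (idx[j] > idx[best])
      best = j;
  return data[best];
}
```

The lane holding the maximum index is the one that saw the last match in source order, so the selected data value agrees with the scalar loop's result.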
>> >> In the near future I hope to submit a further patch (as PR 66558) which
>> >> optimises the case where the result is simply the index of the loop, for
>> >> example:
>> >>
>> >> int condition_reduction (int *a, int min_v)
>> >> {
>> >>   int last = 0;
>> >>   for (int i = 0; i < N; i++)
>> >>     if (a[i] < min_v)
>> >>       last = i;
>> >>   return last;
>> >> }
>> >>
>> >> In this case a lot of the new code can be optimized away.
>> >>
>> >> I have run check for aarch64, arm and x86 and have seen no regressions.
>> >>
>> >> Changelog:
>> >>
>> >> 2015-08-28  Alan Hayward  <alan.hayw...@arm.com>
>> >>
>> >>     PR tree-optimization/65947
>> >>     * tree-vect-loop.c
>> >>     (vect_is_simple_reduction_1): Find condition reductions.
>> >>     (vect_model_reduction_cost): Add condition reduction costs.
>> >>     (get_initial_def_for_reduction): Add condition reduction initial var.
>> >>     (vect_create_epilog_for_reduction): Add condition reduction epilog.
>> >>     (vectorizable_reduction): Condition reduction support.
>> >>     * tree-vect-stmts.c
>> >>     (vectorizable_condition): Add vect reduction arg.
>> >>     * doc/sourcebuild.texi (Vector-specific attributes): Document
>> >>     vect_max_reduc.
>> >>
>> >> testsuite/Changelog:
>> >>
>> >>     PR tree-optimization/65947
>> >>     * lib/target-supports.exp
>> >>     (check_effective_target_vect_max_reduc): Add.
>> >>     * gcc.dg/vect/pr65947-1.c: New test.
>> >>     * gcc.dg/vect/pr65947-2.c: New test.
>> >>     * gcc.dg/vect/pr65947-3.c: New test.
>> >>     * gcc.dg/vect/pr65947-4.c: New test.
>> >>     * gcc.dg/vect/pr65947-5.c: New test.
>> >>     * gcc.dg/vect/pr65947-6.c: New test.
>> >>     * gcc.dg/vect/pr65947-7.c: New test.
>> >>     * gcc.dg/vect/pr65947-8.c: New test.
>> >>     * gcc.dg/vect/pr65947-9.c: New test.
>> >>     * gcc.dg/vect/pr65947-10.c: New test.
>> >>     * gcc.dg/vect/pr65947-11.c: New test.
>> >>
>> >> Thanks,
>> >> Alan