Kyrylo Tkachov via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi all,
>
> While experimenting with some backend costs for Advanced SIMD and SVE I hit
> many cases where GCC would pick SVE for VLA auto-vectorisation even when the
> backend very clearly presented cheaper costs for Advanced SIMD.
> For a simple float addition loop the SVE costs were:
>
> vec.c:9:21: note:  Cost model analysis:
>   Vector inside of loop cost: 28
>   Vector prologue cost: 2
>   Vector epilogue cost: 0
>   Scalar iteration cost: 10
>   Scalar outside cost: 0
>   Vector outside cost: 2
>   prologue iterations: 0
>   epilogue iterations: 0
>   Minimum number of vector iterations: 1
>   Calculated minimum iters for profitability: 4
>   
> and for Advanced SIMD (Neon) they're:
>
> vec.c:9:21: note:  Cost model analysis:
>   Vector inside of loop cost: 11
>   Vector prologue cost: 0
>   Vector epilogue cost: 0
>   Scalar iteration cost: 10
>   Scalar outside cost: 0
>   Vector outside cost: 0
>   prologue iterations: 0
>   epilogue iterations: 0
>   Calculated minimum iters for profitability: 0
> vec.c:9:21: note:    Runtime profitability threshold = 4

Just to expand on this for others on the list: this is comparing
SVE with an estimated VL of 256 bits with Advanced SIMD at 128 bits,
so for 8 floats it's 28 vs 22.

For generic SVE we'd justify using VLA in that situation because the
gap is relatively small and SVE would (according to the cost model)
be a clear win beyond 256 bits.

But in this case, the 256-bit VL estimate comes directly from a
-mcpu/-mtune option, so it is more definite than the usual estimates
for generic SVE.  What happens at larger VL is pretty much irrelevant
in this case.

> yet the SVE one was always picked. With guidance from Richard this seems to
> be due to the vinfo comparisons in vect_better_loop_vinfo_p, in particular the
> part with the big comment explaining the
> estimated_rel_new * 2 <= estimated_rel_old heuristic.
>
> This patch extends the comparisons by introducing a three-way estimate
> kind for poly_int values that the backend can distinguish.
> This allows vect_better_loop_vinfo_p to ask for minimum, maximum and likely
> estimates and pick Advanced SIMD over SVE when it is clearly cheaper.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Manually checked that with reasonable separate costs for Advanced SIMD and SVE
> GCC picks up one over the other in ways I'd expect.
>
> Ok for trunk?
> Thanks,
> Kyrill
>
> gcc/
>       * target.h (enum poly_value_estimate_kind): Define.
>       (estimated_poly_value): Take an estimate kind argument.
>       * target.def (estimated_poly_value): Update definition for the above.
>       * doc/tm.texi: Regenerate.
>       * tree-vect-loop.c (vect_better_loop_vinfo_p): Use min, max and likely
>       estimates of VF to pick between vinfos.
>       * config/aarch64/aarch64.c (aarch64_cmp_autovec_modes): Use
>       estimated_poly_value instead of aarch64_estimated_poly_value.
>       (aarch64_estimated_poly_value): Take a kind argument and handle it.

OK, thanks.

Richard