This patch enables VECT_COMPARE_COSTS by default for SVE, both so that we can compare SVE against Advanced SIMD and so that (with future patches) we can compare multiple SVE vectorisation approaches against each other. It also adds a target-specific --param to control this.
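For anyone wanting to try the --param, here is a minimal sketch of the kind of loop the cost comparison is aimed at. It is modelled on the widening reductions in the testsuite changes. The names sum_bytes and sum_bytes_selfcheck are made up for illustration and are not part of the patch. The compile line in the comment is an assumption about a typical invocation, so adjust the driver name and -march to suit your toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example, not part of the patch.  One way to compare the
   generated code with and without cost comparison (assumed invocation):

     gcc -O2 -ftree-vectorize -march=armv8.2-a+sve -S sum.c
     gcc -O2 -ftree-vectorize -march=armv8.2-a+sve \
	 --param=aarch64-sve-compare-costs=0 -S sum.c  */

/* Widening reduction: 8-bit inputs accumulated into a 32-bit sum.  */
int32_t
sum_bytes (const int8_t *restrict array, int count)
{
  int32_t sum = 0;
  for (int i = 0; i < count; ++i)
    sum += array[i];
  return sum;
}

/* Small scalar sanity check of the function above.  */
void
sum_bytes_selfcheck (void)
{
  const int8_t data[5] = { 1, -2, 3, 4, 5 };
  assert (sum_bytes (data, 5) == 11);
  assert (sum_bytes (data, 0) == 0);
}
```

Which vector mode the vectorizer picks for a loop like this depends on the target's cost model, so treat any difference between the two outputs as something to inspect rather than a guaranteed result.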
It only became possible to have target-specific --params after Martin's patches earlier in the week (thanks!). Since this is the first one, it looks like a bit of an odd-one-out. But I think the list is going to grow over time. I certainly have other SVE-related things I'd like to put behind a --param in future.

Tested on aarch64-linux-gnu and applied as r278337.

Richard


2019-11-16  Richard Sandiford  <richard.sandif...@arm.com>

gcc/
	* config/aarch64/aarch64.opt (--param=aarch64-sve-compare-costs):
	New option.
	* doc/invoke.texi: Document it.
	* config/aarch64/aarch64.c (aarch64_autovectorize_vector_modes): By
	default, return VECT_COMPARE_COSTS for SVE.

gcc/testsuite/
	* gcc.target/aarch64/sve/reduc_3.c: Split multi-vector cases out
	into...
	* gcc.target/aarch64/sve/reduc_3_costly.c: ...this new test, passing
	-fno-vect-cost-model for them.
	* gcc.target/aarch64/sve/slp_6.c: Add -fno-vect-cost-model.
	* gcc.target/aarch64/sve/slp_7.c,
	* gcc.target/aarch64/sve/slp_7_run.c: Split multi-vector cases out
	into...
	* gcc.target/aarch64/sve/slp_7_costly.c,
	* gcc.target/aarch64/sve/slp_7_costly_run.c: ...these new tests,
	passing -fno-vect-cost-model for them.
	* gcc.target/aarch64/sve/while_7.c: Add -fno-vect-cost-model.
	* gcc.target/aarch64/sve/while_9.c: Likewise.

Index: gcc/config/aarch64/aarch64.opt
===================================================================
--- gcc/config/aarch64/aarch64.opt	2019-09-27 09:09:26.771844993 +0100
+++ gcc/config/aarch64/aarch64.opt	2019-11-16 10:42:55.025462691 +0000
@@ -258,3 +258,7 @@ long aarch64_stack_protector_guard_offse
 moutline-atomics
 Target Report Mask(OUTLINE_ATOMICS) Save
 Generate local calls to out-of-line atomic operations.
+
+-param=aarch64-sve-compare-costs=
+Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
+When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach. Also use the cost model to choose between SVE and Advanced SIMD vectorization.
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	2019-11-14 14:34:25.707796466 +0000
+++ gcc/doc/invoke.texi	2019-11-16 10:42:55.033462635 +0000
@@ -11179,8 +11179,8 @@ without notice in future releases.
 In order to get minimal, maximal and default value of a parameter,
 one can use @option{--help=param -Q} options.
 
-In each case, the @var{value} is an integer. The allowable choices for
-@var{name} are:
+In each case, the @var{value} is an integer. The following choices
+of @var{name} are recognized for all targets:
 
 @table @gcctabopt
 @item predictable-branch-outcome
@@ -12396,6 +12396,20 @@ statements or when determining their val
 diagnostics.
 
 @end table
+
+The following choices of @var{name} are available on AArch64 targets:
+
+@table @gcctabopt
+@item aarch64-sve-compare-costs
+When vectorizing for SVE, consider using ``unpacked'' vectors for
+smaller elements and use the cost model to pick the cheapest approach.
+Also use the cost model to choose between SVE and Advanced SIMD vectorization.
+
+Using unpacked vectors includes storing smaller elements in larger
+containers and accessing elements with extending loads and truncating
+stores.
+@end table
+
 @end table
 
 @node Instrumentation Options
Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c	2019-11-16 10:40:08.402638818 +0000
+++ gcc/config/aarch64/aarch64.c	2019-11-16 10:42:55.025462691 +0000
@@ -15962,7 +15962,15 @@ aarch64_autovectorize_vector_modes (vect
      for this case. */
   modes->safe_push (V2SImode);
 
-  return 0;
+  unsigned int flags = 0;
+  /* Consider enabling VECT_COMPARE_COSTS for SVE, both so that we
+     can compare SVE against Advanced SIMD and so that we can compare
+     multiple SVE vectorization approaches against each other. There's
+     not really any point doing this for Advanced SIMD only, since the
+     first mode that works should always be the best. */
+  if (TARGET_SVE && aarch64_sve_compare_costs)
+    flags |= VECT_COMPARE_COSTS;
+  return flags;
 }
 
 /* Implement TARGET_MANGLE_TYPE. */
Index: gcc/testsuite/gcc.target/aarch64/sve/reduc_3.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/reduc_3.c	2019-11-06 12:28:21.000000000 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/reduc_3.c	2019-11-16 10:42:55.033462635 +0000
@@ -17,7 +17,6 @@ void reduc_ptr_##DSTTYPE##_##SRCTYPE (DS
 
 REDUC_PTR (int8_t, int8_t)
 REDUC_PTR (int16_t, int16_t)
-
 REDUC_PTR (int32_t, int32_t)
 REDUC_PTR (int64_t, int64_t)
 
@@ -25,17 +24,6 @@ REDUC_PTR (_Float16, _Float16)
 REDUC_PTR (float, float)
 REDUC_PTR (double, double)
 
-/* Widening reductions. */
-REDUC_PTR (int32_t, int8_t)
-REDUC_PTR (int32_t, int16_t)
-
-REDUC_PTR (int64_t, int8_t)
-REDUC_PTR (int64_t, int16_t)
-REDUC_PTR (int64_t, int32_t)
-
-REDUC_PTR (float, _Float16)
-REDUC_PTR (double, float)
-
 /* Float<>Int conversions */
 REDUC_PTR (_Float16, int16_t)
 REDUC_PTR (float, int32_t)
@@ -45,8 +33,14 @@ REDUC_PTR (int16_t, _Float16)
 REDUC_PTR (int32_t, float)
 REDUC_PTR (int64_t, double)
 
-/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 4 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 { xfail *-*-* } } } */
+/* We don't yet vectorize the int<-float cases. */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/reduc_3_costly.c
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/reduc_3_costly.c	2019-11-16 10:42:55.033462635 +0000
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -fno-vect-cost-model" } */
+
+#include <stdint.h>
+
+#define NUM_ELEMS(TYPE) (32 / sizeof (TYPE))
+
+#define REDUC_PTR(DSTTYPE, SRCTYPE) \
+void reduc_ptr_##DSTTYPE##_##SRCTYPE (DSTTYPE *restrict sum, \
+				      SRCTYPE *restrict array, \
+				      int count) \
+{ \
+  *sum = 0; \
+  for (int i = 0; i < count; ++i) \
+    *sum += array[i]; \
+}
+
+/* Widening reductions. */
+REDUC_PTR (int32_t, int8_t)
+REDUC_PTR (int32_t, int16_t)
+
+REDUC_PTR (int64_t, int8_t)
+REDUC_PTR (int64_t, int16_t)
+REDUC_PTR (int64_t, int32_t)
+
+REDUC_PTR (float, _Float16)
+REDUC_PTR (double, float)
+
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/slp_6.c	2019-11-06 12:28:21.000000000 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_6.c	2019-11-16 10:42:55.033462635 +0000
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math" } */
+/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math -fno-vect-cost-model" } */
 
 #include <stdint.h>
 
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_7.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/slp_7.c	2019-11-06 12:28:21.000000000 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_7.c	2019-11-16 10:42:55.033462635 +0000
@@ -31,37 +31,27 @@ #define TEST_ALL(T) \
   T (uint16_t) \
   T (int32_t) \
   T (uint32_t) \
-  T (int64_t) \
-  T (uint64_t) \
   T (_Float16) \
-  T (float) \
-  T (double)
+  T (float)
 
 TEST_ALL (VEC_PERM)
 
-/* We can't use SLP for the 64-bit loops, since the number of reduction
-   results might be greater than the number of elements in the vector.
-   Otherwise we have two loads per loop, one for the initial vector
-   and one for the loop body. */
+/* We have two loads per loop, one for the initial vector and one for
+   the loop body. */
 /* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
 /* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
 /* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
-/* { dg-final { scan-assembler-times {\tld4d\t} 3 } } */
 /* { dg-final { scan-assembler-not {\tld4b\t} } } */
 /* { dg-final { scan-assembler-not {\tld4h\t} } } */
 /* { dg-final { scan-assembler-not {\tld4w\t} } } */
-/* { dg-final { scan-assembler-not {\tld1d\t} } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 8 } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 8 } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 8 } } */
-/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 8 } } */
 /* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
 /* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
-/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
 
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_7_run.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/slp_7_run.c	2019-11-06 12:28:21.000000000 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_7_run.c	2019-11-16 10:42:55.033462635 +0000
@@ -1,7 +1,11 @@
 /* { dg-do run { target aarch64_sve_hw } } */
 /* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
 
-#include "slp_7.c"
+#ifndef FILENAME
+#define FILENAME "slp_7.c"
+#endif
+
+#include FILENAME
 
 #define N (54 * 4)
 
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_7_costly.c
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_7_costly.c	2019-11-16 10:42:55.033462635 +0000
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math -fno-vect-cost-model" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE) \
+void __attribute__ ((noinline, noclone)) \
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n) \
+{ \
+  TYPE x0 = b[0]; \
+  TYPE x1 = b[1]; \
+  TYPE x2 = b[2]; \
+  TYPE x3 = b[3]; \
+  for (int i = 0; i < n; ++i) \
+    { \
+      x0 += a[i * 4]; \
+      x1 += a[i * 4 + 1]; \
+      x2 += a[i * 4 + 2]; \
+      x3 += a[i * 4 + 3]; \
+    } \
+  b[0] = x0; \
+  b[1] = x1; \
+  b[2] = x2; \
+  b[3] = x3; \
+}
+
+#define TEST_ALL(T) \
+  T (int64_t) \
+  T (uint64_t) \
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* We can't use SLP for the 64-bit loops, since the number of reduction
+   results might be greater than the number of elements in the vector. */
+/* { dg-final { scan-assembler-times {\tld4d\t} 3 } } */
+/* { dg-final { scan-assembler-not {\tld1d\t} } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 8 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
+
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_7_costly_run.c
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_7_costly_run.c	2019-11-16 10:42:55.033462635 +0000
@@ -0,0 +1,5 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -fno-vect-cost-model" } */
+
+#define FILENAME "slp_7_costly.c"
+#include "slp_7_run.c"
Index: gcc/testsuite/gcc.target/aarch64/sve/while_7.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/while_7.c	2019-08-13 22:33:36.221955159 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/while_7.c	2019-11-16 10:42:55.033462635 +0000
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */
+/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -fno-vect-cost-model" } */
 
 #include <stdint.h>
 
Index: gcc/testsuite/gcc.target/aarch64/sve/while_9.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/while_9.c	2019-08-13 22:33:36.221955159 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/while_9.c	2019-11-16 10:42:55.033462635 +0000
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */
+/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -fno-vect-cost-model" } */
 
 #include <stdint.h>
 
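
To summarise the aarch64.c change in isolation: the hook now returns VECT_COMPARE_COSTS only when SVE is enabled and the new --param is nonzero, and returns 0 otherwise. The standalone sketch below models just that decision. SKETCH_VECT_COMPARE_COSTS, sketch_autovectorize_flags and sketch_selfcheck are stand-ins made up so that it compiles outside GCC. The real flag value is whatever GCC defines, and the real hook also pushes the candidate vector modes, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for GCC's VECT_COMPARE_COSTS; the value here is an
   assumption chosen only so that the sketch is self-contained.  */
#define SKETCH_VECT_COMPARE_COSTS 1u

/* Model of the flag computation at the end of
   aarch64_autovectorize_vector_modes.  TARGET_SVE corresponds to
   TARGET_SVE in the patch and SVE_COMPARE_COSTS to the value of
   --param=aarch64-sve-compare-costs (0 or 1, default 1).  */
unsigned int
sketch_autovectorize_flags (bool target_sve, unsigned int sve_compare_costs)
{
  unsigned int flags = 0;
  if (target_sve && sve_compare_costs)
    flags |= SKETCH_VECT_COMPARE_COSTS;
  return flags;
}

/* Exercise the three interesting combinations.  */
void
sketch_selfcheck (void)
{
  /* SVE with the default --param value: compare costs.  */
  assert (sketch_autovectorize_flags (true, 1) == SKETCH_VECT_COMPARE_COSTS);
  /* SVE with --param=aarch64-sve-compare-costs=0: no flags.  */
  assert (sketch_autovectorize_flags (true, 0) == 0);
  /* Advanced SIMD only: no flags, whatever the --param says.  */
  assert (sketch_autovectorize_flags (false, 1) == 0);
}
```

The Advanced SIMD-only case stays at 0 for the reason given in the patch comment: the first mode that works should always be the best, so there is nothing to compare.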