r14-332-g24905a4bd1375c adjusts costing of emulated vectorized gather/scatter.
----
commit 24905a4bd1375ccd99c02510b9f9529015a48315
Author: Richard Biener <rguent...@suse.de>
Date:   Wed Jan 18 11:04:49 2023 +0100

    Adjust costing of emulated vectorized gather/scatter

    Emulated gather/scatter behave similar to strided elementwise
    accesses in that they need to decompose the offset vector
    and construct or decompose the data vector so handle them
    the same way, pessimizing the cases with many elements.
----
But for emulated gather/scatter, the offset vector load/vec_construct has
already been counted, and in practice it is probably eliminated by later
optimization passes. Also, after decomposing, element loads from contiguous
memory could be less bounded compared to normal elementwise loads.
The patch decreases the cost slightly: vec_to_scalar for an emulated
gather/scatter is now scaled by the number of vector elements rather than
by that number plus one.

This enables gather emulation for the loop below with VF=8 (ymm):

double
foo (double* a, double* b, unsigned int* c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    sum += a[i] * b[c[i]];
  return sum;
}

For the loop above, a microbenchmark on ICX shows that emulated gather
with VF=8 is 30% faster than emulated gather with VF=4 when the trip
count is big enough.
It brings back ~4% for 510.parest, which still shows a ~5% regression
compared to the gather instruction because it is throughput bound.

For -march=znver1/2/3/4, the change doesn't enable VF=8 (ymm) for the
loop; VF remains 4 (xmm) as before (I guess this is related to their own
cost model).

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

	PR target/111064
	* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
	Decrease cost a little bit for vec_to_scalar(offset vector) in
	emulated gather.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr111064.c: New test.
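
To make "emulated gather" concrete for the loop in the description, here is a hand-written C sketch. It is not compiler output: it assumes VF=8, models each vector as a small array, and the names foo_ref, foo_emulated, off, data and acc are made up for the illustration. It shows the three steps the costing discussion refers to: loading the offset vector, decomposing it into scalar element loads, and constructing the data vector before the multiply-add.

```c
#include <assert.h>

#define VF 8	/* assumed vectorization factor (the VF=8 case) */

/* Scalar reference: the loop from the patch description.  */
static double
foo_ref (const double *a, const double *b, const unsigned int *c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    sum += a[i] * b[c[i]];
  return sum;
}

/* Illustration only: the same loop with the gather from B emulated
   one element at a time.  Lane-wise accumulation reorders the sum,
   which is what -Ofast permits for the real loop.  */
static double
foo_emulated (const double *a, const double *b, const unsigned int *c, int n)
{
  assert (n >= 0);
  double acc[VF] = { 0 };	/* stands in for the vector accumulator */
  int i = 0;
  for (; i + VF <= n; i += VF)
    {
      unsigned int off[VF];	/* the "offset vector" */
      double data[VF];		/* the "data vector" */
      for (int l = 0; l < VF; l++)
	off[l] = c[i + l];	/* contiguous load of the offsets */
      for (int l = 0; l < VF; l++)
	data[l] = b[off[l]];	/* extract each offset, scalar load, insert */
      for (int l = 0; l < VF; l++)
	acc[l] += a[i + l] * data[l];	/* vector multiply-add */
    }
  double sum = 0;
  for (int l = 0; l < VF; l++)
    sum += acc[l];		/* horizontal reduction of the lanes */
  for (; i != n; i++)
    sum += a[i] * b[c[i]];	/* scalar epilogue for the remainder */
  return sum;
}
```

Calling foo_emulated and foo_ref on the same inputs should give the same result whenever the products are exactly representable; with general floating-point data the reordered reduction may differ in the last bits.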
---
 gcc/config/i386/i386.cc                  | 11 ++++++++++-
 gcc/testsuite/gcc.target/i386/pr111064.c | 12 ++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr111064.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 1bc3f11ff07..337e0f1bfbb 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24079,7 +24079,16 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	       || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
-      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
+      /* For emulated gather/scatter, offset vector load/vec_construct has
+	 already been counted and in real case, it's probably eliminated by
+	 later optimizer.
+	 Also after decomposing, element loads from continous memory
+	 could be less bounded compared to normal elementwise load.  */
+      if (kind == vec_to_scalar
+	  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+	stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+      else
+	stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
     }
   else if ((kind == vec_construct || kind == scalar_to_vec)
 	   && node
diff --git a/gcc/testsuite/gcc.target/i386/pr111064.c b/gcc/testsuite/gcc.target/i386/pr111064.c
new file mode 100644
index 00000000000..aa2589bd36f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr111064.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=icelake-server -mno-gather" } */
+/* { dg-final { scan-assembler-times {(?n)vfmadd[123]*pd.*ymm} 2 { target { ! ia32 } } } } */
+
+double
+foo (double* a, double* b, unsigned int* c, int n)
+{
+  double sum = 0;
+  for (int i = 0; i != n; i++)
+    sum += a[i] * b[c[i]];
+  return sum;
+}
-- 
2.31.1
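
For readers who want the arithmetic of the i386.cc hunk in isolation, the following C sketch models only the scaling step. It is a simplified stand-in, not the GCC code: the enums and the functions scaled_cost_before and scaled_cost_after are hypothetical names, and base_cost stands for whatever ix86_builtin_vectorization_cost returns for the statement, whose real values are not reproduced here.

```c
#include <assert.h>

/* Hypothetical, reduced stand-ins for the GCC enumerators used in the hunk.  */
enum cost_kind { KIND_VEC_TO_SCALAR, KIND_SCALAR_LOAD };
enum access_type { ACCESS_ELEMENTWISE, ACCESS_GATHER_SCATTER };

/* Scaling before the patch: always multiply by nunits + 1.  */
static int
scaled_cost_before (int base_cost, int nunits)
{
  assert (nunits > 0);
  return base_cost * (nunits + 1);
}

/* Scaling after the patch: vec_to_scalar on a gather/scatter access
   is multiplied by nunits only; every other case is unchanged.  */
static int
scaled_cost_after (enum cost_kind kind, enum access_type access,
		   int base_cost, int nunits)
{
  assert (nunits > 0);
  if (kind == KIND_VEC_TO_SCALAR && access == ACCESS_GATHER_SCATTER)
    return base_cost * nunits;
  return base_cost * (nunits + 1);
}
```

With an assumed base cost of 1 and 8 elements, the vec_to_scalar cost for a gather/scatter access drops from 9 to 8, while a scalar_load on the same access, or vec_to_scalar on an elementwise access, stays at 9. The sketch does not model how that small difference feeds into the vectorizer's choice of VF.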