r14-332-g24905a4bd1375c adjusts costing of emulated vectorized
gather/scatter.
----
commit 24905a4bd1375ccd99c02510b9f9529015a48315
Author: Richard Biener <rguent...@suse.de>
Date:   Wed Jan 18 11:04:49 2023 +0100

    Adjust costing of emulated vectorized gather/scatter

    Emulated gather/scatter behave similar to strided elementwise
    accesses in that they need to decompose the offset vector
    and construct or decompose the data vector so handle them
    the same way, pessimizing the cases with many elements.
----

But for emulated gather/scatter, the offset vector load/vec_construct has
already been counted, and in real cases it is probably eliminated by
later optimizations.
Also, after decomposing, element loads from contiguous memory could be
less bound compared to normal elementwise loads.
The patch decreases the cost a little bit.

This will enable gather emulation for the loop below with VF=8 (ymm):

double
foo (double* a, double* b, unsigned int* c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    sum += a[i] * b[c[i]];
  return sum;
}

For the loop above, a microbenchmark on ICX shows that emulated gather
with VF=8 is 30% faster than emulated gather with VF=4 when the trip
count is big enough.
It brings back ~4% for 510.parest; there is still a ~5% regression
compared to real gather instructions due to a throughput bound.

For -march=znver1/2/3/4, the change doesn't enable VF=8 (ymm) for the
loop; VF remains 4 (xmm) as before (presumably related to their own cost
models).


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

        PR target/111064
        * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
        Decrease cost a little bit for vec_to_scalar(offset vector) in
        emulated gather.

gcc/testsuite/ChangeLog:

        * gcc.target/i386/pr111064.c: New test.
---
 gcc/config/i386/i386.cc                  | 11 ++++++++++-
 gcc/testsuite/gcc.target/i386/pr111064.c | 12 ++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr111064.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 1bc3f11ff07..337e0f1bfbb 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24079,7 +24079,16 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
          || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
-      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
+      /* For emulated gather/scatter, the offset vector load/vec_construct
+        has already been counted and in real cases it's probably eliminated
+        by later optimizations.
+        Also after decomposing, element loads from contiguous memory
+        could be less bound compared to normal elementwise loads.  */
+      if (kind == vec_to_scalar
+         && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+       stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+      else
+       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
     }
   else if ((kind == vec_construct || kind == scalar_to_vec)
           && node
diff --git a/gcc/testsuite/gcc.target/i386/pr111064.c b/gcc/testsuite/gcc.target/i386/pr111064.c
new file mode 100644
index 00000000000..aa2589bd36f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr111064.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=icelake-server -mno-gather" } */
+/* { dg-final { scan-assembler-times {(?n)vfmadd[123]*pd.*ymm} 2 { target { ! ia32 } } } }  */
+
+double
+foo (double* a, double* b, unsigned int* c, int n)
+{
+  double sum = 0;
+  for (int i = 0; i != n; i++)
+    sum += a[i] * b[c[i]];
+  return sum;
+}
-- 
2.31.1
