From: Christopher Bazley <[email protected]>

An anti-pattern found in compiled code when predicated tails were
enabled for basic block SLP vectorization was triggered by
byte-reversing patterns in source code, such as:

uint8_t *dst;
int size;
dst[0] = size >> 24;
dst[1] = size >> 16;
dst[2] = size >> 8;
dst[3] = size >> 0;

which would previously have compiled to:

rev    w1, w1
str    w1, [x0]

but (with tail-predication) was vectorized as:

mov     z31.b, w1
ptrue   p7.s, vl4
fmov    s30, w1
sshr    v29.2s, v30.2s, #8
insr    z31.s, s29
sshr    v30.2s, v30.2s, #16
insr    z31.s, s30
fmov    s30, w1
sshr    v30.2s, v30.2s, #24
insr    z31.s, s30
st1b    {z31.s}, p7, [x0]

One reason is that the SLP pass runs before the store-merging
pass gets a chance to coalesce 4 stores into 1 and substitute a
32 bit bswap implementation. Even ignoring that, costing of the
vectorized version (cost: 4) compared to the scalar version
(also 4) was not realistic:

_2 1 times vector_store costs 1 in body
node 0x32ee6d0 1 times vec_construct costs 3 in prologue

There were a couple of contributing issues:
1. the cost of mask construction for the vector_store (ptrue) was
omitted for BB SLP, whereas the loop vectorizer explicitly charges
for it.
2. the cost of vec_construct (elements / 2 + 1) did not incorporate
any GPR-to-SIMD register transfer costs (mov, fmov).

Since the supposed cost of the vectorised code only just reached parity
with the scalar code, addressing either of the above issues would be
sufficient to prevent vectorisation (in this specific case). It is also
less risky than changing the order of passes, and less hacky than
teaching the SLP pass about store-merging.

This commit addresses only the second issue, by adding code in
vector_costs::add_stmt_cost to charge scalar_to_vec_cost for each
element of an external def of kind vec_construct (with specific
exceptions noted below). This cost is added to the base cost
already charged by aarch64_builtin_vectorization_cost for a
vec_construct (which is assumed to cover the cost of the INSR or
equivalent instructions).

This is justifiable because SIMD-to-SIMD insertions into a vector
register generally have lower latency and higher throughput than
GPR-to-SIMD insertions.

The basic structure of the code was copied from commit
90d693bdc9d71841f51d68826ffa5bd685d7f0bc which modified the x86
backend in a similar way, but adapted to use a hash_set<tree>
instead of TREE_VISITED to guard against charging twice or more for
the same scalar op feeding an external def.

This commit assumes that constructing a vector from memory
is no more costly than the equivalent set of scalar loads (or at least
that any difference is incorporated in the cost returned by
aarch64_builtin_vectorization_cost for vec_construct). It also assumes
that constructing a vector from scalar values of floating point type,
from a BIT_FIELD_REF/lastb that extracts from a vector register, or
from the result of a call to an inbuilt reduction function, does not
incur GPR-to-SIMD register transfer costs because such scalars are
typically already in FP/SIMD registers on AArch64.

gcc/ChangeLog:

        * config/aarch64/aarch64.cc (aarch64_call_scalar_result_in_simd_reg_p):
        New function to determine probabilistically whether a gcall
        produces a scalar result in a SIMD/FP register.
        (aarch64_scalar_op_to_vec_p): New function to determine
        whether or not to add scalar_to_vec_cost per scalar operand
        from which a vector is to be constructed.
        (aarch64_external_adjust_stmt_cost): New function to adjust the
        cost of an SLP tree node for a vec_construct that is fed by
        values defined outside the vectorized region.
        (aarch64_vector_costs::add_stmt_cost): Call the new
        aarch64_external_adjust_stmt_cost function if we have an SLP
        node and a vector type.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/sve/vec_construct_1.c: New test.
        * gcc.target/aarch64/sve/vec_construct_2.c: New test.
        * gcc.target/aarch64/sve/vec_construct_3.c: New test.
        * gcc.target/aarch64/sve/vec_construct_4.c: New test.
        * gcc.target/aarch64/sve/vec_construct_5.c: New test.
        * gcc.target/aarch64/vec-construct-1.c: New test.
        * gcc.target/aarch64/vec-construct-10.c: New test.
        * gcc.target/aarch64/vec-construct-11.c: New test.
        * gcc.target/aarch64/vec-construct-12.c: New test.
        * gcc.target/aarch64/vec-construct-2.c: New test.
        * gcc.target/aarch64/vec-construct-3.c: New test.
        * gcc.target/aarch64/vec-construct-4.c: New test.
        * gcc.target/aarch64/vec-construct-5.c: New test.
        * gcc.target/aarch64/vec-construct-6.c: New test.
        * gcc.target/aarch64/vec-construct-7.c: New test.
        * gcc.target/aarch64/vec-construct-8.c: New test.
        * gcc.target/aarch64/vec-construct-9.c: New test.
---
 gcc/config/aarch64/aarch64.cc                 | 147 ++++++++++++++++++
 .../gcc.target/aarch64/sve/vec_construct_1.c  |  37 +++++
 .../gcc.target/aarch64/sve/vec_construct_2.c  |  42 +++++
 .../gcc.target/aarch64/sve/vec_construct_3.c  |  39 +++++
 .../gcc.target/aarch64/sve/vec_construct_4.c  |  37 +++++
 .../gcc.target/aarch64/sve/vec_construct_5.c  |  37 +++++
 .../gcc.target/aarch64/vec-construct-1.c      |  28 ++++
 .../gcc.target/aarch64/vec-construct-10.c     |  42 +++++
 .../gcc.target/aarch64/vec-construct-11.c     |  37 +++++
 .../gcc.target/aarch64/vec-construct-12.c     |  35 +++++
 .../gcc.target/aarch64/vec-construct-2.c      |  33 ++++
 .../gcc.target/aarch64/vec-construct-3.c      |  30 ++++
 .../gcc.target/aarch64/vec-construct-4.c      |  38 +++++
 .../gcc.target/aarch64/vec-construct-5.c      |  34 ++++
 .../gcc.target/aarch64/vec-construct-6.c      |  42 +++++
 .../gcc.target/aarch64/vec-construct-7.c      |  37 +++++
 .../gcc.target/aarch64/vec-construct-8.c      |  41 +++++
 .../gcc.target/aarch64/vec-construct-9.c      |  35 +++++
 18 files changed, 771 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-10.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-11.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-12.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-6.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-7.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-9.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 37c28c8f2f8f..821ff63d47c7 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -18610,6 +18610,149 @@ aarch64_possible_by_lane_insn_p (vec_info *m_vinfo, 
gimple *stmt)
   return false;
 }
 
+/* Determine probabilistically whether CALL is one that produces a scalar
+   result in a SIMD register.  */
+static bool
+aarch64_call_scalar_result_in_simd_reg_p (const gcall *call)
+{
+  /* Don't assume that non-built-in functions return results in SIMD registers
+     because the ABI says that a scalar integer result is returned in a GPR.  
*/
+  if (!gimple_call_internal_p (call)
+      && !gimple_call_builtin_p (call, BUILT_IN_MD))
+    return false;
+
+  /* Assume that built-in functions which have at least one floating point or
+     SIMD parameter return scalar results in a SIMD register.  This heuristic
+     covers both reductions (whose result is always in a SIMD register) and
+     vector element extractions such as lastb (where the result can be in a GPR
+     or SIMD register, but instruction selection is assumed to make the choice
+     that is most efficient for the usage).  */
+  for (unsigned i = 0; i < gimple_call_num_args (call); ++i)
+    if (VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (call, i)))
+       || SCALAR_FLOAT_TYPE_P (TREE_TYPE (gimple_call_arg (call, i))))
+      return true;
+
+  /* Assume that other built-in functions return scalar results in a GPR.  */
+  return false;
+}
+
+/* Determine probabilistically whether the scalar operand OP is one that could
+   incur additional costs for a GPR to SIMD register transfer.  We can't be
+   sure, but certain operations have a higher chance.  */
+static bool
+aarch64_scalar_op_to_vec_p (tree op)
+{
+  gcc_checking_assert (TREE_CODE (op) == SSA_NAME);
+
+  tree optype = TREE_TYPE (op);
+  if (SCALAR_FLOAT_TYPE_P (optype))
+    return false;
+
+  gcc_checking_assert (!AGGREGATE_TYPE_P (optype));
+  gcc_checking_assert (!VECTOR_TYPE_P (optype));
+
+  gimple *def = SSA_NAME_DEF_STMT (op);
+  if (is_gimple_assign (def)
+      && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def)))
+    {
+      tree lhs = gimple_assign_lhs (def);
+      tree rhs = gimple_assign_rhs1 (def);
+      if (TREE_CODE (rhs) == SSA_NAME
+         /* A sign-change expands to nothing.  */
+         && tree_nop_conversion_p (TREE_TYPE (lhs), TREE_TYPE (rhs)))
+       def = SSA_NAME_DEF_STMT (rhs);
+    }
+
+  /* When the defining statement reads from memory, we can sometimes load its
+     value directly into a vector register lane, for example using
+       LD1 {v31.b}[1], [x0]
+     In reality, such operations usually seem to be lowered to a load-insert
+     pair instead, for example to allow pre-indexed addressing:
+       LDR b30, [x0, 4]
+       INS v31.b[1], v30.b[0]
+     Regardless, we do not charge extra scalar-to-vector costs for loads
+     from memory that feed a vec_construct because:
+       1.  builtin_vectorization_cost should already have charged any
+     insertion costs.
+       2.  Charging scalar-to-vector costs for loads would change how more
+     code is compiled. (Costs of scalar loads feeding a vec_construct are
+     charged separately; assume they subsume the cost of any SIMD load
+     instructions used in place of GPR load instructions as a consequence of
+     vectorization.)
+  */
+  if (gimple_vuse (def))
+    return false;
+
+  /* Likewise, we can hope to avoid using an intermediate GPR when
+     constructing a vector from a BIT_FIELD_REF that extracts from a
+     vector register.  */
+  if (is_gimple_assign (def) && gimple_assign_rhs_code (def) == BIT_FIELD_REF
+      && VECTOR_TYPE_P (TREE_TYPE (TREE_OPERAND (gimple_assign_rhs1 (def), 
0))))
+    return false;
+
+  /* Likewise, we can hope to avoid using an intermediate GPR when
+     constructing a vector from the result of a vector reduction.  */
+  if (const gcall *call = dyn_cast<const gcall *> (def))
+    if (aarch64_call_scalar_result_in_simd_reg_p (call))
+      return false;
+
+  /* Likewise, we can hope to avoid using an intermediate GPR when
+     constructing a vector from the integer result of a vector reduction
+     that is immediately narrowed.  */
+  if (is_gimple_assign (def)
+      && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def)))
+    {
+      tree lhs = gimple_assign_lhs (def);
+      tree rhs = gimple_assign_rhs1 (def);
+
+      if (TREE_CODE (rhs) == SSA_NAME && INTEGRAL_TYPE_P (TREE_TYPE (lhs))
+         && INTEGRAL_TYPE_P (TREE_TYPE (rhs))
+         && (TYPE_PRECISION (TREE_TYPE (lhs))
+             < TYPE_PRECISION (TREE_TYPE (rhs))))
+       {
+         gimple *rhs_def = SSA_NAME_DEF_STMT (rhs);
+         if (const gcall *call = dyn_cast<const gcall *> (rhs_def))
+           if (aarch64_call_scalar_result_in_simd_reg_p (call))
+             return false;
+       }
+    }
+
+  /* Otherwise, treat every component as requiring a GPR to SIMD
+     register transfer.  Notably, this prevents reverse-bytes ops
+     from being erroneously vectorized before reaching the store
+     merging pass that is supposed to ultimately produce REV.  */
+  return true;
+}
+
+/* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost
+   for NODE, which has cost kind KIND and which when vectorized would
+   operate on vector type VECTYPE.  Adjust the cost as necessary for a value
+   that is not defined within the vectorized region.  */
+static fractional_cost
+aarch64_external_adjust_stmt_cost (vect_cost_for_stmt kind, slp_tree node,
+                                  tree vectype, fractional_cost stmt_cost)
+{
+  if (SLP_TREE_DEF_TYPE (node) != vect_external_def)
+    return stmt_cost;
+
+  if (kind != vec_construct)
+    return stmt_cost;
+
+  const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
+  hash_set<tree> visited;
+
+  for (auto op : SLP_TREE_SCALAR_OPS (node))
+    {
+      if (TREE_CODE (op) != SSA_NAME || visited.add (op))
+       continue;
+
+      if (aarch64_scalar_op_to_vec_p (op))
+       stmt_cost += simd_costs->scalar_to_vec_cost;
+    }
+
+  return stmt_cost;
+}
+
 unsigned
 aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
                                     stmt_vec_info stmt_info, slp_tree node,
@@ -18800,6 +18943,10 @@ aarch64_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
       m_stores_to_vector_load_decl = true;
     }
 
+  if (node && vectype)
+    stmt_cost
+      = aarch64_external_adjust_stmt_cost (kind, node, vectype, stmt_cost);
+
   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
 }
 
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
new file mode 100644
index 000000000000..2f8ce6808a98
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from a horizontal
+   reduction is vectorized by constructing a vector and storing it.
+   Since there are no GPR-to-SIMD register transfers, there is no
+   need to charge additional costs for them.  Only one STR (Store
+   vector reg, unsigned immed, B/H/S/D-form) instruction is required
+   instead of 8.
+ */
+#include <arm_sve.h>
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t src3, svint8_t src4,
+     svint8_t src5, svint8_t src6, svint8_t src7)
+{
+  svbool_t all = svptrue_b8 ();
+  s.a = svmaxv_s8 (all, src0);
+  s.b = svminv_s8 (all, src1);
+  s.c = svlastb_s8 (svptrue_pat_b8 (SV_VL1), src2);
+  s.d = svaddv_s8 (all, src3);
+  s.e = svmaxv_s8 (all, src4);
+  s.f = svminv_s8 (all, src5);
+  s.g = svlastb_s8 (svptrue_pat_b8 (SV_VL1), src6);
+  s.h = svaddv_s8 (all, src7);
+}
+
+/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], 
v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
+/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+
+/* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
+/* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c 
b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
new file mode 100644
index 000000000000..6715118d7b09
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from the results of calls
+   to a function that has only vector parameters and returns a scalar result is
+   not vectorized by constructing a vector and storing it, given that the
+   GPR-to-SIMD version of INS (which would have had to be used to vectorize 
this
+   code) typically has higher latency and lower throughput than the 
SIMD-to-SIMD
+   version of INS.  This is a test for misidentification of builtin reductions.
+ */
+
+#include <arm_sve.h>
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+int8_t __attribute__ ((noinline, const))
+bar (svint8_t v)
+{
+  return v[0];
+}
+
+void
+foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t src3, svint8_t src4,
+     svint8_t src5, svint8_t src6, svint8_t src7)
+{
+  s.a = bar (src0);
+  s.b = bar (src1);
+  s.c = bar (src2);
+  s.d = bar (src3);
+  s.e = bar (src4);
+  s.f = bar (src5);
+  s.g = bar (src6);
+  s.h = bar (src7);
+}
+
+/* { dg-final { scan-assembler-times {\tstrb\tw[0-9]+, } 8 } } */
+
+/* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.b\[[0-9]+\], w[0-9]+\n} } 
} */
+/* { dg-final { scan-assembler-not {\tstr\td[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c 
b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
new file mode 100644
index 000000000000..8143d0050ade
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from a horizontal
+   reduction is vectorized by constructing a vector and storing it
+   even if the results of the reductions are narrowed.
+   Since there are no GPR-to-SIMD register transfers, there is no
+   need to charge additional costs for them.  Only one STR (Store
+   vector reg, unsigned immed, B/H/S/D-form) instruction is required
+   instead of 8.
+ */
+#include <arm_sve.h>
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (svint16_t src0, svint32_t src1, svint16_t src2, svint32_t src3,
+     svint32_t src4, svint16_t src5, svint32_t src6, svint16_t src7)
+{
+  svbool_t all16 = svptrue_b16 ();
+  svbool_t all32 = svptrue_b32 ();
+  s.a = svmaxv_s16 (all16, src0);
+  s.b = svminv_s32 (all32, src1);
+  s.c = svlastb_s16 (svptrue_pat_b16 (SV_VL1), src2);
+  s.d = svaddv_s32 (all32, src3);
+  s.e = svmaxv_s32 (all32, src4);
+  s.f = svminv_s16 (all16, src5);
+  s.g = svlastb_s32 (svptrue_pat_b32 (SV_VL1), src6);
+  s.h = svaddv_s16 (all16, src7);
+}
+
+/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], 
v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
+/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+
+/* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
+/* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c 
b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
new file mode 100644
index 000000000000..49f8114b64c1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from a horizontal
+   reduction is not vectorized by constructing a vector and storing it
+   if the results of the reductions are widened.  Widening typically
+   causes the result of a reduction to be transferred to a GPR, therefore
+   vectorization would require GPR-to-SIMD-register transfers.
+ */
+#include <arm_sve.h>
+
+struct S
+{
+  int32_t a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (svint16_t src0, svint8_t src1, svint16_t src2, svint8_t src3,
+     svint8_t src4, svint16_t src5, svint8_t src6, svint16_t src7)
+{
+  svbool_t all16 = svptrue_b16 ();
+  svbool_t all8 = svptrue_b8 ();
+  s.a = svmaxv_s16 (all16, src0);
+  s.b = svminv_s8 (all8, src1);
+  s.c = svlastb_s16 (svptrue_pat_b16 (SV_VL1), src2);
+  s.d = svaddv_s8 (all8, src3);
+  s.e = svmaxv_s8 (all8, src4);
+  s.f = svminv_s16 (all16, src5);
+  s.g = svlastb_s8 (svptrue_pat_b8 (SV_VL1), src6);
+  s.h = svaddv_s16 (all16, src7);
+}
+
+/* { dg-final { scan-assembler-times {\tstp\tw[0-9]+, w[0-9]+,} 4 } } */
+
+/* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.s\[[0-9]+\], w[0-9]+\n} } 
} */
+/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } }
+/* { dg-final { scan-assembler-not {\tstp\tq[0-9]+, q[0-9]+,} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c 
b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
new file mode 100644
index 000000000000..983d6c69ebc7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from lane extractions is
+   vectorized by constructing a vector and storing it.  Since there are no
+   GPR-to-SIMD register transfers, there is no need to charge additional costs
+   for them.  Only one STR (Store vector reg, unsigned immed, B/H/S/D-form)
+   instruction is required instead of 8.  This is a test for over-restriction
+   of an exemption from usual GPR-to-SIMD costs to reductions (in case lane
+   extraction is not treated as a reduction).
+ */
+#include <arm_sve.h>
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t src3, svint8_t src4,
+     svint8_t src5, svint8_t src6, svint8_t src7, svbool_t p)
+{
+  s.a = svlastb_s8 (p, src0);
+  s.b = svlastb_s8 (p, src1);
+  s.c = svlastb_s8 (p, src2);
+  s.d = svlastb_s8 (p, src3);
+  s.e = svlastb_s8 (p, src4);
+  s.f = svlastb_s8 (p, src5);
+  s.g = svlastb_s8 (p, src6);
+  s.h = svlastb_s8 (p, src7);
+}
+
+/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], 
v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
+/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+
+/* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
+/* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-1.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-1.c
new file mode 100644
index 000000000000..c101572bd35a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-1.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 4 elements held in SIMD registers is
+   vectorized by constructing a vector and storing it, given that the
+   SIMD-to-SIMD version of INS (used here) typically has lower latency
+   and higher throughput than the GPR-to-SIMD version of INS.  Despite
+   that, any benefit of vectorization is expected to be marginal
+   (including very little reduction in code size). */
+
+struct S
+{
+  __fp16 a, b, c, d;
+} s;
+
+void
+foo (__fp16 a, __fp16 b, __fp16 c, __fp16 d)
+{
+  s.a = a;
+  s.b = b;
+  s.c = c;
+  s.d = d;
+}
+
+/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[[0-9]+\], 
v[0-9]+\.h\[[0-9]+\]\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, \[x[0-9]+.*\]\n} 1 } } */
+
+/* { dg-final { scan-assembler-not {\tstr\th[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-10.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-10.c
new file mode 100644
index 000000000000..c9e25e61adfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-10.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from the results of calls
+   to a function that has only vector parameters and returns a scalar result is
+   not vectorized by constructing a vector and storing it, given that the
+   GPR-to-SIMD version of INS (which would have had to be used to vectorize 
this
+   code) typically has higher latency and lower throughput than the 
SIMD-to-SIMD
+   version of INS.  This is a test for misidentification of builtin reductions.
+ */
+
+#include <arm_neon.h>
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+int8_t __attribute__ ((noinline, const))
+bar (int8x8_t v)
+{
+        return v[0];
+}
+
+void
+foo (int8x8_t src0, int8x8_t src1, int8x8_t src2, int8x8_t src3, int8x8_t src4,
+     int8x8_t src5, int8x8_t src6, int8x8_t src7)
+{
+  s.a = bar (src0);
+  s.b = bar (src1);
+  s.c = bar (src2);
+  s.d = bar (src3);
+  s.e = bar (src4);
+  s.f = bar (src5);
+  s.g = bar (src6);
+  s.h = bar (src7);
+}
+
+/* { dg-final { scan-assembler-times {\tstrb\tw[0-9]+, } 8 } } */
+
+/* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.b\[[0-9]+\], w[0-9]+\n} } 
} */
+/* { dg-final { scan-assembler-not {\tstr\td[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-11.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-11.c
new file mode 100644
index 000000000000..d3104c8edee9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-11.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from a horizontal
+   reduction is vectorized by constructing a vector and storing it
+   even if the results of the reductions are narrowed.
+   Since there are no GPR-to-SIMD register transfers, there is no
+   need to charge additional costs for them.  Only one STR (Store
+   vector reg, unsigned immed, B/H/S/D-form) instruction is required
+   instead of 8.
+ */
+#include <arm_neon.h>
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (int16x4_t src0, int32x2_t src1, int16x4_t src2, int32x2_t src3,
+     int32x2_t src4, int16x4_t src5, int32x2_t src6, int16x4_t src7)
+{
+  s.a = vmaxv_s16 (src0);
+  s.b = vminv_s32 (src1);
+  s.c = vduph_lane_s16 (src2, 2);
+  s.d = vaddv_s32 (src3);
+  s.e = vmaxv_s32 (src4);
+  s.f = vminv_s16 (src5);
+  s.g = vdups_lane_s32 (src6, 1);
+  s.h = vaddv_s16 (src7);
+}
+
+/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], 
v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
+/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+
+/* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
+/* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-12.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-12.c
new file mode 100644
index 000000000000..b12d84efbc6d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-12.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from a horizontal
+   reduction is not vectorized by constructing a vector and storing it
+   if the results of the reductions are widened.  Widening typically
+   causes the result of a reduction to be transferred to a GPR, therefore
+   vectorization would require GPR-to-SIMD-register transfers.
+ */
+#include <arm_neon.h>
+
+struct S
+{
+  int32_t a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (int16x4_t src0, int8x8_t src1, int16x4_t src2, int8x8_t src3,
+     int8x8_t src4, int16x4_t src5, int8x8_t src6, int16x4_t src7)
+{
+  s.a = vmaxv_s16 (src0);
+  s.b = vminv_s8 (src1);
+  s.c = vduph_lane_s16 (src2, 2);
+  s.d = vaddv_s8 (src3);
+  s.e = vmaxv_s8 (src4);
+  s.f = vminv_s16 (src5);
+  s.g = vdupb_lane_s8 (src6, 1);
+  s.h = vaddv_s16 (src7);
+}
+
+/* { dg-final { scan-assembler-times {\tstp\tw[0-9]+, w[0-9]+,} 4 } } */
+
+/* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.s\[[0-9]+\], w[0-9]+\n} } 
} */
+/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } }
+/* { dg-final { scan-assembler-not {\tstp\tq[0-9]+, q[0-9]+,} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-2.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-2.c
new file mode 100644
index 000000000000..2cacf4b1db7c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-2.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 4 elements loaded from memory is vectorized 
by
+   constructing a vector and storing it, given that the LD1 (ASIMD load, 1
+   element, one lane, B/H/S) instruction typically has similar throughput to 
the
+   LDR (Load vector reg, unscaled immed) instruction that would be used by the
+   scalar version of the same code.  Any additional latency of LD1 is assumed 
to
+   be represented by the basic cost of vector construction that is applied
+   uniformly.  Since there are no GPR-to-SIMD register transfers, there is no
+   need to charge additional costs for them.  Only one STR (Store vector reg,
+   unsigned immed, B/H/S/D-form) instruction is required by the vectorized code
+   but the overall benefit of vectorization is expected to be marginal (except
+   on code size) because scalar loads are easier to parallelize, albeit at the
+   cost of using more SIMD registers.  */
+
+struct S { __fp16 a, b, c, d; } s;
+
+void
+foo (__fp16 *a, __fp16 *b, __fp16 *c, __fp16 *d)
+{
+  __fp16 a_ = *a, b_ = *b, c_ = *c, d_ = *d;
+  s.a = a_;
+  s.b = b_;
+  s.c = c_;
+  s.d = d_;
+}
+
+/* { dg-final { scan-assembler-times {\tld1\t{v[0-9]+\.h}\[[0-9]+\], 
\[x[0-9]+\]\n} 3 } } */
+/* (The fourth load is usually ldr hN, ... but don't require that.) */
+/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, \[x[0-9]+.*\]\n} 1 } } */
+
+/* { dg-final { scan-assembler-not {\tstr\th[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-3.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-3.c
new file mode 100644
index 000000000000..9dea1db0c911
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-3.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 4 elements held in scalar registers is
+   not vectorized by constructing a vector and storing it, given that the
+   GPR-to-SIMD version of INS (which would have had to be used to vectorize
+   this code) typically has higher latency and lower throughput than the
+   SIMD-to-SIMD version of INS (with which such a transformation might
+   have been profitable).  No increase in code size is expected as a
+   consequence of forgoing vectorization, either.  */
+
+struct S
+{
+  short a, b, c, d;
+} s;
+
+void
+foo (short a, short b, short c, short d)
+{
+  s.a = a;
+  s.b = b;
+  s.c = c;
+  s.d = d;
+}
+
+/* { dg-final { scan-assembler-times {\tstrh\tw[0-9]+, } 4 } } */
+
+/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } } */
+/* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.h\[[0-9]+\], w[0-9]+\n} } 
} */
+/* { dg-final { scan-assembler-not {\tstr\td[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-4.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-4.c
new file mode 100644
index 000000000000..c13a4a0c16f9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-4.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 4 elements loaded from memory is vectorized 
by
+   constructing a vector and storing it, given that the LD1 (ASIMD load, 1
+   element, one lane, B/H/S) instruction typically  has similar throughput to
+   the LDRSH (Load, immed offset) instruction that would be used by the scalar
+   version of the same code.  Any additional latency of LD1 is assumed to be
+   represented by the basic cost of vector construction that is applied
+   uniformly.  Since there are no GPR-to-SIMD register transfers, there is no
+   need to charge additional costs for them.  The STR (Store vector reg,
+   unsigned immed, B/H/S/D-form) instruction typically has higher latency than
+   STRH (Store register, unsigned immed), but only one is required.  Despite
+   that, the overall benefit of vectorization is expected to be marginal 
(except
+   on code size) because scalar loads are easier to parallelize, albeit at the
+   cost of using more general purpose registers. */
+
+struct S
+{
+  short a, b, c, d;
+} s;
+
+void
+foo (short *a, short *b, short *c, short *d)
+{
+  short a_ = *a, b_ = *b, c_ = *c, d_ = *d;
+  s.a = a_;
+  s.b = b_;
+  s.c = c_;
+  s.d = d_;
+}
+
+/* { dg-final { scan-assembler-times {\tld1\t{v[0-9]+\.h}\[[0-9]+\], 
\[x[0-9]+\]\n} 3 } } */
+/* (The fourth load is usually ldr hN, ... but don't require that.) */
+/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+
+/* { dg-final { scan-assembler-not {\tldrsh\tw[0-9]+, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tstrh\tw[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-5.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-5.c
new file mode 100644
index 000000000000..17cf4e9a4cae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-5.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements held in scalar registers is
+   not vectorized by constructing a vector and storing it, given that the
+   GPR-to-SIMD version of INS (which would have had to be used to vectorize
+   this code) typically has higher latency and lower throughput than the
+   SIMD-to-SIMD version of INS (with which such a transformation might
+   have been profitable).  No increase in code size is expected as a
+   consequence of forgoing vectorization, either.  */
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (char a, char b, char c, char d, char e, char f, char g, char h)
+{
+  s.a = a;
+  s.b = b;
+  s.c = c;
+  s.d = d;
+  s.e = e;
+  s.f = f;
+  s.g = g;
+  s.h = h;
+}
+
+/* { dg-final { scan-assembler-times {\tstrb\tw[0-9]+, } 8 } } */
+
+/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } } */
+/* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.b\[[0-9]+\], w[0-9]+\n} } 
} */
+/* { dg-final { scan-assembler-not {\tstr\td[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-6.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-6.c
new file mode 100644
index 000000000000..41f8455b772b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-6.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements loaded from memory is
+   vectorized by constructing a vector and storing it, given that the
+   LD1 (ASIMD load, 1 element, one lane, B/H/S) instruction typically
+   has similar throughput to the LDRB (Load, immed offset) instruction
+   that would be used by the scalar code.  Any additional latency of LD1 is
+   assumed to be represented by the basic cost of vector construction that is
+   applied uniformly.  Since there are no GPR-to-SIMD register transfers, there
+   is no need to charge additional costs for them.  The STR (Store vector reg,
+   unsigned immed, B/H/S/D-form) instruction typically has higher latency than
+   STRB (Store register, unsigned immed), but only one is required.  Despite
+   that, the overall benefit of vectorization is expected to be marginal 
(except
+   on code size) because scalar loads are easier to parallelize, albeit at the
+   cost of using more general purpose registers. */
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (char *a, char *b, char *c, char *d, char *e, char *f, char *g, char *h)
+{
+  char a_ = *a, b_ = *b, c_ = *c, d_ = *d, e_ = *e, f_ = *f, g_ = *g, h_ = *h;
+  s.a = a_;
+  s.b = b_;
+  s.c = c_;
+  s.d = d_;
+  s.e = e_;
+  s.f = f_;
+  s.g = g_;
+  s.h = h_;
+}
+
+/* { dg-final { scan-assembler-times {\tld1\t{v[0-9]+\.b}\[[0-9]+\], 
\[x[0-9]+\]\n} 7 } } */
+/* (The eighth load is usually ldr bN, ... but don't require that.) */
+/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+
+/* { dg-final { scan-assembler-not {\tldrb\tw[0-9]+, \[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-7.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-7.c
new file mode 100644
index 000000000000..0acdc1b7edde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-7.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of byte-reversed stores of 8 elements derived from
+   scalar registers is not vectorized by constructing a vector and
+   storing it, given that the GPR-to-SIMD version of INS (which would
+   have had to be used to vectorize this code) typically has higher
+   latency and lower throughput than the SIMD-to-SIMD version of INS.
+   Actually, vectorization would not be profitable in either case because
+   it would have the unfortunate side-effect of preventing store-merging
+   that would otherwise happen in a later pass, which would prevent the
+   byte-reversing pattern from being recognised and lowered using scalar
+   instructions.  */
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (unsigned int b, unsigned int c)
+{
+  s.a = b >> 24;
+  s.b = b >> 16;
+  s.c = b >> 8;
+  s.d = b >> 0;
+  s.e = c >> 24;
+  s.f = c >> 16;
+  s.g = c >> 8;
+  s.h = c >> 0;
+}
+
+/* { dg-final { scan-assembler-times {\trev\t(w[0-9]+), \1\n} 2 } } */
+
+/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } } */
+/* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.b\[[0-9]+\], w[0-9]+\n} } 
} */
+/* { dg-final { scan-assembler-not {\tstr\td[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-8.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-8.c
new file mode 100644
index 000000000000..ebd1b80b8d43
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-8.c
@@ -0,0 +1,41 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from the results of calls
+   to a function that has only floating point parameters and returns a scalar
+   result is not vectorized by constructing a vector and storing it, given that
+   the GPR-to-SIMD version of INS (which would have had to be used to vectorize
+   this code) typically has higher latency and lower throughput than the
+   SIMD-to-SIMD version of INS.  This is a test for misidentification of 
builtin
+   reductions.  */
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+char __attribute__ ((noinline, const))
+bar (double x)
+{
+        return x;
+}
+
+void
+foo (double src0, double src1, double src2, double src3, double src4,
+     double src5, double src6, double src7)
+{
+  s.a = bar (src0);
+  s.b = bar (src1);
+  s.c = bar (src2);
+  s.d = bar (src3);
+  s.e = bar (src4);
+  s.f = bar (src5);
+  s.g = bar (src6);
+  s.h = bar (src7);
+}
+
+/* { dg-final { scan-assembler-times {\tstrb\tw[0-9]+, } 8 } } */
+
+/* { dg-final { scan-assembler-not {\tdup\tv[0-9]+.8b, w[0-9]+\n} } } */
+/* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.b\[[0-9]+\], w[0-9]+\n} } 
} */
+/* { dg-final { scan-assembler-not {\tstr\td[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-construct-9.c 
b/gcc/testsuite/gcc.target/aarch64/vec-construct-9.c
new file mode 100644
index 000000000000..479e80ac1c50
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-construct-9.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-slp-vectorize" } */
+
+/* Test that a group of stores of 8 elements derived from a horizontal
+   reduction is vectorized by constructing a vector and storing it.
+   Since there are no GPR-to-SIMD register transfers, there is no
+   need to charge additional costs for them.  Only one STR (Store
+   vector reg, unsigned immed, B/H/S/D-form) instruction is required
+   instead of 8.
+ */
+#include <arm_neon.h>
+
+struct S
+{
+  char a, b, c, d, e, f, g, h;
+} s;
+
+void
+foo (int8x8_t src0, int8x8_t src1, int8x8_t src2, int8x8_t src3, int8x8_t src4,
+     int8x8_t src5, int8x8_t src6, int8x8_t src7)
+{
+  s.a = vmaxv_s8 (src0);
+  s.b = vminv_s8 (src1);
+  s.c = vdupb_lane_s8 (src2, 2);
+  s.d = vaddv_s8 (src3);
+  s.e = vmaxv_s8 (src4);
+  s.f = vminv_s8 (src5);
+  s.g = vdupb_lane_s8 (src6, 1);
+  s.h = vaddv_s8 (src7);
+}
+
+/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], 
v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
+/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+
+/* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
-- 
2.54.0


Reply via email to