Hi gcc-patches mailing list,
Christopher Bazley via Sourceware Forge <[email protected]> has requested
that the following Forgejo pull request be published on the mailing list.

Created on: 2026-05-11 12:12:13+00:00
Latest update: 2026-05-11 12:27:43+00:00
Changes: 18 changed files, 771 additions, 0 deletions
Head revision: chris.bazley/gcc ref GNUTOOLS-16436-4 commit 6044a2d73ff4d3a66cf2a8ee003255efa9e16c55
Base revision: gcc/gcc-TEST ref trunk commit c392d64098cc675c804ff4f516548d023a4fe29a r17-207-gc392d64098cc67
Merge base: c392d64098cc675c804ff4f516548d023a4fe29a
Full diff url: https://forge.sourceware.org/gcc/gcc-TEST/pulls/154.diff
Discussion:  https://forge.sourceware.org/gcc/gcc-TEST/pulls/154
Requested Reviewers:

Enabling predicated tails for basic block SLP vectorization exposed
an anti-pattern in the generated code, triggered by byte-reversing
patterns in source code such as:

uint8_t *dst;
int size;
dst[0] = size >> 24;
dst[1] = size >> 16;
dst[2] = size >> 8;
dst[3] = size >> 0;

which would previously have compiled to:

rev    w1, w1
str    w1, [x0]

but (with tail predication enabled) was vectorized as:

mov     z31.b, w1
ptrue   p7.s, vl4
fmov    s30, w1
sshr    v29.2s, v30.2s, #8
insr    z31.s, s29
sshr    v30.2s, v30.2s, #16
insr    z31.s, s30
fmov    s30, w1
sshr    v30.2s, v30.2s, #24
insr    z31.s, s30
st1b    {z31.s}, p7, [x0]
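
For reference, the source pattern can be written as a self-contained
function and compared with a single byte-swapped 32-bit store, which is
what the rev/str sequence above implements. This is only a sketch: the
function names are hypothetical, and the use of __builtin_bswap32 with
memcpy is just one way to express the single-store form.

```c
#include <stdint.h>
#include <string.h>

/* The byte-reversing pattern: four narrow stores, most significant
   byte first.  */
void
store_be32 (uint8_t *dst, int size)
{
  dst[0] = size >> 24;
  dst[1] = size >> 16;
  dst[2] = size >> 8;
  dst[3] = size >> 0;
}

/* Hypothetical single-store equivalent: byte-swap on little-endian
   targets, then store all 32 bits at once.  */
void
store_be32_merged (uint8_t *dst, int size)
{
  uint32_t v = (uint32_t) size;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  v = __builtin_bswap32 (v);
#endif
  memcpy (dst, &v, sizeof v);
}
```

Both functions write the same four bytes for any input, which is why
the scalar form is so cheap once the stores are coalesced.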

One reason is that the SLP pass runs before the store-merging
pass gets a chance to coalesce the four stores into one and
substitute a 32-bit bswap implementation. Even ignoring that, the
costing of the vectorized version (cost: 4) relative to the scalar
version (also 4) was not realistic:

_2 1 times vector_store costs 1 in body
node 0x32ee6d0 1 times vec_construct costs 3 in prologue

Two issues contributed:
1. The cost of mask construction for the vector_store (ptrue) was
   omitted for BB SLP, whereas the loop vectorizer explicitly charges
   for it.
2. The cost of vec_construct (elements / 2 + 1) did not include any
   GPR-to-SIMD register transfer costs (mov, fmov).

Since the supposed cost of the vectorized code only just reached parity
with the scalar code, addressing either of the above issues would be
sufficient to prevent vectorization (in this specific case). Fixing the
costing is also less risky than changing the order of passes, and less
hacky than teaching the SLP pass about store-merging.
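
The parity argument can be checked with a little arithmetic. The sketch
below uses the costs quoted from the dump (1 for the vector_store, 3 for
the vec_construct, 4 for the scalar code). The function names are
hypothetical, and the rule that a tie favours the vector version is an
assumption inferred from the fact that vectorization happened at 4
versus 4.

```c
#include <stdbool.h>

/* Vector-side total: the store, the vec_construct, plus any EXTRA cost
   (mask construction or register transfers) the model decides to
   charge.  */
unsigned
vector_total_cost (unsigned store_cost, unsigned construct_cost,
                   unsigned extra)
{
  return store_cost + construct_cost + extra;
}

/* Assumption: vectorization goes ahead when the vector cost does not
   exceed the scalar cost, so a tie favours the vector version.  */
bool
bb_slp_vectorizes (unsigned vector_cost, unsigned scalar_cost)
{
  return vector_cost <= scalar_cost;
}
```

With no extra charge the totals are 4 and 4 and the block is
vectorized; a single additional unit of cost on the vector side is
enough to tip the decision the other way.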

This commit addresses only the second issue, by adding code in
vector_costs::add_stmt_cost to charge scalar_to_vec_cost for each
element of an external def of kind vec_construct (with specific
exceptions noted below). This cost is added to the base cost
already charged by aarch64_builtin_vectorization_cost for a
vec_construct (which is assumed to cover the cost of the INSR or
equivalent instructions).
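
As a rough model of the adjusted cost (not the code in aarch64.cc), the
sketch below adds a per-element transfer charge to the base
vec_construct cost. The helper name and its parameters are hypothetical;
the real value of scalar_to_vec_cost comes from the selected tuning
tables and is not stated here.

```c
/* Hypothetical helper: cost of a vec_construct external def.
   NELTS is the number of vector elements, N_GPR_ELTS the number of
   those that must be moved from general-purpose registers, and
   SCALAR_TO_VEC_COST the cost of one such transfer.  */
unsigned
vec_construct_cost (unsigned nelts, unsigned n_gpr_elts,
                    unsigned scalar_to_vec_cost)
{
  /* Base cost, assumed to cover the INSR or equivalent.  */
  unsigned base = nelts / 2 + 1;
  return base + n_gpr_elts * scalar_to_vec_cost;
}
```

For the four-element example, an illustrative transfer cost of 1 would
raise the vec_construct cost from 3 to 7.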

This is justifiable because SIMD-to-SIMD insertions into a vector
register generally have lower latency and higher throughput than
GPR-to-SIMD insertions.

The basic structure of the code was copied from commit
90d693bdc9d71841f51d68826ffa5bd685d7f0bc, which modified the x86
backend in a similar way, but adapted to use a hash_set<tree>
instead of TREE_VISITED to guard against charging more than once for
the same scalar op feeding an external def.
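
The effect of that guard can be illustrated in plain C. This is a sketch
of the idea only: a linear scan stands in for hash_set<tree>, operands
are identified by pointer, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Count the distinct scalar operands among the N lanes in OPS, so that
   a scalar feeding several lanes is charged only once.  */
size_t
count_distinct_ops (const void *const *ops, size_t n)
{
  size_t distinct = 0;
  for (size_t i = 0; i < n; i++)
    {
      /* Has this operand already appeared in an earlier lane?  */
      bool seen = false;
      for (size_t j = 0; j < i && !seen; j++)
        seen = (ops[j] == ops[i]);
      if (!seen)
        distinct++;
    }
  return distinct;
}
```

A vector built from lanes {x, x, y, x} would therefore be charged for
two transfers rather than four.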

This commit assumes that constructing a vector from memory
is no more costly than the equivalent set of scalar loads (or at least
that any difference is incorporated in the cost returned by
aarch64_builtin_vectorization_cost for vec_construct). It also assumes
that constructing a vector from scalar values of floating-point type,
from a BIT_FIELD_REF/lastb that extracts from a vector register, or
from the result of a call to a built-in reduction function, does not
incur GPR-to-SIMD register transfer costs because such scalars are
typically already in FP/SIMD registers on AArch64.
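
These assumptions amount to a small decision table. The sketch below
restates them in C. The enum and function are hypothetical and do not
mirror the data structures used in the patch.

```c
#include <stdbool.h>

/* Hypothetical classification of where a scalar element comes from.  */
enum scalar_source
{
  SRC_INTEGER_GPR,     /* Integer value held in a general register.  */
  SRC_MEMORY_LOAD,     /* Value loaded from memory.  */
  SRC_FLOATING_POINT,  /* Scalar of floating-point type.  */
  SRC_VECTOR_EXTRACT,  /* BIT_FIELD_REF/lastb from a vector register.  */
  SRC_REDUCTION_CALL   /* Result of a built-in reduction function.  */
};

/* Return true if an element from SRC should be charged a GPR-to-SIMD
   transfer on top of the base vec_construct cost.  */
bool
needs_transfer_charge (enum scalar_source src)
{
  switch (src)
    {
    case SRC_MEMORY_LOAD:
    case SRC_FLOATING_POINT:
    case SRC_VECTOR_EXTRACT:
    case SRC_REDUCTION_CALL:
      /* Assumed to be in FP/SIMD registers already, or no worse than
         the equivalent scalar loads.  */
      return false;
    case SRC_INTEGER_GPR:
    default:
      return true;
    }
}
```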


Changed files:
- A: gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
- A: gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
- A: gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
- A: gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
- A: gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-1.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-10.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-11.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-12.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-2.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-3.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-4.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-5.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-6.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-7.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-8.c
- A: gcc/testsuite/gcc.target/aarch64/vec-construct-9.c
- M: gcc/config/aarch64/aarch64.cc


Christopher Bazley (1):
  AArch64: Add scalar-to-vector costs for vec_construct

 gcc/config/aarch64/aarch64.cc                 | 147 ++++++++++++++++++
 .../gcc.target/aarch64/sve/vec_construct_1.c  |  37 +++++
 .../gcc.target/aarch64/sve/vec_construct_2.c  |  42 +++++
 .../gcc.target/aarch64/sve/vec_construct_3.c  |  39 +++++
 .../gcc.target/aarch64/sve/vec_construct_4.c  |  37 +++++
 .../gcc.target/aarch64/sve/vec_construct_5.c  |  37 +++++
 .../gcc.target/aarch64/vec-construct-1.c      |  28 ++++
 .../gcc.target/aarch64/vec-construct-10.c     |  42 +++++
 .../gcc.target/aarch64/vec-construct-11.c     |  37 +++++
 .../gcc.target/aarch64/vec-construct-12.c     |  35 +++++
 .../gcc.target/aarch64/vec-construct-2.c      |  33 ++++
 .../gcc.target/aarch64/vec-construct-3.c      |  30 ++++
 .../gcc.target/aarch64/vec-construct-4.c      |  38 +++++
 .../gcc.target/aarch64/vec-construct-5.c      |  34 ++++
 .../gcc.target/aarch64/vec-construct-6.c      |  42 +++++
 .../gcc.target/aarch64/vec-construct-7.c      |  37 +++++
 .../gcc.target/aarch64/vec-construct-8.c      |  41 +++++
 .../gcc.target/aarch64/vec-construct-9.c      |  35 +++++
 18 files changed, 771 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-10.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-11.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-12.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-6.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-7.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-construct-9.c

-- 
2.54.0
