On 23/07/2025 13:24, Richard Biener wrote:
On Wed, Jul 23, 2025 at 1:51 PM Andrew Stubbs <a...@baylibre.com> wrote:
From: Julian Brown <jul...@codesourcery.com>
This patch was originally written by Julian in 2021 for the OG10 branch,
but does not appear to have been proposed for upstream at that time, or
since. I've now forward-ported it and retested it. Thomas reported
test regressions with this patch on the OG14 branch, but I think it was
exposing some bugs in the backend; I can't reproduce those failures on
mainline.
I'm not sure what the original motivating test case was, but I see that
the gfortran.dg/vect/fast-math-pr37021.f90 testcase is reduced from ~24k
lines of assembler down to under 7k on amdgcn.
OK for mainline?
I do wonder if the single_element_p check isn't for correctness? And how
the patch makes a difference when we still require SLP_TREE_LANES
(slp_node) == 1?
The SLP_TREE_LANES (slp_node) == 1 requirement was already in place in
the case where single_element_p was set, so I had assumed that was
acceptable. I don't know enough about this stuff to completely
understand what Julian has done here (and I don't suppose there's much
point in asking him now, after all this time).
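For reference, the guard in question reads like this (quoted from the
tree-vect-stmts.cc hunk at the end of the patch; trailing arguments
elided):

  if ((*memory_access_type == VMAT_ELEMENTWISE
       || *memory_access_type == VMAT_STRIDED_SLP)
      && single_element_p
      && SLP_TREE_LANES (slp_node) == 1
      && loop_vinfo
      && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo, ...))

and the patch only widens the single_element_p condition to
(targetm.vectorize.prefer_gather_scatter || single_element_p), leaving
the SLP_TREE_LANES requirement intact.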
Without this patch, the testcase has this in the "vect" dump:
_1417 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1415];
ivtmp_1418 = ivtmp_1415 + _1147;
_1419 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1418];
ivtmp_1420 = ivtmp_1418 + _1147;
_1421 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1420];
ivtmp_1422 = ivtmp_1420 + _1147;
_1423 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1422];
ivtmp_1424 = ivtmp_1422 + _1147;
_1425 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1424];
ivtmp_1426 = ivtmp_1424 + _1147;
_1427 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1426];
:
:
[repeats 32 times]
:
:
vect_cst__1481 = {_1417, _1419, _1421, _1423, _1425, _1427, ......
_1482 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1480];
ivtmp_1483 = ivtmp_1480 + _1147;
_1484 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1483];
ivtmp_1485 = ivtmp_1483 + _1147;
_1486 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1485];
ivtmp_1487 = ivtmp_1485 + _1147;
_1488 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1487];
ivtmp_1489 = ivtmp_1487 + _1147;
:
:
[repeats 32 times]
:
:
vect_cst__1545 = {_1482, _1484, _1486, _1488, _1490, _1492, ......
vect__42.84_1546 = VEC_PERM_EXPR <vect_cst__1481, vect_cst__1545,
{ 2, 4, 6, 8, 10, 12, 14, 16, .....
And with the patch, the whole group becomes a single masked gather
whose per-lane offsets are produced by a VEC_SERIES_EXPR:
_472 = stride.3_51 * 16;
_473 = (sizetype) _472;
_474 = VEC_SERIES_EXPR <0, _473>;
vect__42.66_504 = .MASK_GATHER_LOAD (vectp.64_501, _474, 1, { 0.0, ...
The pattern occurs a few times in this testcase. (Although there remain
a number of cases where the vectorizer falls back to elementwise
BIT_FIELD_REF accesses that I would prefer were done in some vectorized
way.)
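To give a feel for the shape of the problem, here is an illustrative C
sketch (not the actual Fortran testcase) of the kind of strided access
involved:

  /* Illustrative only: a load whose stride is unknown at compile time.
     Without the patch the vectorizer builds the vector from many small
     strided loads plus a VEC_PERM_EXPR, as in the first dump above;
     with the patch (on a target that sets the hook) it emits a single
     .MASK_GATHER_LOAD with VEC_SERIES_EXPR-generated offsets.  */
  void
  foo (double *restrict dst, const double *restrict src,
       int n, int stride)
  {
    for (int i = 0; i < n; i++)
      dst[i] = 2.0 * src[i * stride];
  }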
Richard.
Andrew
------------
For AMD GCN, the instructions available for loading/storing vectors are
always scatter/gather operations (i.e. there are separate addresses for
each vector lane), so the heuristic in get_group_load_store_type that
avoids gather/scatter operations with too many elements is
counterproductive. Avoiding such operations there can subsequently lead
to a missed vectorization opportunity: later analyses in the vectorizer
try to use a very wide array type which is not available on this
target, and vectorization fails.
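To make that concrete, a GCN vector load is natively a per-lane gather:
every lane supplies its own address, and a contiguous load is merely
the special case of consecutive addresses. A rough scalar model of the
operation (an illustrative sketch only; masking and addressing
simplified):

  /* Illustrative model of a GCN vector load: each of the 64 lanes has
     its own address, so a gather costs no more than a "contiguous"
     load, which is just the case offsets[lane] == lane * elt_size.  */
  void
  gcn_vector_load_model (double dst[64], const char *base,
                         const long offsets[64], const _Bool mask[64])
  {
    for (int lane = 0; lane < 64; lane++)
      if (mask[lane])
        dst[lane] = *(const double *) (base + offsets[lane]);
  }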
This patch adds a target hook to override the "single_element_p"
heuristic in that function, and enables the hook for GCN. This allows
much better code to be generated for affected loops.
Co-authored-by: Julian Brown <jul...@codesourcery.com>
gcc/
* doc/tm.texi.in (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Document
new hook.
* doc/tm.texi: Regenerate.
* target.def (prefer_gather_scatter): New hook.
* tree-vect-stmts.cc (get_group_load_store_type): Optionally prefer
gather/scatter instructions to scalar/elementwise fallback.
* config/gcn/gcn.cc (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Define
hook.
---
gcc/config/gcn/gcn.cc | 2 ++
gcc/doc/tm.texi | 5 +++++
gcc/doc/tm.texi.in | 2 ++
gcc/target.def | 8 ++++++++
gcc/tree-vect-stmts.cc | 2 +-
5 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 3b26d5c6a58..d451bf43355 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -7998,6 +7998,8 @@ gcn_dwarf_register_span (rtx rtl)
gcn_vector_alignment_reachable
#undef TARGET_VECTOR_MODE_SUPPORTED_P
#define TARGET_VECTOR_MODE_SUPPORTED_P gcn_vector_mode_supported_p
+#undef TARGET_VECTORIZE_PREFER_GATHER_SCATTER
+#define TARGET_VECTORIZE_PREFER_GATHER_SCATTER true
#undef TARGET_DOCUMENTATION_NAME
#define TARGET_DOCUMENTATION_NAME "AMD GCN"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 5e305643b3a..29177d81466 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6511,6 +6511,11 @@ The default is @code{NULL_TREE} which means to not vectorize scatter
stores.
@end deftypefn
+@deftypevr {Target Hook} bool TARGET_VECTORIZE_PREFER_GATHER_SCATTER
+This hook is set to @code{true} if gather loads or scatter stores are
+cheaper on this target than a sequence of elementwise loads or stores.
+@end deftypevr
+
@deftypefn {Target Hook} int TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN (struct cgraph_node *@var{}, struct cgraph_simd_clone *@var{}, @var{tree}, @var{int}, @var{bool})
This hook should set @var{vecsize_mangle}, @var{vecsize_int}, @var{vecsize_float}
fields in @var{simd_clone} structure pointed by @var{clone_info} argument and also
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index eccc4d88493..b03ad4c97c6 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4311,6 +4311,8 @@ address; but often a machine-dependent strategy can generate better code.
@hook TARGET_VECTORIZE_BUILTIN_SCATTER
+@hook TARGET_VECTORIZE_PREFER_GATHER_SCATTER
+
@hook TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN
@hook TARGET_SIMD_CLONE_ADJUST
diff --git a/gcc/target.def b/gcc/target.def
index 38903eb567a..dd57b7072af 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2056,6 +2056,14 @@ all zeros. GCC can then try to branch around the instruction instead.",
(unsigned ifn),
default_empty_mask_is_expensive)
+/* Prefer gather/scatter loads/stores to e.g. elementwise accesses if
+   we cannot use a contiguous access.  */
+DEFHOOKPOD
+(prefer_gather_scatter,
+ "This hook is set to @code{true} if gather loads or scatter stores are\n\
+cheaper on this target than a sequence of elementwise loads or stores.",
+ bool, false)
+
/* Target builtin that implements vector gather operation. */
DEFHOOK
(builtin_gather,
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 2e9b3d2e686..8ca33f5951a 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2349,7 +2349,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
allows us to use contiguous accesses. */
if ((*memory_access_type == VMAT_ELEMENTWISE
|| *memory_access_type == VMAT_STRIDED_SLP)
- && single_element_p
+ && (targetm.vectorize.prefer_gather_scatter || single_element_p)
&& SLP_TREE_LANES (slp_node) == 1
&& loop_vinfo
&& vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
--
2.50.0