On 17/03/2026 14:12, Michael Matz wrote:
Hello,
On Tue, 17 Mar 2026, Richard Biener via Gcc wrote:
The issue is that (mem:<vectype> (reg:<vectype>)) does not play
nicely with the idea that a (mem:...) accesses contiguous memory
That's the big thing indeed. If it were only MEM_ATTRs, the solution would
be simple: assert that there aren't any on those vMEMs (or, as Andrew suggests
later, only a sane subset). I think there are more places that
conceptually assume such a contiguous access, like disambiguation and
similar, without MEM_ATTRs, in the sense that if those places think they
have figured out a lower bound of the base address and an upper bound of
the access size, then they assume that nothing outside that range is
accessed.
I also think that all those could be fixed as well (e.g. by giving up).
I have run out of places that ICE or cause test failures (for C) when
the mode is a vector. Do you have any other suggestions to find them?
(I already plan to test C++ and Fortran.)
Furthermore, I think at some point we do need a proper representation of the
concept behind scatter/gather MEMs, where "proper" is not "an RTL
vec_concat of MEMs". If we had gone down that vec_concat route when vector
modes were introduced, and had represented vector REGs as a vec_concat as
well instead of the current top-level RTL REG, we would all be mad by
now.
So, IMHO a top-level construct for "N-lane MEM access with N-lane
addresses" is the right thing to do (and was, for a long time). The only
question is specifics: should it be a MEM, or a new top-level code?
Should the only difference between a MEM as-of-now and the vMEM be the
fact that the address has a vector mode? Or flags on the MEM?
I would like it to be a MEM for the simple reason that
legitimate_address/legitimize_address allow much more flexibility than
having to match the whole RTL pattern.
MEMs can be used with match_operand in ways that other constructs cannot
(AFAIK). Combine et al know how to handle them both as a whole and as
parts. We can treat them as arbitrary operands all the way to constraint
resolution, if we choose.
Adding a new code would avoid existing broken assumptions, but would
trade that for losing all the support that still holds. It's probably
easier (more realistic) for me to disable things that don't work with
vectors than to replace all the things that do work, throughout the RTL
passes.
(IMHO: a MEM with vMODE addresses is enough, but see below for a case
where a new top-level code would help.)
Which transformations should be allowed to be represented within the
addresses? Should it only be a vMODE REG? Could it be more, like the
scalar offset that's added to all lanes that Andrew's architecture would
have, or a scalar scale by which each lane is multiplied? How to represent
that? If the vMEM would be a separate top-level RTL, it could have two
slots, one for the base addresses (vMODE), and one for an arithmetic
scalar transform applied to each lane (word_mode). With a MEM that's more
complicated and would somehow have to be wrapped in the vMODE address.
But the latter might be convenient in other places as well, for instance
when calculating such address vector without actual memory access.
My prototype legitimate_address+print_operand isn't quite
right/finished yet, but I think I'll need it to permit these:
  ; Straight vector of addresses
  (mem (reg:V64DI))

  ; Vector of addresses, plus scalar offset
  (mem (plus:V64DI
         (reg:V64DI)
         (const_vector:V64DI (const_int))))

  ; Scalar base, plus vector of offsets
  (mem (plus:V64DI
         (vec_duplicate:V64DI (reg:DI))
         (zero_extend:V64DI (reg:V64SI))))

  ; Scalar base, plus vector of offsets, plus scalar offset
  (mem (plus:V64DI
         (plus:V64DI
           (vec_duplicate:V64DI (reg:DI))
           (zero_extend:V64DI (reg:V64SI)))
         (const_vector:V64DI (const_int))))
In each case, the const_vector needs to be the same element repeated. It
could be represented as (vec_duplicate:V64DI (const_int)) also, but that
was less convenient when I was playing with it.
We currently implement these as two independent "gather_load"
define_insn instances, using "plus zero" to represent the ones without
scalar offsets. Using MEM and constraints, I have replaced all those with
an alternative in the "mov<mode>" insn.
And so on...
But I think when Andrew wants to put in the work to make this ... well,
work, then it would be good for GCC.
I don't have unlimited budget/schedule on this, but I'd like to give it
a shot.
Ciao,
Michael.
Thanks for your support.
Andrew
as indicated by MEM_ATTRs. A "proper" representation for a gather
might be a new (vec_concat_multiple:<vector> [ (mem:<scalar> ..)
(mem:<scalar> ..) ... ])
or as all targets(?) do, an UNSPEC. That vec_concat_multiple could be
called vec_gather then, but I'd not imply the MEM here. For GCN
you'd then have nested (subreg ..) of the address vector. Quite ugly,
considering the large number of lanes for GCN.
Thanks in advance.
Andrew
----------
Background ...
I've often said that on GCN "all loads and stores are gather/scatter",
because there's no instruction for "load a whole vector starting at this
base address". But, that's not really true, because, at least in GCC
terminology, gather/scatter uses a scalar base address with a vector of
offsets and a scalar multiplier, which GCN also *cannot* do. [1]
What GCN *can* do is take a vector of arbitrary addresses and load/store
all of them in parallel. It can then add an identical scalar offset to
each address. There doesn't need to be any relationship or pattern
between the addresses (although I believe the hardware may well optimize
accesses to contiguous data). Each address refers to a single element
of data, so it really is like gluing together N scalar load instructions
into one.
So, whenever GCC tries to load a contiguous vector, or does a
gather_load or scatter_store, the backend converts this into an unspec
that has the vector of addresses, which could be much more neatly
represented as a MEM with a vector "base".
The last straw came when I wanted to implement vector atomics. The
atomic instructions have a lot of if-then-else with cache handling for
different device features, and I was looking at having to reproduce or
refactor it all to add new insns that use new unspecs similar to the
existing gather/scatter patterns, with all the different base+offset
combinations. Which would mean yet more places to touch each time we
support a new device with a new cache configuration. But at the end of
all of it, the actual instruction produced would be identical (apart
from there being a different value in the vector mask register).
I also anticipate that the new MEM will help with another project I'm
working on right now.
[1] The "global_load" instruction can do scalar_base+vector_offset (no
multiplier), but only in one address space that is too limited for
general use. The more useful "flat_load" instruction is strictly vector
addresses only.