Hello,

On Tue, 17 Mar 2026, Richard Biener via Gcc wrote:

> The issue is that (mem:<vectype> (reg:<vectype>)) does not play
> nicely with the idea that a (mem:...) accesses contiguous memory

That's the big thing indeed.  If it were only MEM_ATTRs the solution would 
be simple: assert that there aren't any on those vMEMs (or, as Andrew 
suggests later, only a sane subset).  I think there are more places that 
conceptually assume such a contiguous access, like disambiguation and 
similar, even without MEM_ATTRs, in the sense that if those places think 
they have figured out a lower bound of the base address and an upper bound 
of the access size, then they assume that nothing outside that range is 
accessed.

I also think that all those could be fixed as well (e.g. by giving up).

Furthermore I think at some point we do need a proper representation of the 
concept behind scatter/gather MEMs, where "proper" is not "an RTL 
vec_concat of MEMs".  If we had gone that vec_concat route when vector 
modes were introduced, and had represented vector REGs as a vec_concat as 
well, instead of the current top-level RTL REG, we would all be mad by 
now.

So, IMHO a top-level construct for "N-lane MEM access with N-lane 
addresses" is the right thing to do (and was, for a long time).  The only 
question is specifics: should it be a MEM, or a new top-level code?  
Should the only difference between a MEM as-of-now and the vMEM be the 
fact that the address has a vector mode?  Or flags on the MEM?

(IMHO: MEM with vMODE addresses is enough, but see below for a case of 
new toplevel code).
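
For illustration, a vMEM of that first kind could then look simply like 
this (syntax hypothetical, modes borrowed from GCN's 64-lane vectors):

    ;; 64-lane gather: one DImode address per lane,
    ;; one SImode datum loaded per lane
    (set (reg:V64SI v1)
         (mem:V64SI (reg:V64DI v0)))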

Which transformations should be allowed to be represented within the 
addresses?  Should it only be a vMODE REG?  Could it be more, like the 
scalar offset that's added to all lanes that Andrew's architecture 
has, or a scalar scale by which each lane is multiplied?  How to represent 
that?  If the vMEM were a separate top-level RTL, it could have two 
slots, one for the base addresses (vMODE), and one for an arithmetic 
scalar transform applied to each lane (word_mode).  With a MEM that's more 
complicated: the transform would somehow have to be wrapped into the vMODE 
address itself.  But the latter might be convenient in other places as 
well, for instance when calculating such an address vector without an 
actual memory access.
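
To make that contrast concrete for the scalar-offset case, the two 
alternatives might look like this (all syntax hypothetical, including the 
vec_mem code name):

    ;; (a) separate top-level code with two slots:
    ;; lane addresses (vMODE) and a scalar offset added to each lane
    (vec_mem:V64SI (reg:V64DI v0) (const_int 16))

    ;; (b) plain MEM, with the scalar transform wrapped into
    ;; the vMODE address itself
    (mem:V64SI (plus:V64DI (reg:V64DI v0)
                           (vec_duplicate:V64DI (const_int 16))))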

And so on...

But I think when Andrew wants to put in the work to make this ... well, 
work, then it would be good for GCC.


Ciao,
Michael.


> as indicated by MEM_ATTRs.  A "proper" representation for a gather
> might be a new (vec_concat_multiple:<vector> [ (mem:<scalar> ..)
> (mem:<scalar> ..) ... ])
> or as all targets(?) do, an UNSPEC.  That vec_concat_multiple could be
> called vec_gather then, but I'd not imply the MEM here.  For GCN
> you'd then have nested (subreg ..) of the address vector.  Quite ugly,
> considering the large number of lanes for GCN.
> 
> > Thanks in advance.
> >
> > Andrew
> > ----------
> >
> >
> > Background ...
> >
> > I've often said that on GCN "all loads and stores are gather/scatter",
> > because there's no instruction for "load a whole vector starting at this
> > base address". But, that's not really true, because, at least in GCC
> > terminology, gather/scatter uses a scalar base address with a vector of
> > offsets with a scalar multiplier, which GCN also *cannot* do. [1]
> >
> > What GCN *can* do is take a vector of arbitrary addresses and load/store
> > all of them in parallel.  It can then add an identical scalar offset to
> > each address.  There doesn't need to be any relationship, or pattern
> > between the addresses (although I believe the hardware may well optimize
> > accesses to contiguous data).  Each address refers to a single element
> > of data, so it really is like gluing together N scalar load instructions
> > into one.
> >
> > So, whenever GCC tries to load a contiguous vector, or does a
> > gather_load or scatter_store, the backend converts this into an unspec
> > that has the vector of addresses, which could be much more neatly
> > represented as a MEM with a vector "base".
> >
> > The last straw came when I wanted to implement vector atomics. The
> > atomic instructions have a lot of if-then-else with cache handling for
> > different device features, and I was looking at having to reproduce or
> > refactor it all to add new insns that use new unspecs similar to the
> > existing gather/scatter patterns, with all the different base+offset
> > combinations.  Which would mean yet more places to touch each time we
> > support a new device with a new cache configuration. But at the end of
> > all of it, the actual instruction produced would be identical (apart
> > from there being a different value in the vector mask register).
> >
> > I also anticipate that the new MEM will help with another project I'm
> > working on right now.
> >
> >
> >
> >
> > [1] The "global_load" instruction can do scalar_base+vector_offset (no
> > multiplier), but only in one address space that is too limited for
> > general use.  The more useful "flat_load" instruction is strictly vector
> > addresses only.
> 
