On Tue, Mar 17, 2026 at 3:12 PM Michael Matz <[email protected]> wrote:
>
> Hello,
>
> On Tue, 17 Mar 2026, Richard Biener via Gcc wrote:
>
> > The issue is that (mem:<vectype> (reg:<vectype>)) does not play
> > nicely with the idea that a (mem:...) accesses contiguous memory
>
> That's the big thing indeed.  If it were only MEM_ATTRs the solution is
> simple: assert that there aren't any on those vMEMs (or as Andrew suggests
> later, only a sane subset).  I think there are more places that
> conceptually assume such a contiguous access, like disambiguation and
> similar, without MEM_ATTRs, in the sense that if those places think they
> have figured out a lower bound of the base address and an upper bound of
> the access size, then they assume that nothing outside that range is
> accessed.

That scatters can contain (well-defined) internal conflicts (WAW conflicts
between lanes) does not help either.  Either the RTL representation disallows
them, but then intrinsics cannot map onto this scheme, or we somehow have to
deal with them.  I guess a scatter should then be a black box, meaning you can
neither split one into multiple scatters nor combine multiple scatters into one.

> I also think that all those could be fixed as well (e.g. by giving up).
>
> Furthermore I think that at some point we do need a proper representation of the
> concept behind scatter/gather MEMs, where "proper" is not "a RTL
> vec_concat of MEMs".  If we went that vec_concat route when vector modes
> were introduced and we had represented vector REGs as a vec_concat as
> well, instead of the current top-level RTL REG, we would all be mad by
> now.
>
> So, IMHO a top-level construct for "N-lane MEM access with N-lane
> addresses" is the right thing to do (and was, for a long time).  The only
> question is specifics: should it be a MEM, or a new top-level code?
> Should the only difference between a MEM as-of-now and the vMEM be the
> fact that the address has a vector mode?  Or flags on the MEM?
>
> (IMHO: MEM with vMODE addresses is enough, but see below for a case of
> new toplevel code).
>
> Which transformations should be allowed to be represented within the
> addresses?  Should it only be a vMODE REG?  Could it be more, like the
> scalar offset that Andrew's architecture adds to all lanes, or a scalar
> scale by which each lane is multiplied?  How would we represent that?
> If the vMEM were a separate top-level RTL code, it could have two
> slots, one for the base addresses (vMODE), and one for an arithmetic
> scalar transform applied to each lane (word_mode).  With a MEM that's more
> complicated and would somehow have to be wrapped in the vMODE address.
> But the latter might be convenient in other places as well, for instance
> when calculating such address vector without actual memory access.
>
> And so on...
>
> But I think when Andrew wants to put in the work to make this ... well,
> work, then it would be good for GCC.

I think the recent discussion on how to represent (len-)masking and else
values also comes into play here, given that we at least have masked
variants of gathers and scatters.

Richard.

>
>
> Ciao,
> Michael.
>
>
> > as indicated by MEM_ATTRs.  A "proper" representation for a gather
> > might be a new (vec_concat_multiple:<vector> [ (mem:<scalar> ..)
> > (mem:<scalar> ..) ... ])
> > or as all targets(?) do, an UNSPEC.  That vec_concat_multiple could be
> > called vec_gather then, but I'd not imply the MEM here.  For GCN
> > you'd then have nested (subreg ..) of the address vector.  Quite ugly,
> > considering the large number of lanes for GCN.
> >
> > > Thanks in advance.
> > >
> > > Andrew
> > > ----------
> > >
> > >
> > > Background ...
> > >
> > > I've often said that on GCN "all loads and stores are gather/scatter",
> > > because there's no instruction for "load a whole vector starting at this
> > > base address". But, that's not really true, because, at least in GCC
> > > terminology, gather/scatter uses a scalar base address with a vector of
> > > offsets and a scalar multiplier, which GCN also *cannot* do. [1]
> > >
> > > What GCN *can* do is take a vector of arbitrary addresses and load/store
> > > all of them in parallel.  It can then add an identical scalar offset to
> > > each address.  There doesn't need to be any relationship or pattern
> > > between the addresses (although I believe the hardware may well optimize
> > > accesses to contiguous data).  Each address refers to a single element
> > > of data, so it really is like gluing together N scalar load instructions
> > > into one.
> > >
> > > So, whenever GCC tries to load a contiguous vector, or does a
> > > gather_load or scatter_store, the backend converts this into an unspec
> > > that has the vector of addresses, which could be much more neatly
> > > represented as a MEM with a vector "base".
> > >
> > > The last straw came when I wanted to implement vector atomics. The
> > > atomic instructions have a lot of if-then-else with cache handling for
> > > different device features, and I was looking at having to reproduce or
> > > refactor it all to add new insns that use new unspecs similar to the
> > > existing gather/scatter patterns, with all the different base+offset
> > > combinations.  Which would mean yet more places to touch each time we
> > > support a new device with a new cache configuration. But at the end of
> > > all of it, the actual instruction produced would be identical (apart
> > > from there being a different value in the vector mask register).
> > >
> > > I also anticipate that the new MEM will help with another project I'm
> > > working on right now.
> > >
> > >
> > >
> > >
> > > [1] The "global_load" instruction can do scalar_base+vector_offset (no
> > > multiplier), but only in one address space that is too limited for
> > > general use.  The more useful "flat_load" instruction is strictly vector
> > > addresses only.
> >
