On Tue, Mar 17, 2026 at 3:12 PM Michael Matz <[email protected]> wrote:
>
> Hello,
>
> On Tue, 17 Mar 2026, Richard Biener via Gcc wrote:
>
> > The issue is that (mem:<vectype> (reg:<vectype>)) does not play
> > nicely with the idea that a (mem:...) accesses contiguous memory
>
> That's the big thing indeed.  If it were only MEM_ATTRs the solution is
> simple: assert that there aren't any on those vMEMs (or, as Andrew
> suggests later, only a sane subset).  I think there are more places that
> conceptually assume such a contiguous access, like disambiguation and
> similar, without MEM_ATTRs, in the sense that if those places think they
> have figured out a lower bound of the base address and an upper bound of
> the access size, then they assume that nothing outside that range is
> accessed.
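To make the contiguity assumption concrete, here is a sketch of an ordinary
vector MEM next to such a vMEM (modes and register numbers are illustrative
only, not taken from any actual backend):

    ;; Ordinary vector load: scalar DImode base address; the access
    ;; covers the contiguous bytes [base, base+16), so MEM_SIZE,
    ;; MEM_OFFSET and alias analysis can bound it.
    (set (reg:V4SI 100)
         (mem:V4SI (reg:DI 101)))

    ;; Gather-style vMEM: the address operand itself has a vector
    ;; mode, one lane address per element.  There is no contiguous
    ;; [base, base+size) range, so any code that derives such bounds
    ;; from the address is wrong here.
    (set (reg:V4SI 100)
         (mem:V4SI (reg:V4DI 102)))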
That there can be (well-defined) conflicts within a scatter (WAW
conflicts) does not help either.  Either an RTL representation would
disallow that, but then intrinsics cannot map to this scheme, or we
somehow have to deal with it.  I guess it should be a black box, meaning
you cannot combine or split a scatter into/from multiple scatters.

> I also think that all those could be fixed as well (e.g. by giving up).
>
> Furthermore I think we do at some point need a proper representation of
> the concept behind scatter/gather MEMs, where "proper" is not "an RTL
> vec_concat of MEMs".  If we had gone that vec_concat route when vector
> modes were introduced and had represented vector REGs as a vec_concat as
> well, instead of the current top-level RTL REG, we would all be mad by
> now.
>
> So, IMHO a top-level construct for "N-lane MEM access with N-lane
> addresses" is the right thing to do (and was, for a long time).  The only
> question is specifics: should it be a MEM, or a new top-level code?
> Should the only difference between a MEM as-of-now and the vMEM be the
> fact that the address has a vector mode?  Or flags on the MEM?
>
> (IMHO: MEM with vMODE addresses is enough, but see below for a case of
> a new toplevel code).
>
> Which transformations should be allowed to be represented within the
> addresses?  Should it only be a vMODE REG?  Could it be more, like the
> scalar offset that's added to all lanes that Andrew's architecture would
> have, or a scalar scale that's multiplied to each lane?  How to represent
> that?  If the vMEM were a separate top-level RTL code, it could have two
> slots, one for the base addresses (vMODE), and one for an arithmetic
> scalar transform applied to each lane (word_mode).  With a MEM that's more
> complicated and would somehow have to be wrapped in the vMODE address.
> But the latter might be convenient in other places as well, for instance
> when calculating such an address vector without an actual memory access.
>
> And so on...
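For comparison, the two variants discussed above might look like this
(the vec_mem code and all modes/operands are purely hypothetical, just to
illustrate the shape of each option):

    ;; (a) Hypothetical dedicated top-level code with two slots: a
    ;; vector of per-lane base addresses plus a scalar transform
    ;; applied to every lane.
    (set (vec_mem:V4SI (reg:V4DI 102)    ; per-lane base addresses
                       (const_int 16))   ; scalar offset added to each lane
         (reg:V4SI 100))                 ; scatter the four lanes

    ;; (b) MEM-based alternative: the same information has to be
    ;; wrapped inside the vector-mode address expression instead.
    (set (mem:V4SI (plus:V4DI (reg:V4DI 102)
                              (vec_duplicate:V4DI (const_int 16))))
         (reg:V4SI 100))

As noted above, form (b) has the advantage that the address arithmetic is
ordinary vector RTL and so is also usable outside a memory access.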
>
> But I think when Andrew wants to put in the work to make this ... well,
> work, then it would be good for GCC.

I think the recent discussion on how to represent (len-)masking and else
values also comes into play here, given that we at least have masked
variants of gathers and scatters.

Richard.

>
> Ciao,
> Michael.
>
>
> > as indicated by MEM_ATTRs.  A "proper" representation for a gather
> > might be a new (vec_concat_multiple:<vector> [ (mem:<scalar> ..)
> > (mem:<scalar> ..) ... ])
> > or, as all targets(?) do, an UNSPEC.  That vec_concat_multiple could be
> > called vec_gather then, but I'd not imply the MEM here.  For GCN
> > you'd then have nested (subreg ..) of the address vector.  Quite ugly,
> > considering the large number of lanes for GCN.
> >
> > > Thanks in advance.
> > >
> > > Andrew
> > > ----------
> > >
> > >
> > > Background ...
> > >
> > > I've often said that on GCN "all loads and stores are gather/scatter",
> > > because there's no instruction for "load a whole vector starting at this
> > > base address".  But that's not really true, because, at least in GCC
> > > terminology, gather/scatter uses a scalar base address with a vector of
> > > offsets and a scalar multiplier, which GCN also *cannot* do. [1]
> > >
> > > What GCN *can* do is take a vector of arbitrary addresses and load/store
> > > all of them in parallel.  It can then add an identical scalar offset to
> > > each address.  There doesn't need to be any relationship or pattern
> > > between the addresses (although I believe the hardware may well optimize
> > > accesses to contiguous data).  Each address refers to a single element
> > > of data, so it really is like gluing together N scalar load instructions
> > > into one.
> > >
> > > So, whenever GCC tries to load a contiguous vector, or does a
> > > gather_load or scatter_store, the backend converts this into an unspec
> > > that has the vector of addresses, which could be much more neatly
> > > represented as a MEM with a vector "base".
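Spelling out the vec_concat-of-MEMs form quoted above for just four lanes
shows the per-lane (subreg ..) of the address vector it would need (again,
modes and register numbers are illustrative only):

    (set (reg:V4SI 100)
         (vec_concat_multiple:V4SI
           [(mem:SI (subreg:DI (reg:V4DI 102) 0))
            (mem:SI (subreg:DI (reg:V4DI 102) 8))
            (mem:SI (subreg:DI (reg:V4DI 102) 16))
            (mem:SI (subreg:DI (reg:V4DI 102) 24))]))

Already unwieldy at four lanes; at GCN's 64 lanes the expression would be
unmanageable, which is the objection raised above.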
> > >
> > > The last straw came when I wanted to implement vector atomics.  The
> > > atomic instructions have a lot of if-then-else with cache handling for
> > > different device features, and I was looking at having to reproduce or
> > > refactor it all to add new insns that use new unspecs similar to the
> > > existing gather/scatter patterns, with all the different base+offset
> > > combinations, which would mean yet more places to touch each time we
> > > support a new device with a new cache configuration.  But at the end of
> > > all of it, the actual instruction produced would be identical (apart
> > > from there being a different value in the vector mask register).
> > >
> > > I also anticipate that the new MEM will help with another project I'm
> > > working on right now.
> > >
> > >
> > >
> > > [1] The "global_load" instruction can do scalar_base+vector_offset (no
> > > multiplier), but only in one address space that is too limited for
> > > general use.  The more useful "flat_load" instruction is strictly vector
> > > addresses only.
> >
