On Tue, Mar 17, 2026 at 1:01 PM Andrew Stubbs <[email protected]> wrote:
>
> Is there any reason why a MEM cannot take a vector of addresses, other
> than the few cases fixed in the attached patch?
>
> It would make perfect sense for AMD GCN to do this, so I would like to
> know if such a patch would be acceptable to the maintainers, or if there
> are likely to be technical showstoppers? (Initial testing of the
> prototype patches seems promising).
>
> I've attached 3 prototype patches to illustrate (not really for review):
>
> 1. Enough middle-end changes to not ICE.
>
> 2. The amdgcn backend changes to make such MEMs "legitimate", add the
> instructions and constraints that can use them, and add support for the
> different forms in print_operand. (There's a few bits regarding
> vec_duplicate of offsets that are the result of some experimentation I
> did and are not strictly in use here, but you can get the idea, I think.)
>
> 3. A basic implementation of the vector atomics that motivated this
> request in the first place, but is not strictly "part of it".
>
> Obviously, none of this is for GCC 16.

The issue is that (mem:<vectype> (reg:<vectype>)) does not play nicely
with the idea that a (mem:...) accesses contiguous memory, as indicated
by its MEM_ATTRs. A "proper" representation for a gather might be a new

  (vec_concat_multiple:<vector> [ (mem:<scalar> ..) (mem:<scalar> ..) ... ])

or, as all targets(?) do, an UNSPEC. That vec_concat_multiple could then
be called vec_gather, but I'd not imply the MEM here. For GCN you'd then
have nested (subreg ..) of the address vector, which is quite ugly
considering the large number of lanes on GCN.

> Thanks in advance.
>
> Andrew
> ----------
>
>
> Background ...
>
> I've often said that on GCN "all loads and stores are gather/scatter",
> because there's no instruction for "load a whole vector starting at this
> base address". But, that's not really true, because, at least in GCC
> terminology, gather/scatter uses a scalar base address with a vector of
> offsets with a scalar multiplier, which GCN also *cannot* do. [1]
>
> What GCN *can* do is take a vector of arbitrary addresses and load/store
> all of them in parallel. It can then add an identical scalar offset to
> each address. There doesn't need to be any relationship, or pattern
> between the addresses (although I believe the hardware may well optimize
> accesses to contiguous data). Each address refers to a single element
> of data, so it really is like gluing together N scalar load instructions
> into one.
>
> So, whenever GCC tries to load a contiguous vector, or does a
> gather_load or scatter_store, the backend converts this in to an unspec
> that has the vector of addresses, which could be much more neatly
> represented as a MEM with a vector "base".
>
> The last straw came when I wanted to implement vector atomics. The
> atomic instructions have a lot of if-then-else with cache handling for
> different device features, and I was looking at having to reproduce or
> refactor it all to add new insns that use new unspecs similar to the
> existing gather/scatter patterns, with all the different base+offset
> combinations. Which would mean yet more places to touch each time we
> support a new device with a new cache configuration. But at the end of
> all of it, the actual instruction produced would be identical (apart
> from there being a different value in the vector mask register).
>
> I also anticipate that the new MEM will help with another project I'm
> working on right now.
>
>
>
>
> [1] The "global_load" instruction can do scalar_base+vector_offset (no
> multiplier), but only in one address space that is too limited for
> general use. The more useful "flat_load" instruction is strictly vector
> addresses only.
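
To make the distinction in the quoted background concrete, here is a rough scalar model in C of the two addressing forms: a gather in GCC's sense (scalar base, vector of offsets, scalar multiplier) and a load from a vector of arbitrary addresses plus one common scalar offset. This is only an illustrative sketch, not GCC or GCN code. The function names, the LANES value and the int32_t element type are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: a small lane count for illustration; real GCN vectors
   have many more lanes.  */
#define LANES 4

/* Gather in GCC's sense: scalar base + vector of offsets * scalar scale.  */
static void
gather_base_offsets (int32_t dst[LANES], const char *base,
                     const ptrdiff_t offsets[LANES], ptrdiff_t scale)
{
  for (int i = 0; i < LANES; i++)
    memcpy (&dst[i], base + offsets[i] * scale, sizeof dst[i]);
}

/* Vector-of-addresses load: one arbitrary address per lane, plus an
   identical scalar offset added to every lane.  */
static void
load_address_vector (int32_t dst[LANES], const char *const addrs[LANES],
                     ptrdiff_t common_offset)
{
  for (int i = 0; i < LANES; i++)
    memcpy (&dst[i], addrs[i] + common_offset, sizeof dst[i]);
}

/* The same gather expressed by first computing each lane's address and
   then doing a vector-of-addresses load.  */
static void
gather_via_address_vector (int32_t dst[LANES], const char *base,
                           const ptrdiff_t offsets[LANES], ptrdiff_t scale)
{
  const char *addrs[LANES];
  for (int i = 0; i < LANES; i++)
    addrs[i] = base + offsets[i] * scale;
  load_address_vector (dst, addrs, 0);
}

/* Hypothetical self-check: both forms read the same elements.  */
static void
check_forms_agree (void)
{
  int32_t data[6] = { 10, 11, 12, 13, 14, 15 };
  ptrdiff_t offsets[LANES] = { 5, 0, 3, 1 };
  int32_t a[LANES], b[LANES];

  gather_base_offsets (a, (const char *) data, offsets, sizeof (int32_t));
  gather_via_address_vector (b, (const char *) data, offsets,
                             sizeof (int32_t));
  assert (memcmp (a, b, sizeof a) == 0);
}
```

In this model each lane of the address-vector form is an independent scalar load, so nothing ties the lanes to one contiguous region of memory. That is the property that sits awkwardly with a single (mem:...) carrying one set of MEM_ATTRs.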
