On Thu, Mar 19, 2026 at 10:36 PM Michael Clark <[email protected]> wrote:
>
> On 3/19/26 21:25, Richard Biener via Gcc wrote:
> > On Thu, Mar 19, 2026 at 5:14 AM Michael Clark <[email protected]> wrote:
> >>
> >> On 3/19/26 03:28, Richard Biener via Gcc wrote:
> >>> Yes, but for example on x86 the memory order for scatters is defined to
> >>> be left-to-right.
> >>
> >> which side is lane zero on? left or right? ;-)
> >>
> >> on little-endian I'd expect right-to-left if lane 0 were on the right.
> >>
> >> there’s a subtle interaction between TSO and vector scatters. although a
> >> vector instruction has a logical width (e.g. 512-bit), in practice it
> >> decomposes into architectural subgroups (often ~128-bit) and further
> >> into micro-architectural quanta (e.g. 256×2 execution, banked caches,
> >> store buffers). execution and visibility therefore occur in smaller,
> >> partially parallel units of execution, and different threads may observe
> >> partial progress depending on micro-architectural effects.
> >>
> >> TSO constrains the ordering of stores as they become globally visible,
> >> but it does not inherently require any ordering within a single vector
> >> instruction. intra-vector lane ordering is orthogonal to TSO.
> >>
> >> however, x86 defines a lane order for scatters (e.g. to determine which
> >> store wins on conflicts), introducing an additional ordering constraint
> >> at the ISA level that is not strictly required by TSO itself. while this
> >> does not imply full serialization or per-lane global visibility, it does
> >> constrain the abstract behavior more than a fully relaxed model would.
> >
> > It's probably modeled so you can vectorize
> >
> >    for (int i = 0; i < n; ++i)
> >      a[b[i]] = 42;
> >
> > because there's no way you can codify a runtime check to ensure
> > b[i] do not produce the same index.  AVX512F (which introduces scatters)
> > is complemented by AVX512CD to be able to also handle
> >
> >    for (int i = 0; i < n; ++i)
> >      a[b[i]]++;
> >
> > because this is not only WAW but RAW conflicts which makes splitting
> > the iteration domain upon conflicts necessary.
> >
> > That said - the *scatter* optabs assume naive vectorization of the
> > first loop works, even when b[] = { 0, 1, 2, 0 }, so if GCN is not able
> > to guarantee this their "vector address store" are not scatters in
> > terms of what GCC assumes.  The documentation for the optabs
> > does not mention this constraint.
>
> I would define lane order for a simulation to be something along the
> lines of: for each n from 1 to 7 (128-bit masks), give a primitive
> polynomial over GF(2) of degree n that generates a maximal-length
> pseudorandom sequence (i.e. cycles through all non zero elements of
> GF(2ⁿ)). and then zero will be appended afterwards as a convention.
> maybe seed it with the cycle counter for less predictability.
>
> >> this can be viewed as a potential performance cliff: a very strong
> >> ordering constraint applied within a single instruction, despite being
> >> orthogonal to the guarantees provided by TSO. I think it is too strong,
> >
> > True, this is one thing that can make scatters slow, in particular on
> > architectures like GCN.
>
> okay, but I don't write parallel code that depends on determinism for
> conflicts, and most GPU compilers don't either, at least afaik. it seems
> to be a degenerate case. there are lots of things on computers that
> support misconfigurations. and I read an advisory about the conflict
> detection instructions potentially being deprecated somewhere. I would
> double check that but I don't intend to use them for histograms.
>
> >> in practice, implementations still split, merge, and overlap work, and
> >> observable behavior is shaped by store buffering, cache banking, and
> >> other micro-architectural details. the effective execution width is
> >> therefore dynamic and may differ from the nominal vector width.
> >>
> >> as a result, relying on predictable outcomes for conflicting scatter
> >> lanes seems unwise to me. a more robust model is to treat scatters as
> >> decomposed operations with largely unspecified internal ordering, even
> >> if the ISA defines a resolution rule for conflicts.
> >
> > That would then still ask for a (non-UNSPEC) way to encode such
> > ordering guarantee in RTL for the x86 guaranteed semantics (I suspect
> > SVE as well) and RVV semantics (it has even stronger ordering guarantee
> > than x86).
>
> so it seems to be defined for X86 but it doesn't mean we can't do it
> differently somewhere else. this deserves a longer reply. I see the
> issue with concatenation where the nominal width is the instruction
> width, but with some 'threaded-for' block in IR, the nominal width could
> be wider than the instruction width and support concatenation. because
> yes, TSO would come into play beyond one instruction, where one
> instruction had say "random" simulation order or an independence claim
> which one would presume requires unique and non-overlapping addresses.
>
> at least that is how I would intend to use scatter store and gather
> load. we are not running these memory operations LOCK synchronized. the
> guarantee here is that vector operations run in program order and each
> of the memory addresses have independence. that is what I want. I am
> doing a parallel scatter/gather to/from an array of structures and its
> just convenient to parallel sum the stride into a vector of addresses.

Yes, that's true when you are writing parallel code.  For vectorization we
have to preserve the scalar program order behavior though, which is where
those "vector ISA" (as opposed to "thread ISA") scatter instruction
guarantees are very useful.

Richard.

>
> Michael.

Reply via email to