On Thu, Mar 19, 2026 at 5:14 AM Michael Clark <[email protected]> wrote:
>
> On 3/19/26 03:28, Richard Biener via Gcc wrote:
> > Yes, but for example on x86 the memory order for scatters is defined to
> > be left-to-right.
>
> which side is lane zero on? left or right? ;-)
>
> on little-endian I'd expect right-to-left if lane 0 were on the right.
>
> there’s a subtle interaction between TSO and vector scatters. although a
> vector instruction has a logical width (e.g. 512-bit), in practice it
> decomposes into architectural subgroups (often ~128-bit) and further
> into micro-architectural quanta (e.g. 256×2 execution, banked caches,
> store buffers). execution and visibility therefore occur in smaller,
> partially parallel units of execution, and different threads may observe
> partial progress depending on micro-architectural effects.
>
> TSO constrains the ordering of stores as they become globally visible,
> but it does not inherently require any ordering within a single vector
> instruction. intra-vector lane ordering is orthogonal to TSO.
>
> however, x86 defines a lane order for scatters (e.g. to determine which
> store wins on conflicts), introducing an additional ordering constraint
> at the ISA level that is not strictly required by TSO itself. while this
> does not imply full serialization or per-lane global visibility, it does
> constrain the abstract behavior more than a fully relaxed model would.
It's probably modeled so you can vectorize
for (int i = 0; i < n; ++i)
a[b[i]] = 42;
because there's no way you can codify a runtime check to ensure
b[i] do not produce the same index. AVX512F (which introduces scatters)
is complemented by AVX512CD to be able to also handle
for (int i = 0; i < n; ++i)
a[b[i]]++;
because this is not only WAW but RAW conflicts which makes splitting
the iteration domain upon conflicts necessary.
That said - the *scatter* optabs assume naive vectorization of the
first loop works, even when b[] = { 0, 1, 2, 0 }, so if GCN is not able
to guarantee this their "vector address store" are not scatters in
terms of what GCC assumes. The documentation for the optabs
does not mention this constraint.
> this can be viewed as a potential performance cliff: a very strong
> ordering constraint applied within a single instruction, despite being
> orthogonal to the guarantees provided by TSO. I think it is too strong,
True, this is one thing that can make scatters slow, in particular on
architectures like GCN.
> in practice, implementations still split, merge, and overlap work, and
> observable behavior is shaped by store buffering, cache banking, and
> other micro-architectural details. the effective execution width is
> therefore dynamic and may differ from the nominal vector width.
>
> as a result, relying on predictable outcomes for conflicting scatter
> lanes seems unwise to me. a more robust model is to treat scatters as
> decomposed operations with largely unspecified internal ordering, even
> if the ISA defines a resolution rule for conflicts.
That would then still ask for a (non-UNSPEC) way to encode such
ordering guarantee in RTL for the x86 guaranteed semantics (I suspect
SVE as well) and RVV semantics (it has even stronger ordering guarantee
than x86).
Richard.
> Michael.