https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #1)
> That is, it was supposed to end up using pslldq

I think you mean PSRLDQ.  Byte zero is the right-most element when vectors are
drawn so that bit- and byte-shift directions all match up with the diagram,
which is the opposite of array-initializer order.


PSRLDQ is sub-optimal without AVX: it needs an extra MOVDQA because it can't
copy-and-shuffle.  For integer shuffles, PSHUFD is what you want (with PSHUFLW
for a final step with 16-bit elements).
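
For example, a hand-written 16-bit horizontal sum with intrinsics (a minimal
sketch, not from this bug's testcase; the function name is made up) can use
PSHUFD for the first two narrowing steps and PSHUFLW for the last one, with no
MOVDQA needed even without AVX:

  #include <immintrin.h>
  #include <stdint.h>

  static inline uint16_t hsum_epi16(__m128i v)
  {
      /* PSHUFD: swap 64-bit halves, then swap 32-bit pairs */
      v = _mm_add_epi16(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
      v = _mm_add_epi16(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
      /* PSHUFLW for the final 16-bit step */
      v = _mm_add_epi16(v, _mm_shufflelo_epi16(v, _MM_SHUFFLE(2, 3, 0, 1)));
      return (uint16_t)_mm_cvtsi128_si32(v);   /* sum in the low 16 bits */
  }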

None of the x86 copy-and-shuffle instructions can zero an element, only copy
from one of the source elements.  PSHUFD can easily swap the high and low
halves, but maybe some other targets can't do that as efficiently as just
duplicating the high half into the low half or something.  (I only really know
x86 SIMD).

Ideally we could tell the back-end that the high half values are actually
don't-care and can be anything, so it can choose the best shuffles to extract
the high half for each successive narrowing step.

I think clang does this: its shuffle-optimizer makes different choices in a
function that returns __m128 vs. returning the low element of a vector as a
scalar float (e.g. for a hand-coded horizontal sum written with Intel _mm_
intrinsics).
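
Something like this pair illustrates the distinction (function names made up;
a sketch, not clang's actual output): the optimizer has to preserve all four
lanes of the first function's return value, but only lane 0 of the second's,
so it has more freedom in which copy-and-shuffles it picks there:

  #include <immintrin.h>

  __m128 hsum_ps_vec(__m128 v)        /* all four lanes are live to the caller */
  {
      __m128 hi = _mm_movehl_ps(v, v);                 /* DCBA -> DCDC */
      __m128 sums = _mm_add_ps(v, hi);                 /* x x B+D A+C */
      hi = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 1, 1, 1));  /* broadcast lane 1 */
      return _mm_add_ss(sums, hi);                     /* lane 0 = A+B+C+D */
  }

  float hsum_ps_scalar(__m128 v)      /* only lane 0 is live */
  {
      return _mm_cvtss_f32(hsum_ps_vec(v));
  }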

-------

To get truly optimal code, the backend needs more choice in what order to do
the shuffles.  e.g. with SSE3, the optimal sequence for FP hsum is probably

   movshdup  %xmm0, %xmm1   # DCBA -> DDBB
   addps     %xmm1, %xmm0   # D+D C+D B+B A+B  (elements 0 and 2 are valuable)
   movhlps   %xmm0, %xmm1   # do this 2nd so a dep on xmm1 can't be a problem
   addss     %xmm1, %xmm0   # addps would save 1 byte of code

We could use MOVHLPS first to maintain the usual successive-narrowing pattern,
but only if we have a scratch register (to avoid a MOVAPS) that was ready
earlier in xmm0's dep chain, so introducing a dependency on it can't hurt the
critical path.  That register also needs to hold FP values that won't cause a
denormal slowdown in the high two elements of the first addps (NaN / infinity
are fine on all current SSE hardware).  Adding a number to itself is safe
enough, so a shuffle that duplicates the high-half values is a good choice.
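
Written with intrinsics, the movshdup sequence above looks like this (a
sketch; the intent is that the compiler reuses the shuffle result's register
as the MOVHLPS destination and avoids a MOVAPS, though that isn't guaranteed):

  #include <immintrin.h>

  static inline float hsum_ps_sse3(__m128 v)
  {
      __m128 shuf = _mm_movehdup_ps(v);    /* movshdup: DCBA -> DDBB */
      __m128 sums = _mm_add_ps(v, shuf);   /* D+D C+D B+B A+B */
      shuf = _mm_movehl_ps(shuf, sums);    /* movhlps: high half of sums -> low half */
      sums = _mm_add_ss(sums, shuf);
      return _mm_cvtss_f32(sums);
  }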

However, when auto-vectorizing an FP reduction with -fassociative-math but
without the rest of -ffast-math, I guess we need to avoid spurious exceptions
from values in elements that are otherwise don't-care.  Swapping high/low
halves is always safe, e.g. using movaps + shufps for both steps:

  DCBA -> BADC and add to get D+B C+A B+D A+C   (call that result WXYZ)
  WXYZ -> XWZY and add to get D+B+C+A repeated four times
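
With intrinsics, that pattern is just two SHUFPS immediates (a minimal sketch;
every lane always holds a real input or a partial sum, so no don't-care data
ever feeds an addps):

  #include <immintrin.h>

  static inline __m128 hsum_ps_all_lanes(__m128 v)
  {
      __m128 t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2));  /* DCBA -> BADC */
      v = _mm_add_ps(v, t);                                      /* D+B C+A B+D A+C */
      t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));         /* WXYZ -> XWZY */
      return _mm_add_ps(v, t);                                   /* full sum in all 4 lanes */
  }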

With AVX, the MOVAPS instructions go away, but vshufps's immediate byte still
makes it 1 byte larger than vmovhlps or vunpckhpd.

x86 has very few FP copy-and-shuffle instructions, so it's a trickier problem
than for integer code where you can always just use PSHUFD unless tuning for
SlowShuffle CPUs like first-gen Core2, or K8.

With AVX, VPERMILPS with an immediate operand is pointless, except I guess with
a memory source.  It always needs a 3-byte VEX, but VSHUFPS can use a 2-byte
VEX prefix and do the same copy+in-lane-shuffle just as fast on all CPUs (using
the same register as both sources), except KNL where single-source shuffles are
faster.  Moreover, 3-operand AVX makes it possible to use VMOVHLPS or VUNPCKHPD
as a copy+shuffle.

------


> demote this to first add the two halves and then continue with the reduction 
> scheme.

Sounds good.

With x86 AVX512, it takes two successive narrowing steps to get down to 128bit
vectors.  Narrowing to 256b allows shorter instructions (VEX instead of EVEX).
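
For example (a sketch assuming AVX512DQ for the 256b extract; with plain
AVX512F you'd use vextractf32x4 or a vextractf64x4 + cast instead):

  #include <immintrin.h>

  static inline float hsum_ps_512(__m512 v)
  {
      __m256 lo256 = _mm512_castps512_ps256(v);
      __m256 hi256 = _mm512_extractf32x8_ps(v, 1);      /* 512b -> 256b (EVEX) */
      __m256 sum256 = _mm256_add_ps(lo256, hi256);

      __m128 lo128 = _mm256_castps256_ps128(sum256);
      __m128 hi128 = _mm256_extractf128_ps(sum256, 1);  /* 256b -> 128b (VEX) */
      __m128 sum128 = _mm_add_ps(lo128, hi128);

      /* then the usual 128b reduction, e.g. the movshdup sequence above */
      __m128 shuf = _mm_movehdup_ps(sum128);
      __m128 sums = _mm_add_ps(sum128, shuf);
      shuf = _mm_movehl_ps(shuf, sums);
      return _mm_cvtss_f32(_mm_add_ss(sums, shuf));
  }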

Even with 512b or 256b execution units, narrowing to 128b is about as efficient
as possible.  Doing the lower-latency in-lane shuffles first would let more
instructions retire earlier, but only by a couple cycles.  I don't think it
makes any sense to special-case for AVX512 and do in-lane 256b ops ending with
vextractf128, especially since gcc's current design does most of the reduction
strategy in target-independent code.

IDK if any AVX512 CPU will ever handle wide vectors as two or four 128b uops. 
It seems unlikely since 512b lane-crossing shuffles would have to decode to
*many* uops, especially stuff like VPERMI2PS (select elements from two 512b
sources).  Still, AMD seems to like the half-width idea, so I wouldn't be
surprised to see an AVX512 CPU with 256b execution units.  Even 128b is
plausible (especially for low-power / small-core CPUs like Jaguar or
Silvermont).
