https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #1)
> That is, it was supposed to end up using pslldq

I think you mean PSRLDQ.  Byte zero is the right-most when drawn in a way that makes bit/byte shift directions all match up with the diagram, the opposite of array-initializer order.

PSRLDQ is sub-optimal without AVX, since it needs a MOVDQA to copy-and-shuffle.  For integer shuffles, PSHUFD is what you want (and PSHUFLW for a final step with 16-bit elements).  None of the x86 copy-and-shuffle instructions can zero an element; they can only copy from one of the source elements.

PSHUFD can easily swap the high and low halves, but maybe some other targets can't do that as efficiently as just duplicating the high half into the low half or something.  (I only really know x86 SIMD.)

Ideally we could tell the back-end that the high-half values are actually don't-care and can be anything, so it can choose the best shuffles to extract the high half for each successive narrowing step.  I think clang does this: its shuffle-optimizer makes different choices in a function that returns __m128 vs. returning the low element of a vector as a scalar float (e.g. for a hand-coded horizontal sum using Intel _mm_ intrinsics).

-------

To get truly optimal code, the backend needs more choice in what order to do the shuffles.  e.g. with SSE3, the optimal sequence for an FP hsum is probably:

    movshdup  %xmm0, %xmm1   # DCBA -> DDBB
    addps     %xmm1, %xmm0   # D+D C+D B+B A+B  (elements 0 and 2 are valuable)
    movhlps   %xmm0, %xmm1   # do this second so a dependency on xmm1 can't be a problem
    addss     %xmm1, %xmm0   # addps saves 1 byte of code

We could use MOVHLPS first to maintain the usual successive-narrowing pattern, but (to avoid a MOVAPS) only if we have a scratch register that was ready earlier in xmm0's dep chain (so introducing a dependency on it can't hurt the critical path).
It also needs to be holding FP values that won't cause a slowdown from denormals in the high two elements for the first addps.  (NaN / infinity are ok for all current SSE hardware.)  Adding a number to itself is safe enough, so a shuffle that duplicates the high-half values is good.

However, when auto-vectorizing an FP reduction with -fassociative-math but without the rest of -ffast-math, I guess we need to avoid spurious exceptions from values in elements that are otherwise don't-care.  Swapping high/low halves is always safe, e.g. using movaps + shufps for both steps:

    DCBA -> BADC   and get  D+B C+A B+D A+C
    WXYZ -> XWZY   and get  D+B+C+A repeated four times

With AVX, the MOVAPS instructions go away, but vshufps's immediate byte still makes it 1 byte larger than vmovhlps or vunpckhpd.  x86 has very few FP copy-and-shuffle instructions, so it's a trickier problem than for integer code, where you can always just use PSHUFD unless tuning for SlowShuffle CPUs like first-gen Core 2 or K8.

With AVX, VPERMILPS with an immediate operand is pointless, except I guess with a memory source.  It always needs a 3-byte VEX prefix, but VSHUFPS can use a 2-byte VEX prefix and do the same copy + in-lane shuffle just as fast on all CPUs (using the same register as both sources), except KNL, where single-source shuffles are faster.  Moreover, 3-operand AVX makes it possible to use VMOVHLPS or VUNPCKHPD as a copy+shuffle.

------

> demote this to first add the two halves and then continue with the reduction
> scheme.

Sounds good.  With x86 AVX512, it takes two successive narrowing steps to get down to 128-bit vectors.  Narrowing to 256b allows shorter instructions (VEX instead of EVEX).  Even with 512b or 256b execution units, narrowing to 128b is about as efficient as possible.  Doing the lower-latency in-lane shuffles first would let more instructions retire earlier, but only by a couple cycles.
I don't think it makes any sense to special-case for AVX512 and do in-lane 256b ops ending with vextractf128, especially since gcc's current design does most of the reduction strategy in target-independent code.

IDK if any AVX512 CPU will ever handle wide vectors as two or four 128b uops.  It seems unlikely, since 512b lane-crossing shuffles would have to decode to *many* uops, especially stuff like VPERMI2PS (which selects elements from two 512b sources).  Still, AMD seems to like the half-width idea, so I wouldn't be surprised to see an AVX512 CPU with 256b execution units.  Even 128b is plausible (especially for low-power / small-core CPUs like Jaguar or Silvermont).