https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103
Bug ID: 91103
Summary: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

GCC 9.1 and current trunk aren't good at extracting high elements, whether it's with GNU C native vector syntax or when auto-vectorizing something that ends with the result in the high element.

Using VALIGND we can get any element with one immediate instruction, but it's better to use AVX2 VPERMPD(immediate) when possible.  Or inside loops, VPERMPS(vector) or VPERMT2PS(vector).  Or of course vextractf32x4 if possible (element at the bottom of a 128-bit lane).  Even with only AVX2 available, VPERMPD(immediate) for high elements in __m256 and __m256d vectors is still a big win.

#include <immintrin.h>
float elem12(__m512 v) { return v[12]; }
float elem15(__m512 v) { return v[15]; }

gcc -Ofast -march=skylake-avx512    https://godbolt.org/z/241r8p

elem15:
        vextractf32x8   ymm0, zmm0, 0x1
        vextractf128    xmm0, ymm0, 0x1     # elem12 ends here, after these 2 insns
        vshufps         xmm0, xmm0, xmm0, 255
        ret             # no vzeroupper, I guess because the caller must have __m512 vars too (recent optimization)

But AVX512F has vextractf32x4 to extract a 128-bit lane directly from a ZMM, which removes the need for the AVX2 vextractf128 step.  That's what clang does.

Obviously inside a loop it would be *much* better to use a single lane-crossing VPERMPS and also avoid the vshufps: Intel Skylake easily bottlenecks on shuffle throughput.  We'd need a 15 in an XMM register as a control vector, but loading it would be off the latency critical path.  (If we needed the scalar zero-extended instead of garbage in the high elements, we could VPERMI2PS or VPERMT2PS with a zeroed vector and a shuffle control.)
---

If the element we want is an even element in the low 256 bits, we can get it with a VPERMPD-immediate.  GCC does this instead:

elem6(float __vector(16)):          # GCC 10 trunk
        vextractf128    xmm0, ymm0, 0x1
        vunpckhps       xmm0, xmm0, xmm0
        ret

It should be the single AVX2 instruction

        vpermpd ymm0, ymm0, 3

This bug also applies to __m256, not just __m512.

https://www.felixcloutier.com/x86/vpermpd
VPERMPD is a 64-bit-granularity lane-crossing shuffle.  The AVX512F immediate version reuses the immediate for another 256-bit-wide shuffle in the upper half; only the vector-control version can bring an element from the top half of a ZMM down to the bottom.  But if we're going to use a vector control, we might as well use VPERMPS.  For the integer version of this bug, use VPERMQ.

------

But we can do even better by using an integer VALIGND (AVX512F) shuffle on FP data.  There unfortunately isn't an FP flavour of VALIGND, just integer.  AFAIK, Skylake-AVX512 still has no bypass-delay penalty for integer shuffles between FP math instructions, i.e. the shuffle unit is connected to both the FP and integer forwarding networks.  Intel's optimization manual for Skylake (client) has a bypass-latency table that shows 0 extra latency cycles for SHUF/5/1,3 reading from anything, or anything reading from it.

https://www.felixcloutier.com/x86/valignd:valignq
VALIGND is a 4- or 8-byte-granularity version of palignr, except that it's lane-crossing, so the 256- and 512-bit versions are actually useful.  The immediate shift count can thus bring *any* element down to the bottom.  (Using the same input twice makes it a rotate.)  VALIGND is good on Knights Landing, too: unlike most 2-input shuffles, it has 1 per clock throughput.

For *any* compile-time-constant index, we can always compile v[i] to this:

extract15:
        valignd zmm0, zmm0, zmm0, 15    # I think this is right.
        ret

The only downside I'm aware of is that some future AVX512 CPU might not run VALIGND as efficiently as SKX and KNL.
----

For vector elements narrower than 32 bits, we may need 2 shuffles even if we consider using a shuffle-control vector.  On Skylake-AVX512, AVX512BW vpermw will get the job done but costs 2 shuffle uops.  On Cannon Lake (and presumably other future Intel CPUs), it and AVX512VBMI vpermb are only 1 uop, so it's definitely worth creating a shuffle-control vector if it can be reused.

Also worth considering instead of 2 shuffles: an *unaligned* spill/reload like ICC does for GNU C native vector indexing.  Store-forwarding latency is only 6 or 7 cycles, I think, and it avoids any port-5 pressure.  Not generally a good choice IMO when we can get the job done in one shuffle, but worth considering if we need multiple elements.  If the function doesn't need the stack aligned, an unaligned spill is generally cheapish, and store forwarding still works efficiently.

Ice Lake is supposed to introduce a 2nd shuffle unit; that should help a lot to reduce shuffle-port throughput bottlenecks.  So we don't want to get too aggressive tuning for store/reload, I don't think.

----

Semi-related, for integer shuffles:

long long integer_extract(__m256i v) {
    return v[3];
}

uses a longer AVX512VL instruction instead of a shorter AVX2 vextracti128:

integer_extract(long long __vector(4)):
        vextracti64x2   xmm0, ymm0, 0x1     # should be vextracti128
        vpextrq rax, xmm0, 1
        ret

Or store/reload instead of these 2 shuffles, depending on context.