https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

            Bug ID: 91103
           Summary: AVX512 vector element extract uses more than 1 shuffle
                    instruction; VALIGND can grab any element
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

GCC 9.1 and current trunk aren't good at extracting high elements, whether it's
with GNU C native vector syntax or when auto-vectorizing something that ends
with the result in the high element.

Using VALIGND we can get any element with one immediate-controlled instruction,
but it's better to use AVX2 VPERMPD (immediate) when possible.  Or inside loops,
VPERMPS (vector), or VPERMT2PS (vector).  Or of course vextractf32x4 if possible
(element at the bottom of a 128-bit lane).

Or with only AVX2 available, VPERMPD(immediate) for high elements in __m256 and
__m256d vectors is still a big win.
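
A hand-written sketch (function name is just for illustration, not compiler
output) of what that single-shuffle extract would look like for the top element
of a __m256d:

```asm
elem3_m256d:                        # double elem3(__m256d v) { return v[3]; }
        vpermpd ymm0, ymm0, 3       # AVX2: bring qword 3 down to element 0
        ret                         # scalar double in xmm0; upper bits garbage
                                    # (a real compiler might add vzeroupper)
```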

#include <immintrin.h>
float elem12(__m512 v) {  return v[12]; }
float elem15(__m512 v) {  return v[15]; }

gcc -Ofast -march=skylake-avx512
https://godbolt.org/z/241r8p

elem15:
        vextractf32x8   ymm0, zmm0, 0x1
        vextractf128    xmm0, ymm0, 0x1    # elem12 ends here, after these 2 insns
        vshufps xmm0, xmm0, xmm0, 255
        # no vzeroupper, I guess because the caller must have __m512 vars too (recent optimization)
        ret

But AVX512F has vextractf32x4 to extract any 128-bit lane directly, which
avoids the need for the separate vextractf128 step.  That's what clang does.
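
A hand-written sketch of that shorter sequence for elem15 (what we'd want the
compiler to emit, not actual compiler output):

```asm
elem15:
        vextractf32x4   xmm0, zmm0, 3           # AVX512F: grab the top 128-bit lane in one step
        vshufps         xmm0, xmm0, xmm0, 255   # broadcast that lane's element 3 to the bottom
        ret                                     # scalar result in xmm0[31:0]
```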

Obviously inside a loop it would be *much* better to use a single lane-crossing
VPERMPS to also avoid the shufps.  Intel Skylake easily bottlenecks on shuffle
throughput.  We'd need a 15 in an XMM register as a control vector, but loading
it would be off the latency critical path.  (If we needed the scalar
zero-extended instead of garbage in high elements, we could VPERMI2PS or
VPERMT2PS with a zeroed vector and a shuffle-control.)
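
A hand-written sketch of that in-loop shape (register choices are illustrative;
the point is that the control-vector setup hoists out of the loop and off the
critical path):

```asm
        mov     eax, 15
        vmovd   xmm1, eax                # shuffle control: dword index 15 in the low element
.loop:
        # ... loop body producing a result in zmm0 ...
        vpermps zmm2, zmm1, zmm0         # one lane-crossing shuffle: element 15 -> bottom
        # ... use the scalar in xmm2; high elements are garbage ...
```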

---

If the element we want is an even element in the low 256 bits, we can get it
with a VPERMPD-immediate.  GCC does this:

elem6(float __vector(16)):         # GCC 10 trunk
        vextractf128    xmm0, ymm0, 0x1
        vunpckhps       xmm0, xmm0, xmm0
        ret

Instead it should be AVX2   vpermpd ymm0, ymm0, 3
This bug also applies to __m256, not just __m512.

https://www.felixcloutier.com/x86/vpermpd
VPERMPD is a 64-bit granularity lane-crossing shuffle.  The AVX512F immediate
version reuses the immediate for another 256-bit wide shuffle in the upper
half; only the vector-control version can bring an element from the top half of
a ZMM down to the bottom.  But if we're going to use a vector control, we might
as well use VPERMPS.

For the integer version of this bug, use VPERMQ.

------

But we can do even better by using an integer VALIGND (AVX512F) shuffle on FP
data.  There unfortunately isn't an FP flavour of VALIGND, just integer.

AFAIK, Skylake-AVX512 still has no bypass-delay penalty for integer shuffles
between FP math instructions, i.e. the shuffle unit is connected to both FP and
integer forwarding networks.  Intel's optimization manual for Skylake (client)
has a bypass-latency table that shows 0 extra latency cycles for the SHUF/5/1,3
row reading from anything, or anything reading from it.

https://www.felixcloutier.com/x86/valignd:valignq  It's a 4 or 8-byte
granularity version of palignr, except that it's lane-crossing so the 256 and
512-bit versions are actually useful.  The immediate shift count can thus bring
*any* element down to the bottom.  (Using the same input twice makes it a
rotate).

VALIGND is good on Knights Landing, too: unlike most 2-input shuffles, it has
1 per clock throughput.

For *any* compile-time-constant index, we can always compile v[i] to this:

extract15:
   valignd    zmm0, zmm0, zmm0, 15       # I think this is right.
   ret

The only downside I'm aware of is that some future AVX512 CPU might not run
VALIGND as efficiently as SKX and KNL.


----

For vector elements narrower than 32 bits, we may need 2 shuffles even if we
consider using a shuffle-control vector.  On Skylake-AVX512, AVX512BW vpermw
will get the job done, but costs 2 shuffle uops.  On Cannon Lake (and presumably
other future Intel CPUs), it and AVX512VBMI vpermb are only 1 uop, so it's
definitely worth creating a shuffle-control vector if it can be reused.
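
A hand-written sketch of a word-element extract with vpermw (the control-vector
setup is the reusable part worth hoisting):

```asm
        mov     eax, 31
        vmovd   xmm1, eax            # shuffle control: word index 31 in the low word
extract_word31:
        vpermw  zmm0, zmm1, zmm0     # AVX512BW: 2 shuffle uops on SKX, 1 on Cannon Lake
        vmovd   eax, xmm0            # low 16 bits of eax hold element 31 (upper bits garbage)
```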


Also worth considering instead of 2 shuffles: *unaligned* spill / reload like
ICC does for GNU C native vector indexing.  Store-forwarding latency is only 6
or 7 cycles I think, and it avoids any port 5 pressure.  Not generally a good
choice IMO when we can get the job done in one shuffle, but worth considering
if we need multiple elements.  If the function doesn't need the stack aligned,
an unaligned spill is generally cheapish, and store-forwarding still works
efficiently.
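
A hand-written sketch of the spill/reload approach for grabbing two elements
(assumes a leaf function on the x86-64 SysV ABI, so the 128-byte red zone below
rsp can hold the 64-byte store without adjusting the stack pointer):

```asm
extract_two:
        vmovups [rsp-64], zmm0               # unaligned 64-byte spill into the red zone
        vmovss  xmm0, dword ptr [rsp-16]     # reload element 12 (offset 48); ~6-7c store-forwarding
        vmovss  xmm1, dword ptr [rsp-4]      # reload element 15 (offset 60); zero port-5 uops
        ret
```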

Ice Lake is supposed to introduce a 2nd shuffle unit; that should help a lot to
reduce shuffle-port throughput bottlenecks.  So we don't want to tune too
aggressively for store/reload, I don't think.

----

Semi-related for integer shuffles:

long long integer_extract(__m256i v) {
    return v[3];
}

uses a longer AVX512VL instruction instead of the shorter AVX2 vextracti128:

integer_extract(long long __vector(4)):
        vextracti64x2   xmm0, ymm0, 0x1      # should be vextracti128
        vpextrq rax, xmm0, 1
        ret

Or store/reload instead of these 2 shuffles, depending on context.
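
Two hand-written sketches: the minimal fix keeps the same 2-shuffle shape with
the shorter encoding, while the vpermq version trades the 2-uop vpextrq for one
lane-crossing shuffle plus a cheap vmovq:

```asm
integer_extract:                     # same shape, shorter VEX encoding
        vextracti128    xmm0, ymm0, 1
        vpextrq         rax, xmm0, 1
        ret

integer_extract_vpermq:              # one shuffle uop instead of two
        vpermq  ymm0, ymm0, 3        # AVX2: bring qword 3 to the bottom
        vmovq   rax, xmm0
        ret
```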
