https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063
--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---

Unfortunately, Intel Haswell/Skylake implement BEXTR as 2 uops with 2c latency. Presumably those uops are a shift + bzhi, so 1p06 + 1p15 would explain Agner Fog's experimental result of 2p0156 for BEXTR, with 0.5c throughput.

On AMD Excavator/Ryzen, it's 1 uop with 1c latency. On Steamroller and earlier, it's 2 uops but 1c latency. (I assume that's latency from the non-control input to the output. So maybe one of the uops pre-processes the control input, otherwise you'd expect 2c latency from either operand.) Ryzen dropped support for AMD TBM, so only Excavator (bdver4) has the 1-uop bextr imm16 form, which avoids the need for a mov reg,imm32 to set up the control operand. But mov-imm + bextr can still be a win on Ryzen: lower latency than RORX+AND.

BMI2 RORX is single-uop on all CPUs that support it. If we already need a 2nd uop to mask anyway, we can use RORX + AND-immediate to duplicate the functionality and performance of BEXTR-immediate, with smaller code-size if the AND mask fits in an imm8. (5+5 bytes vs. 6+3, or 6+4 if the AND needs a REX prefix.)

Without an immediate-source BEXTR (like AMD TBM has/had), the only advantage mov-immediate + bextr has (on Intel) over mov-reg + shift + and is that it can deal with wide bitfields using a count instead of an immediate AND mask (especially a mask that doesn't fit in 32 bits). If you can reuse the same control register in a loop, BEXTR is good-ish for copy-and-extract.

PEXT is 1 uop on Intel CPUs even though the simpler-looking BEXTR is 2. But PEXT is extremely slow on Ryzen (7 uops, 18c latency and throughput). So for 32-bit constants at least, mov r32,imm32 + PEXT to copy-and-extract is better than BEXTR on Intel.
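To make the shift + bzhi decomposition concrete, here's a minimal C reference model of what BEXTR computes (my reading of the instruction's semantics: control bits [7:0] = start, bits [15:8] = length). The two statements on the critical path correspond to the two uops hypothesized above; the exact port assignment is an assumption from Agner Fog's numbers.

```c
#include <stdint.h>

/* Reference model of BEXTR r64, r64, r64 (BMI1).  Control operand packs
 * start in bits [7:0] and length in bits [15:8].  On Haswell/Skylake the
 * instruction decodes to 2 uops, which plausibly behave like the two
 * steps below: a shift, then a BZHI-style truncation to `len` bits. */
static uint64_t bextr_model(uint64_t src, uint64_t control)
{
    unsigned start = control & 0xff;
    unsigned len   = (control >> 8) & 0xff;
    if (start > 63)
        return 0;                            /* start past operand size */
    uint64_t shifted = src >> start;         /* uop 1: shift  (p06)     */
    if (len > 63)
        return shifted;                      /* oversized len: no mask  */
    return shifted & ((1ull << len) - 1);    /* uop 2: mask   (p15)     */
}
```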
movabs imm64 is too big and can cause front-end problems (slower to read from the uop cache, if that effect from Sandybridge is still present on Haswell/Skylake), and has no advantage vs. RORX + AND unless the bitfield you're extracting is wider than 32 bits.

PEXT has 3-cycle latency, though, and can only run on port 1 on SnB-family. (All integer uops with latency > 1 are p1-only.) It's potentially good for throughput, but worse than RORX+AND for latency.

Unfortunately x86 bitfield instructions are pretty weak compared to ARM / AArch64 ubfx or PowerPC rlwinm and friends, where the bit positions are simply specified as immediates. Only AMD's immediate version of BEXTR (1 uop on Excavator) matched them. Having a bunch of different control operands for BEXTR or PEXT in registers might be usable in a loop, but that's a lot more rarely useful than immediate controls.

----

0000000000000000 <extract_skylake_hand_optimized>:
   0:	c4 e3 fb f0 c7 2a	rorx   $0x2a,%rdi,%rax   # $(64-22)
   6:	c4 e3 fb f0 d7 35	rorx   $0x35,%rdi,%rdx   # $(64-11)
   c:	83 e7 3f         	and    $0x3f,%edi
   f:	83 e0 3f         	and    $0x3f,%eax
  12:	83 e2 3f         	and    $0x3f,%edx
  15:	01 f8            	add    %edi,%eax   # 32-bit operand-size because we can prove it can't overflow
  17:	01 d0            	add    %edx,%eax   # missed optimization in both gcc's versions.
  19:	c3               	retq

Not counting the ret, this is 7 uops for Skylake and Ryzen. **I'm pretty sure this is our best bet for -march=skylake, and for tune=generic -mbmi2.**

The BEXTR intrinsics version is 9 uops for SKL, 7 for Ryzen, but is 2 bytes larger. (Not counting the savings from avoiding a REX prefix on the ADD instructions; that missed optimization applies equally to both.) OTOH, the critical-path latency for BEXTR on Ryzen is better by 1 cycle, so we could still consider it for -march=znver1, or for tune=generic -mbmi without BMI2.

The legacy mov+shr+and version is 10 uops because gcc wasted a `mov %rdi,%rax` instruction; it *should* be 9 uops for all normal CPUs.
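For reference, here's a plausible C source for extract_skylake_hand_optimized, reconstructed from the disassembly (the function name and exact field layout are an assumption): sum three 6-bit fields at bit offsets 0, 11 and 22. RORX $(64-22) rotates a field down to bit 0 without needing a mov to copy the input first, then AND $0x3f masks it.

```c
#include <stdint.h>

/* Hypothetical source matching the hand-optimized asm above:
 * three 6-bit bitfields at bit offsets 0, 11 and 22, summed. */
static unsigned extract3(uint64_t x)
{
    unsigned a = (unsigned)x         & 0x3f;   /* field at bit 0  */
    unsigned b = (unsigned)(x >> 11) & 0x3f;   /* field at bit 11 */
    unsigned c = (unsigned)(x >> 22) & 0x3f;   /* field at bit 22 */
    /* 32-bit adds are fine: the sum of three 6-bit values
     * can't overflow, which is the missed optimization noted above. */
    return a + b + c;
}
```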
----

With only BMI1 but not BMI2 enabled, we should probably use the mov-imm + BEXTR version. It's not worse than the mov+shr+and version on SnB-family or bd/zn, and it's better on some AMD. And it's probably smaller code-size. And in future, if Intel designs CPUs that can handle BEXTR as a single uop with 1c latency, mov+bextr will become good-ish everywhere.

For code-size, BEXTR has a definite advantage for bitfields wider than 1 byte, because AND $imm32, %r32 is 6 bytes long instead of 3.
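The PEXT copy-and-extract option mentioned earlier can also be modeled in portable C, for anyone checking the equivalence. This is a reference model of what I understand PEXT to compute (gather the src bits selected by mask into contiguous low bits), not the instruction itself; with a contiguous mask it degenerates into exactly the shift+and extraction discussed above.

```c
#include <stdint.h>

/* Reference model of PEXT r64, r64, r64 (BMI2): for each set bit in
 * mask (from LSB up), copy the corresponding bit of src into the next
 * free low bit of the result.  A real PEXT is 1 uop / 3c on Intel
 * SnB-family but microcoded and very slow on Ryzen. */
static uint64_t pext_model(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    unsigned k = 0;
    for (unsigned i = 0; i < 64; i++)
        if ((mask >> i) & 1)
            result |= ((src >> i) & 1) << k++;
    return result;
}
```

With a contiguous mask like 0xFF00, pext_model(src, 0xFF00) is just (src >> 8) & 0xFF, which is why mov r32,imm32 + PEXT can substitute for BEXTR on Intel.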