https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063
--- Comment #1 from Peter Cordes ---
Unfortunately Intel Haswell/Skylake implement BEXTR as 2 uops with 2c latency.
Presumably those uops are a shift + bzhi, so 1p06 + 1p15 would explain Agner
Fog's experimental result of 2p0156 for BEXTR, with 0.5c throughput.
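For reference, BEXTR's architectural operation is a right-shift followed by a
mask-to-length, and the masking step is exactly what BZHI computes, which fits
that 2-uop split. A minimal C model (ignoring the saturation of out-of-range
start/len values):

  /* control = start | (len << 8); result = bits [start+len-1 : start].
     The shift would be one uop; masking to `len` bits is what BZHI does. */
  unsigned long long bextr_model(unsigned long long src, unsigned ctrl)
  {
      unsigned start = ctrl & 0xff;
      unsigned len   = (ctrl >> 8) & 0xff;
      return (src >> start) & ((1ULL << len) - 1);
  }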
On AMD Excavator/Ryzen, it's 1 uop with 1c latency. On Steamroller and
earlier, it's 2 uops but 1c latency. (I assume that's latency from the
non-control input to the output. So maybe one of the uops pre-processes the
control input, otherwise you'd expect 2c latency from either operand.) Ryzen
dropped support for AMD TBM, so only Excavator (bdver4) has the 1-uop BEXTR
imm16 form, which avoids needing a mov reg,imm32 to set up the control
operand. But mov-imm + bextr can still be a win on Ryzen, with lower latency
than RORX+AND.
BMI2 RORX is single-uop on all CPUs that support it. If we already need a 2nd
uop to mask anyway, we can use RORX + AND-immediate to duplicate the
functionality and performance of BEXTR-immediate, with smaller code size if
the AND mask fits in an imm8 (5+5 bytes for mov-imm + bextr vs. 6+3 for
rorx + and, or 6+4 if the AND needs a REX prefix).
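As a sketch, using the 6-bit field at bit 42 from the listing below (compile
with -mbmi -mbmi2):

  #include <x86intrin.h>
  #include <stdint.h>

  /* RORX + AND-imm8: 6 + 3 = 9 bytes, 2 uops on every CPU with BMI2. */
  unsigned ext_rorx(uint64_t x)
  {
      return _rorx_u64(x, 42) & 0x3f;
  }

  /* mov-imm + BEXTR: 5 + 5 = 10 bytes; the bextr itself is 2 uops on
     Haswell/Skylake but 1 uop on Excavator/Ryzen, plus the mov. */
  unsigned ext_bextr(uint64_t x)
  {
      return (unsigned) __bextr_u64(x, 42 | (6 << 8));
  }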
Without an immediate-source BEXTR (like AMD TBM has/had), the only advantage
mov-immediate + bextr has (on Intel) over mov-reg + shift + and is that it can
deal with wide bitfields using a count instead of an immediate AND mask
(especially useful when the mask doesn't fit in 32 bits).
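E.g. for a hypothetical 40-bit field at bit 5, the BEXTR control still fits in
a 5-byte mov $imm32, while the plain shift + AND needs a 10-byte movabs for
the mask:

  #include <x86intrin.h>
  #include <stdint.h>

  uint64_t wide_field(uint64_t x)     /* example field: start=5, len=40 */
  {
      return __bextr_u64(x, 5 | (40 << 8));  /* mov $0x2805,%eax ; bextr %rax,%rdi,%rax */
      /* vs. (x >> 5) & 0xffffffffffULL: needs movabs $0xffffffffff,%reg */
  }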
If you can reuse the same control-register in a loop, BEXTR is good-ish for
copy-and-extract.
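Something like this, where the control (a hypothetical 6-bit field at bit 11)
is loop-invariant and stays in one register:

  #include <x86intrin.h>
  #include <stddef.h>

  /* Copy-and-extract in a loop: the mov-imm setup is paid once, and BEXTR's
     non-destructive destination avoids a mov per element. Needs -mbmi. */
  void extract_all(unsigned *dst, const unsigned *src, size_t n)
  {
      unsigned ctrl = 11 | (6 << 8);
      for (size_t i = 0; i < n; i++)
          dst[i] = __bextr_u32(src[i], ctrl);
  }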
PEXT is 1 uop on Intel CPUs even though the simpler-looking BEXTR is 2. But
PEXT is extremely slow on Ryzen (7 uops, 18c latency and throughput). So for 32-bit
constants at least, mov r32,imm32 + PEXT to copy-and-extract is better than
BEXTR on Intel. movabs imm64 is too big and can cause front-end problems
(slower to read from the uop cache, if that effect from Sandybridge is still
present on Haswell/Skylake), and has no advantage vs. RORX + AND unless the
bitfield you're extracting is wider than 32 bits.
PEXT has 3 cycle latency, though, and can only run on port 1 on SnB-family.
(All integer uops with latency > 1 are p1-only). It's potentially good for
throughput, but worse than RORX+AND for latency.
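A sketch of that copy-and-extract with PEXT (needs -mbmi2; same hypothetical
field at bit 11, width 6, so the mask is just the field's contiguous bits):

  #include <x86intrin.h>

  unsigned ext_pext(unsigned x)
  {
      return _pext_u32(x, 0x3fu << 11);  /* mov $0x1f800,%eax ; pext %eax,%edi,%eax */
  }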
Unfortunately x86 bitfield instructions are pretty weak compared to ARM /
AArch64 ubfx or PowerPC rlwinm and friends, where the bit-positions are simply
specified as immediates. Only AMD's immediate version of BEXTR (1 uop on
Excavator) matched them. Having a bunch of different control operands for
BEXTR or PEXT in registers might be usable in a loop, but that's far less
often useful than immediate controls.
:
   0:   c4 e3 fb f0 c7 2a       rorx   $0x2a,%rdi,%rax   # $(64-22)
   6:   c4 e3 fb f0 d7 35       rorx   $0x35,%rdi,%rdx   # $(64-11)
   c:   83 e7 3f                and    $0x3f,%edi
   f:   83 e0 3f                and    $0x3f,%eax
  12:   83 e2 3f                and    $0x3f,%edx
  15:   01 f8                   add    %edi,%eax   # 32-bit operand-size because we can prove it can't overflow
  17:   01 d0                   add    %edx,%eax   # missed optimization in both gcc's versions
  19:   c3                      retq
Not counting the ret, this is 7 uops for Skylake and Ryzen. **I'm pretty sure
this is our best bet for -march=skylake, and for tune=generic -mbmi2**
The BEXTR intrinsics version is 9 uops for SKL, 7 for Ryzen, but is 2 bytes
larger. (Not counting the savings from avoiding a REX prefix on the ADD
instructions; that missed optimization applies equally to both.) OTOH, the
critical path latency for BEXTR on Ryzen is better by 1 cycle, so we could
still consider it for -march=znver1. Or for tune=generic -mbmi without BMI2.
The legacy mov+shr+and version is 10 uops because gcc wasted a `mov %rdi,%rax`
instruction; it *should* be 9 uops for all normal CPUs.
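A sketch of the per-field legacy pattern (same example field at bit 42), where
the mov exists only because SHR is destructive:

  #include <stdint.h>

  unsigned ext_legacy(uint64_t x)
  {
      /* mov %rdi,%rax ; shr $42,%rax ; and $0x3f,%eax  -- 3 uops, no BMI */
      return (unsigned)(x >> 42) & 0x3f;
  }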
---
With only BMI1 but not BMI2 enabled, we should probably use the mov-imm + BEXTR
version. It's not worse than the mov+shr+and version on SnB-family or
Bulldozer-family/Zen, it's better on some AMD CPUs, and it's probably smaller
code-size as well.
And in future if Intel designs CPUs that can handle BEXTR as a single uop with
1c latency, mov+bextr will become good-ish everywhere.
For code-size, BEXTR has a definite advantage for bitfields of 8 bits or
wider: the mask no longer fits in a sign-extended imm8, so AND $imm32, %r32 is
6 bytes long instead of 3.