https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063
--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---

Unfortunately, Intel Haswell/Skylake implement BEXTR as 2 uops with 2c latency. Presumably those uops are a shift + bzhi, so 1p06 + 1p15 would explain Agner Fog's experimental result of 2p0156 for BEXTR, with 0.5c throughput.

On AMD Excavator/Ryzen, it's 1 uop with 1c latency. On Steamroller and earlier, it's 2 uops but 1c latency. (I assume that's latency from the non-control input to the output. So maybe one of the uops pre-processes the control input, otherwise you'd expect 2c latency from either operand.) Ryzen dropped support for AMD TBM, so only Excavator (bdver4) has the 1-uop bextr imm16 form, which avoids the need for a mov reg,imm32 to set up the control operand. But mov-imm + bextr can still be a win on Ryzen: lower latency than RORX+AND.

BMI2 RORX is single-uop on all CPUs that support it. If we already need a 2nd uop to mask anyway, we can use RORX + AND-immediate to duplicate the functionality and performance of BEXTR-immediate, with smaller code-size if the AND mask fits in an imm8. (5+5 bytes vs. 6+3, or 6+4 if the AND needs a REX prefix.)

Without an immediate-source BEXTR (like AMD TBM has/had), the only advantage mov-immediate + bextr has (on Intel) over mov-reg + shift + and is that it can deal with wide bitfields using a count instead of an immediate AND mask (especially a mask that doesn't fit in 32 bits). If you can reuse the same control register in a loop, BEXTR is good-ish for copy-and-extract.

PEXT is 1 uop on Intel CPUs even though the simpler-looking BEXTR is 2. But PEXT is extremely slow on Ryzen (7 uops, 18c latency and throughput). So for 32-bit constants at least, mov r32,imm32 + PEXT to copy-and-extract is better than BEXTR on Intel.
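To make the shift + bzhi decomposition concrete, here's a minimal C reference model of what BEXTR computes (my reading of the instruction's semantics: control bits [7:0] = start, bits [15:8] = length). The two statements on the critical path correspond to the two uops hypothesized above; the exact port assignment is an assumption from Agner Fog's numbers.

```c
#include <stdint.h>

/* Reference model of BEXTR r64, r64, r64 (BMI1).  Control operand packs
 * start in bits [7:0] and length in bits [15:8].  On Haswell/Skylake the
 * instruction decodes to 2 uops, which plausibly behave like the two
 * steps below: a shift, then a BZHI-style truncation to `len` bits. */
static uint64_t bextr_model(uint64_t src, uint64_t control)
{
    unsigned start = control & 0xff;
    unsigned len   = (control >> 8) & 0xff;
    if (start > 63)
        return 0;                            /* start past operand size */
    uint64_t shifted = src >> start;         /* uop 1: shift  (p06)     */
    if (len > 63)
        return shifted;                      /* oversized len: no mask  */
    return shifted & ((1ull << len) - 1);    /* uop 2: mask   (p15)     */
}
```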
movabs imm64 is too big and can cause front-end problems (slower to read from the uop cache, if that effect from Sandybridge is still present on Haswell/Skylake), and has no advantage vs. RORX + AND unless the bitfield you're extracting is wider than 32 bits.

PEXT has 3-cycle latency, though, and can only run on port 1 on SnB-family. (All integer uops with latency > 1 are p1-only.) It's potentially good for throughput, but worse than RORX+AND for latency.

Unfortunately x86 bitfield instructions are pretty weak compared to ARM / AArch64 ubfx or PowerPC rlwinm and friends, where the bit positions are simply specified as immediates. Only AMD's immediate version of BEXTR (1 uop on Excavator) matched them. Having a bunch of different control operands for BEXTR or PEXT in registers might be usable in a loop, but that's a lot more rarely useful than immediate controls.

----

0000000000000000 <extract_skylake_hand_optimized>:
   0:	c4 e3 fb f0 c7 2a	rorx   $0x2a,%rdi,%rax   # $(64-22)
   6:	c4 e3 fb f0 d7 35	rorx   $0x35,%rdi,%rdx   # $(64-11)
   c:	83 e7 3f         	and    $0x3f,%edi
   f:	83 e0 3f         	and    $0x3f,%eax
  12:	83 e2 3f         	and    $0x3f,%edx
  15:	01 f8            	add    %edi,%eax   # 32-bit operand-size because we can prove it can't overflow
  17:	01 d0            	add    %edx,%eax   # missed optimization in both gcc's versions.
  19:	c3               	retq

Not counting the ret, this is 7 uops for Skylake and Ryzen. **I'm pretty sure this is our best bet for -march=skylake, and for tune=generic -mbmi2.**

The BEXTR intrinsics version is 9 uops for SKL, 7 for Ryzen, but is 2 bytes larger. (Not counting the savings from avoiding a REX prefix on the ADD instructions; that missed optimization applies equally to both.) OTOH, the critical-path latency for BEXTR on Ryzen is better by 1 cycle, so we could still consider it for -march=znver1, or for tune=generic -mbmi without BMI2.

The legacy mov+shr+and version is 10 uops because gcc wasted a `mov %rdi,%rax` instruction; it *should* be 9 uops for all normal CPUs.
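For reference, here's a plausible C source for extract_skylake_hand_optimized, reconstructed from the disassembly (the function name and exact field layout are an assumption): sum three 6-bit fields at bit offsets 0, 11 and 22. RORX $(64-22) rotates a field down to bit 0 without needing a mov to copy the input first, then AND $0x3f masks it.

```c
#include <stdint.h>

/* Hypothetical source matching the hand-optimized asm above:
 * three 6-bit bitfields at bit offsets 0, 11 and 22, summed. */
static unsigned extract3(uint64_t x)
{
    unsigned a = (unsigned)x         & 0x3f;   /* field at bit 0  */
    unsigned b = (unsigned)(x >> 11) & 0x3f;   /* field at bit 11 */
    unsigned c = (unsigned)(x >> 22) & 0x3f;   /* field at bit 22 */
    /* 32-bit adds are fine: the sum of three 6-bit values
     * can't overflow, which is the missed optimization noted above. */
    return a + b + c;
}
```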
----

With only BMI1 but not BMI2 enabled, we should probably use the mov-imm + BEXTR version. It's not worse than the mov+shr+and version on SnB-family or bd/zn, and it's better on some AMD. And it's probably smaller code-size. And in future, if Intel designs CPUs that can handle BEXTR as a single uop with 1c latency, mov+bextr will become good-ish everywhere.

For code-size, BEXTR has a definite advantage for bitfields wider than 1 byte, because AND $imm32, %r32 is 6 bytes long instead of 3.
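The PEXT copy-and-extract option mentioned earlier can also be modeled in portable C, for anyone checking the equivalence. This is a reference model of what I understand PEXT to compute (gather the src bits selected by mask into contiguous low bits), not the instruction itself; with a contiguous mask it degenerates into exactly the shift+and extraction discussed above.

```c
#include <stdint.h>

/* Reference model of PEXT r64, r64, r64 (BMI2): for each set bit in
 * mask (from LSB up), copy the corresponding bit of src into the next
 * free low bit of the result.  A real PEXT is 1 uop / 3c on Intel
 * SnB-family but microcoded and very slow on Ryzen. */
static uint64_t pext_model(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    unsigned k = 0;
    for (unsigned i = 0; i < 64; i++)
        if ((mask >> i) & 1)
            result |= ((src >> i) & 1) << k++;
    return result;
}
```

With a contiguous mask like 0xFF00, pext_model(src, 0xFF00) is just (src >> 8) & 0xFF, which is why mov r32,imm32 + PEXT can substitute for BEXTR on Intel.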