https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631
--- Comment #5 from Peter Cordes <pcordes at gmail dot com> ---
(In reply to Richard Biener from comment #2)
> (In reply to Hongtao Liu from comment #1)
> > It's done by r12-1958, it's better for dcache, but worse for icache, small
> > benchmark in the commit show broadcast from integer is slightly better than
> > constant pool, maybe we should make it as a u-arch specific tuning.
>
> I see it was benchmarked on Intel CPU which have a shared register file, I
> was specifically wondering of the AMD case where any integer <-> FP/vector
> boundary crossing incurs a latency penalty.

Intel and AMD CPUs both have 3 separate register files: integer, SIMD/FP, and
x87/k-mask registers.  (I forget if AMD shares a PRF for x87 and AVX-512 mask
registers, but they both definitely have separate PRFs for GPR vs. X/Y/ZMM.
The Chips and Cheese Zen 5 article shows it using a separate EFLAGS register
file, too; Intel has enough extra bits in every integer PRF entry to hold a
GPR + condition-codes result.)

https://www.realworldtech.com/haswell-cpu/6/
https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram
https://chipsandcheese.com/p/lion-cove-intels-p-core-roars
https://chipsandcheese.com/p/amds-ryzen-9950x-zen-5-on-desktop

The difference is in the latency of domain-crossing uops that read integer
and write SIMD/FP or vice versa.  It's lower on Intel P-cores, higher on AMD
and on Intel E-cores.  That's probably a matter of how physically close the
wiring is.  And/or perhaps of the difficulty of building a scheduler that can
track dependencies across both domains?  If the latter is a factor: Intel
(before Lion Cove) uses a mostly-unified scheduler whose entries can hold
either vector or integer ALU uops, plus another group of entries for memory.
Lion Cove has separate scheduler entries that can only hold vector or only
hold integer uops.  AMD has a couple of separate scheduling queues for groups
of SIMD/FP execution ports; Intel E-cores are like that, too.
(https://stackoverflow.com/questions/72577590/why-does-movd-movq-between-gp-and-simd-registers-have-quite-high-latency)

> In reality what is faster always depends on the surrounding code, but
> IMO code size (and uop cache space) wins easily.  Quite likely
> doing full vector loads from the constant pool for XMM initialization
> is better than broadcast from scalar, possibly even YMM, for the same
> reason.

It depends on the code.  Simple code that processes a *lot* of data can have
a relatively small I-cache footprint while churning through D-cache so much
that constants are likely to be evicted.  (Maybe they survive in L3 thanks to
an adaptive replacement policy that tries to avoid letting loops over huge
working sets blow everything away.)  In code with really long-running loops,
the cost of setting up constants is amortized over many iterations so it
barely matters, but it gets more interesting with cache-blocking, where
you're alternating between a few hot functions.

I'd be very skeptical of full-vector constants unless that enables using them
as memory source operands for other ops (without AVX-512 for embedded
broadcasts).  Even then it's a real tradeoff: VPBROADCASTD loads are a single
uop and have the same throughput and latency as VMOVUPS, and D-cache pressure
is a real thing, too.
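To be concrete about what we're comparing, a minimal sketch (hypothetical
function and made-up constant; any of the commented sequences is a valid way
to compile it with -mavx2):

  #include <immintrin.h>

  __m256i add_bias(__m256i v) {
      // Full 32-byte constant, foldable as a memory operand:
      //   vpaddd  ymm0, ymm0, YMMWORD PTR .LC0[rip]
      // 4-byte constant: single load uop, but an extra instruction:
      //   vpbroadcastd ymm1, DWORD PTR .LC0[rip]
      //   vpaddd       ymm0, ymm0, ymm1
      // No .rodata at all, but crosses the integer->SIMD domain
      // (the "broadcast from integer" strategy from r12-1958):
      //   mov          eax, 0x12345
      //   vmovd        xmm1, eax
      //   vpbroadcastd ymm1, xmm1
      //   vpaddd       ymm0, ymm0, ymm1
      return _mm256_add_epi32(v, _mm256_set1_epi32(0x12345));
  }

The full-width load wins on instruction count when the constant is used once
and can't be hoisted; the broadcast forms win on .rodata footprint (4 or 0
bytes instead of 32).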
I was discussing this with Maxim Egorushkin recently; he was saying
(https://stackoverflow.com/questions/30674291/how-to-check-inf-for-avx-intrinsic-m256/79732626#comment140653747_31203100)
that things like asm("" : "+v"(x), "+r"(y)); were often necessary to block
constant-propagation into a load from .rodata when a purely-ALU strategy for
materializing a constant was better, especially cases like VPCMPEQD / VPSRLD
to make a mask like 0x7FFFFFFF.  (Linear algebra HPC on Zen 3, where
apparently L1d cache pressure is a big deal for his use-case, so loading
unnecessary constants was a big problem.)

So yeah, speaking of that, set1_epi32(0x7FFFFFFF) should definitely use
VPCMPEQD / VPSRLD, especially if you also need an all-ones constant for
something else, or can make other constants from the same all-ones.
(https://stackoverflow.com/questions/35085059/what-are-the-best-instruction-sequences-to-generate-vector-constants-on-the-fly)

(But on Intel CPUs, beware that a register value written by SIMD-integer
instructions adds extra latency for *both* operands of FP instructions
reading it, indefinitely, long after it's written back.  So materializing an
FP constant like 1.0 from an all-ones from VPCMPEQD could infect the critical
path using it with extra latency until the next XSAVE/XRSTOR.
https://stackoverflow.com/questions/64116679/haswell-avx-fma-latencies-tested-1-cycle-slower-than-intels-guide-says
- it's a thing on Skylake and Haswell at least; IDK if the effect is present
on later uarches.)

> I'm not sure how to best do a micro-benchmark measuring the actual
> latency of the variants in question.

Latency would usually only be relevant after a branch mispredict, I-cache
miss, or other stall.  Normally out-of-order exec can get constants
materialized ahead of runtime-variable data being ready.  (Unless we get a
D-cache miss loading from .rodata, in which case it can be a big stall.)

Throughput is a much more relevant metric for use-cases that don't get
hoisted out of loops.  In that case a 4-byte broadcast-load is the obvious
winner, especially without AVX-512: it only costs a tiny bit of extra code
size vs. a full-vector memory operand, no extra uops (not even in the
back-end), and it saves a lot of D-cache footprint, which is nice when you
have multiple constants (especially ones that get loaded together).

Your microbenchmark measures throughput for the hot-cache case, which of
course makes loads look good.  It also makes I-cache and D-cache pressure
irrelevant.

4 bytes is probably the narrowest we should go for a memory source:
VPBROADCASTB/W cost an extra ALU shuffle uop in the back-end, which is worse
for latency and competes for the shuffle port, often a bottleneck in SIMD
algorithms that do any shuffling (so competing with shuffles from the
previous loop, for example).  VPBROADCASTB/W do stay micro-fused, though, so
they still only take up 1 slot in the uop cache and even in the ROB.
VPBROADCASTD/Q (and VBROADCASTF128/I128) are single-uop, handled entirely by
the load execution unit on Intel and AMD.

A wide non-broadcast load can be worth considering for XMM when we don't have
AVX available for VBROADCASTSS or AVX2 for VPBROADCASTD: the extra front-end
cost of MOVD + PSHUFD instead of just a single load is kinda bad.

AVX-512 makes MOV-immediate + ALU a lot more attractive, since VPBROADCASTD/Q
with a GPR source saves an instruction vs. MOVD + shuffle.  That seems like a
good choice when available, if we can't get the constant from an all-ones
vector in one instruction, like a shift, or PABSB/W/D (for set1(1)), or
something.
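A minimal sketch of the VPCMPEQD / VPSRLD trick with that barrier (function
name hypothetical; using the "x" constraint since this only assumes AVX2,
where Maxim's "v" also allows xmm16-31 with AVX-512):

  #include <immintrin.h>

  // set1_epi32(0x7FFFFFFF) from ALU ops instead of a .rodata load:
  //   vpcmpeqd ymm0, ymm0, ymm0   # all-ones, no load, no input dependency
  //   vpsrld   ymm0, ymm0, 1      # 0x7FFFFFFF in every element
  static inline __m256i abs_mask(void) {
      __m256i ones = _mm256_set1_epi32(-1);  // compiles to vpcmpeqd same,same,same
      asm("" : "+x"(ones));   // hide the all-ones value from the optimizer so
                              // it can't fold the shift back into a .rodata load
      return _mm256_srli_epi32(ones, 1);
  }

Without the empty asm, the compiler constant-folds the whole thing back into
a 32-byte constant-pool load, which is exactly what his use-case was trying
to avoid.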
I don't know what strategy is best as a default, especially for just AVX2,
but for most programs I'd be surprised if full-width non-broadcast vectors
are better.  (Again, unless code can't hoist its constants, because there
aren't any loops or because of register pressure, and full-width enables
using them as memory-source operands.)
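For reference, what the AVX-512 options mentioned above look like on the same
hypothetical function (GPR-source VPBROADCASTD and embedded broadcast both
need AVX-512VL for YMM):

  #include <immintrin.h>

  __m256i add_bias_512(__m256i v) {
      // mov-immediate + GPR-source broadcast: no .rodata, no separate VMOVD:
      //   mov          eax, 0x12345
      //   vpbroadcastd ymm1, eax
      //   vpaddd       ymm0, ymm0, ymm1
      // or embedded broadcast folds a 4-byte constant into the consumer:
      //   vpaddd ymm0, ymm0, DWORD PTR .LC0[rip]{1to8}
      return _mm256_add_epi32(v, _mm256_set1_epi32(0x12345));
  }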
