https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979
--- Comment #32 from Paul Caprioli <paul at hpkfft dot com> ---
> Yes, that part is definitely very bad as it also breaks store-to-load
> forwarding. It's one of those argument-return issues that plagues GCC.
Yes, since x86 provides total store order, all the older stores in the store
buffer have to drain as well, so that's the biggest performance problem. (I
just now noticed this was mentioned in comment 18. Sorry for repeating.)
For the sake of conversation, let's assume the function is inlined so that
this problem goes away.
I was thinking about your comment about using AVX512 embedded broadcast. I
think that's a nice idea for a complex multiply. Without the embedded
broadcast, the code below is 2 loads, 3 shuffles, and 2 floating-point
instructions. If the hardware has only 1 shuffle port (e.g., Intel Sapphire
Rapids), the 3 shuffles are the limiting resource.

    vmovq (%rdi), %xmm0
    vmovq (%rsi), %xmm2
    vmovsldup %xmm0, %xmm1
    vmovshdup %xmm0, %xmm0
    vshufps $0xe1, %xmm2, %xmm2, %xmm3
    vmulps %xmm3, %xmm0, %xmm0
    vfmaddsub231ps %xmm2, %xmm1, %xmm0

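
As a sanity check on the data flow of the seven instructions above, here is a rough scalar C model of the four float lanes. It is only a sketch: the type `v4f` and the helpers (`movq`, `movsldup`, `movshdup`, `shufps`, `mulps`, `fmaddsub231ps`, `cmul_model`) are hypothetical names that mimic the lane semantics of the instructions they are named after. The multiply and add are not fused here, so rounding can differ from the real FMA.

```c
#include <assert.h>

/* One xmm register viewed as four floats; f[0] is bits [31:0]. */
typedef struct { float f[4]; } v4f;

/* vmovq: load 64 bits into lanes 0-1 and zero lanes 2-3. */
static v4f movq(const float *p) { v4f r = {{p[0], p[1], 0.0f, 0.0f}}; return r; }

/* vmovsldup / vmovshdup: duplicate the even / odd lanes. */
static v4f movsldup(v4f a) { v4f r = {{a.f[0], a.f[0], a.f[2], a.f[2]}}; return r; }
static v4f movshdup(v4f a) { v4f r = {{a.f[1], a.f[1], a.f[3], a.f[3]}}; return r; }

/* vshufps $imm, b, a, dst (AT&T order): lanes 0-1 come from a, lanes 2-3 from b. */
static v4f shufps(v4f a, v4f b, unsigned imm) {
    assert(imm <= 0xffu);
    v4f r = {{a.f[imm & 3u], a.f[(imm >> 2) & 3u],
              b.f[(imm >> 4) & 3u], b.f[(imm >> 6) & 3u]}};
    return r;
}

static v4f mulps(v4f a, v4f b) {
    v4f r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] * b.f[i];
    return r;
}

/* vfmaddsub231ps y, x, acc: even lanes x*y - acc, odd lanes x*y + acc. */
static v4f fmaddsub231ps(v4f acc, v4f x, v4f y) {
    v4f r;
    for (int i = 0; i < 4; i++) {
        float p = x.f[i] * y.f[i];
        r.f[i] = (i & 1) ? p + acc.f[i] : p - acc.f[i];
    }
    return r;
}

/* a = (ar, ai), b = (br, bi); out = a * b as (re, im). */
void cmul_model(const float a[2], const float b[2], float out[2]) {
    v4f x0 = movq(a);                 /* (ar, ai, 0, 0) */
    v4f x2 = movq(b);                 /* (br, bi, 0, 0) */
    v4f x1 = movsldup(x0);            /* (ar, ar, 0, 0) */
    x0 = movshdup(x0);                /* (ai, ai, 0, 0) */
    v4f x3 = shufps(x2, x2, 0xe1);    /* (bi, br, 0, 0) */
    x0 = mulps(x3, x0);               /* (ai*bi, ai*br, 0, 0) */
    x0 = fmaddsub231ps(x0, x1, x2);   /* (ar*br - ai*bi, ar*bi + ai*br, 0, 0) */
    out[0] = x0.f[0];
    out[1] = x0.f[1];
}
```

Calling `cmul_model` with a = (1, 2) and b = (3, 4) produces (-5, 10), which matches (1+2i)(3+4i) = -5+10i.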
For double, using the EVEX.128 encoding, I think your idea just works! The two
dups are eliminated and only one shuffle remains.
For float, I think it's good to avoid setting the invalid bit in MXCSR in case
an infinity is broadcast and multiplied by a zero in the unused upper lanes.
Instead of using a mask register, maybe the vshufps can use $0x11 in place of
$0xe1 so that bits [127:64] are written to be the same as [63:0]. The idea is
to perform the same math in the top 64 bits as in the bottom 64.
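
To see what the two immediates select, here is a small C sketch that decodes a vshufps immediate and applies it with both sources equal, as in the sequence above. The function names `shufps_selectors` and `shufps_same` are made up for illustration, and the sketch only shows the lane selection; it says nothing about the other operand of the multiply.

```c
#include <assert.h>

/* Split a vshufps immediate into its four 2-bit lane selectors.
   sel[0] and sel[1] index the first source, sel[2] and sel[3] the second. */
static void shufps_selectors(unsigned imm, unsigned sel[4]) {
    assert(imm <= 0xffu);
    for (int i = 0; i < 4; i++)
        sel[i] = (imm >> (2 * i)) & 3u;
}

/* Model of vshufps $imm, %x, %x, %dst, i.e. both sources are the same register. */
void shufps_same(const float src[4], unsigned imm, float dst[4]) {
    unsigned sel[4];
    shufps_selectors(imm, sel);
    for (int i = 0; i < 4; i++)
        dst[i] = src[sel[i]];
}
```

With src = (br, bi, 0, 0), $0xe1 gives (bi, br, 0, 0), leaving zeros in the top 64 bits, while $0x11 gives (bi, br, bi, br), so the top half repeats the bottom half.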
For _Float16, the complex multiply instruction VFMULCSH does all the shuffles
for free in the floating-point unit, so that's best.
For multiplying arrays of complex numbers (not just one complex multiply), the
embedded broadcast doesn't seem useful. You'd want to use the full SIMD
register to do more than a single complex multiply at a time.
For single:

    vmovsldup (%rax), %ymm0
    vmovshdup (%rax), %ymm1
    vmovups (%rdx), %ymm4
    vshufps $0xB1, %ymm4, %ymm4, %ymm5
    vmulps %ymm1, %ymm5, %ymm5
    vfmaddsub231ps %ymm0, %ymm4, %ymm5

Above is 3 loads, 1 shuffle, 2 floating-point instructions. The duplication is
done for free in the load unit by using the memory-operand form of the dup
instructions.
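
For reference, the six instructions above compute four of the following per-element products at once, assuming the usual interleaved (re, im) layout. This is just the naive formula written as a scalar loop, without the NaN/infinity recovery step of C99 Annex G; the name `cmul_array_ref` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* a, b and out each hold n complex floats stored as (re, im) pairs. */
void cmul_array_ref(const float *a, const float *b, float *out, size_t n) {
    assert(n == 0 || (a && b && out));
    for (size_t i = 0; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        out[2 * i]     = ar * br - ai * bi;  /* even lane: subtract */
        out[2 * i + 1] = ar * bi + ai * br;  /* odd lane: add */
    }
}
```

The subtract-in-even, add-in-odd pattern is what vfmaddsub231ps provides, which is why only the one swap shuffle is needed on the second operand.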
For double, sadly there's no high-element counterpart of vmovddup (no such
thing as a "vmovdhdup"). It's probably good to use vmovddup from memory to do
the one duplication for free in the load unit. The other duplication is a
register-to-register shuffle. So, 3 loads, 2 shuffles, 2 floating-point
instructions.
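
One possible shape of that double sequence, again as a scalar C lane model rather than real code. The choice of vpermilpd for both register shuffles is an assumption (the text above does not name the instruction), and `v4d`, `loadu`, `movddup`, `permilpd`, `mulpd`, `fmaddsub231pd` and `cmul2_double_model` are hypothetical names. The model covers one ymm register, i.e. two complex doubles.

```c
#include <assert.h>

/* One ymm register viewed as four doubles; d[0] is bits [63:0]. */
typedef struct { double d[4]; } v4d;

static v4d loadu(const double *p) { v4d r = {{p[0], p[1], p[2], p[3]}}; return r; }

/* vmovddup from memory: duplicate the even element of each 128-bit half. */
static v4d movddup(const double *p) { v4d r = {{p[0], p[0], p[2], p[2]}}; return r; }

/* vpermilpd $imm: bit i picks the low (0) or high (1) element
   of the 128-bit half that lane i belongs to. */
static v4d permilpd(v4d a, unsigned imm) {
    assert(imm <= 0xfu);
    v4d r;
    for (int i = 0; i < 4; i++)
        r.d[i] = a.d[(i & ~1) + (int)((imm >> i) & 1u)];
    return r;
}

static v4d mulpd(v4d a, v4d b) {
    v4d r;
    for (int i = 0; i < 4; i++) r.d[i] = a.d[i] * b.d[i];
    return r;
}

/* Even lanes x*y - acc, odd lanes x*y + acc (not fused in this model). */
static v4d fmaddsub231pd(v4d acc, v4d x, v4d y) {
    v4d r;
    for (int i = 0; i < 4; i++) {
        double p = x.d[i] * y.d[i];
        r.d[i] = (i & 1) ? p + acc.d[i] : p - acc.d[i];
    }
    return r;
}

/* a, b, out: two complex doubles each, stored as (re, im) pairs. */
void cmul2_double_model(const double *a, const double *b, double *out) {
    v4d re = movddup(a);               /* load 1, dup in the load: (a0r,a0r,a1r,a1r) */
    v4d im = permilpd(loadu(a), 0xF);  /* load 2 + shuffle 1:      (a0i,a0i,a1i,a1i) */
    v4d vb = loadu(b);                 /* load 3:                  (b0r,b0i,b1r,b1i) */
    v4d sw = permilpd(vb, 0x5);        /* shuffle 2:               (b0i,b0r,b1i,b1r) */
    v4d acc = mulpd(im, sw);           /* floating-point 1 */
    acc = fmaddsub231pd(acc, re, vb);  /* floating-point 2 */
    for (int i = 0; i < 4; i++) out[i] = acc.d[i];
}
```

That adds up to the 3 loads, 2 shuffles, and 2 floating-point operations counted above.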
For _Float16, VFMULCPH does all shuffles for free in the floating-point unit.