https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824
--- Comment #6 from Chris Elrod <elrodc at gmail dot com> ---
Hongtao Liu, I do think that one should ideally be able to get optimal codegen
when using 512-bit builtin vectors or vector intrinsics, without needing to set
`-mprefer-vector-width=512` (and, currently, also setting
`-mtune-ctrl=avx512_move_by_pieces`).
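To make that concrete, here is a minimal sketch of the kind of code involved, written with GCC's vector extensions; the names and the masked product-rule formulation are my illustration, not the exact test case from this PR:

typedef double Vec8d __attribute__((vector_size(64))); // one 512-bit zmm

struct Dual8 { Vec8d v; }; // hypothetical stand-in for Dual<double, 7l>:
                           // lane 0 is the value, lanes 1..7 the partials

static inline Vec8d broadcast(double x) {
    return Vec8d{x, x, x, x, x, x, x, x};
}

// Dual-number product rule: value = a0*b0, partial_i = a0*b_i + b0*a_i.
// Excluding lane 0 from the second term is what the masked FMAs
// (kmovb k1 with mask 0xfe) implement in the dump below.
static inline Dual8 prod(Dual8 a, Dual8 b) {
    Vec8d r = broadcast(a.v[0]) * b.v;              // a0*b0, a0*b_1..b_7
    Vec8d keep_partials = {0, 1, 1, 1, 1, 1, 1, 1};
    r += keep_partials * (broadcast(b.v[0]) * a.v); // add b0*a_i, lanes 1..7
    return Dual8{r};
}

(Assuming something like `-O3 -march=skylake-avx512`; the arithmetic comes out as zmm ops either way, and the flags in question only affect the surrounding moves and spills, as the dump below shows.)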
For example, if I remove `-mprefer-vector-width=512`, I get:
prod(Dual<Dual<double, 7l>, 2l>&, Dual<Dual<double, 7l>, 2l> const&, Dual<Dual<double, 7l>, 2l> const&):
push rbp
mov eax, -2
kmovb k1, eax
mov rbp, rsp
and rsp, -64
sub rsp, 264
vmovdqa ymm4, YMMWORD PTR [rsi+128]
vmovapd zmm8, ZMMWORD PTR [rsi]
vmovapd zmm9, ZMMWORD PTR [rdx]
vmovdqa ymm6, YMMWORD PTR [rsi+64]
vmovdqa YMMWORD PTR [rsp+8], ymm4
vmovdqa ymm4, YMMWORD PTR [rdx+96]
vbroadcastsd zmm0, xmm8
vmovdqa ymm7, YMMWORD PTR [rsi+96]
vbroadcastsd zmm1, xmm9
vmovdqa YMMWORD PTR [rsp-56], ymm6
vmovdqa ymm5, YMMWORD PTR [rdx+128]
vmovdqa ymm6, YMMWORD PTR [rsi+160]
vmovdqa YMMWORD PTR [rsp+168], ymm4
vxorpd xmm4, xmm4, xmm4
vaddpd zmm0, zmm0, zmm4
vaddpd zmm1, zmm1, zmm4
vmovdqa YMMWORD PTR [rsp-24], ymm7
vmovdqa ymm7, YMMWORD PTR [rdx+64]
vmovapd zmm3, ZMMWORD PTR [rsp-56]
vmovdqa YMMWORD PTR [rsp+40], ymm6
vmovdqa ymm6, YMMWORD PTR [rdx+160]
vmovdqa YMMWORD PTR [rsp+200], ymm5
vmulpd zmm2, zmm0, zmm9
vmovdqa YMMWORD PTR [rsp+136], ymm7
vmulpd zmm5, zmm1, zmm3
vbroadcastsd zmm3, xmm3
vmovdqa YMMWORD PTR [rsp+232], ymm6
vaddpd zmm3, zmm3, zmm4
vmovapd zmm7, zmm2
vmovapd zmm2, ZMMWORD PTR [rsp+8]
vfmadd231pd zmm7{k1}, zmm8, zmm1
vmovapd zmm6, zmm5
vmovapd zmm5, ZMMWORD PTR [rsp+136]
vmulpd zmm1, zmm1, zmm2
vfmadd231pd zmm6{k1}, zmm9, zmm3
vbroadcastsd zmm2, xmm2
vmovapd zmm3, ZMMWORD PTR [rsp+200]
vaddpd zmm2, zmm2, zmm4
vmovapd ZMMWORD PTR [rdi], zmm7
vfmadd231pd zmm1{k1}, zmm9, zmm2
vmulpd zmm2, zmm0, zmm5
vbroadcastsd zmm5, xmm5
vmulpd zmm0, zmm0, zmm3
vbroadcastsd zmm3, xmm3
vaddpd zmm5, zmm5, zmm4
vaddpd zmm3, zmm3, zmm4
vfmadd231pd zmm2{k1}, zmm8, zmm5
vfmadd231pd zmm0{k1}, zmm8, zmm3
vaddpd zmm2, zmm2, zmm6
vaddpd zmm0, zmm0, zmm1
vmovapd ZMMWORD PTR [rdi+64], zmm2
vmovapd ZMMWORD PTR [rdi+128], zmm0
vzeroupper
leave
ret
prod(Dual<Dual<double, 8l>, 2l>&, Dual<Dual<double, 8l>, 2l> const&, Dual<Dual<double, 8l>, 2l> const&):
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 648
vmovdqa ymm5, YMMWORD PTR [rsi+224]
vmovdqa ymm3, YMMWORD PTR [rsi+352]
vmovapd zmm0, ZMMWORD PTR [rdx+64]
vmovdqa ymm2, YMMWORD PTR [rsi+320]
vmovdqa YMMWORD PTR [rsp+104], ymm5
vmovdqa ymm5, YMMWORD PTR [rdx+224]
vmovdqa ymm7, YMMWORD PTR [rsi+128]
vmovdqa YMMWORD PTR [rsp+232], ymm3
vmovsd xmm3, QWORD PTR [rsi]
vmovdqa ymm6, YMMWORD PTR [rsi+192]
vmovdqa YMMWORD PTR [rsp+488], ymm5
vmovdqa ymm4, YMMWORD PTR [rdx+192]
vmovapd zmm1, ZMMWORD PTR [rsi+64]
vbroadcastsd zmm5, xmm3
vmovdqa YMMWORD PTR [rsp+200], ymm2
vmovdqa ymm2, YMMWORD PTR [rdx+320]
vmulpd zmm8, zmm5, zmm0
vmovdqa YMMWORD PTR [rsp+8], ymm7
vmovdqa ymm7, YMMWORD PTR [rsi+256]
vmovdqa YMMWORD PTR [rsp+72], ymm6
vmovdqa ymm6, YMMWORD PTR [rdx+128]
vmovdqa YMMWORD PTR [rsp+584], ymm2
vmovsd xmm2, QWORD PTR [rdx]
vmovdqa YMMWORD PTR [rsp+136], ymm7
vmovdqa ymm7, YMMWORD PTR [rdx+256]
vmovdqa YMMWORD PTR [rsp+392], ymm6
vmovdqa ymm6, YMMWORD PTR [rdx+352]
vmulsd xmm10, xmm3, xmm2
vmovdqa YMMWORD PTR [rsp+456], ymm4
vbroadcastsd zmm4, xmm2
vfmadd231pd zmm8, zmm4, zmm1
vmovdqa YMMWORD PTR [rsp+520], ymm7
vmovdqa YMMWORD PTR [rsp+616], ymm6
vmulpd zmm9, zmm4, ZMMWORD PTR [rsp+72]
vmovsd xmm6, QWORD PTR [rsp+520]
vmulpd zmm4, zmm4, ZMMWORD PTR [rsp+200]
vmulpd zmm11, zmm5, ZMMWORD PTR [rsp+456]
vmovsd QWORD PTR [rdi], xmm10
vmulpd zmm5, zmm5, ZMMWORD PTR [rsp+584]
vmovapd ZMMWORD PTR [rdi+64], zmm8
vfmadd231pd zmm9, zmm0, QWORD PTR [rsp+8]{1to8}
vfmadd231pd zmm4, zmm0, QWORD PTR [rsp+136]{1to8}
vmovsd xmm0, QWORD PTR [rsp+392]
vmulsd xmm7, xmm3, xmm0
vbroadcastsd zmm0, xmm0
vmulsd xmm3, xmm3, xmm6
vfmadd132pd zmm0, zmm11, zmm1
vbroadcastsd zmm6, xmm6
vfmadd132pd zmm1, zmm5, zmm6
vfmadd231sd xmm7, xmm2, QWORD PTR [rsp+8]
vfmadd132sd xmm2, xmm3, QWORD PTR [rsp+136]
vaddpd zmm0, zmm0, zmm9
vaddpd zmm1, zmm1, zmm4
vmovapd ZMMWORD PTR [rdi+192], zmm0
vmovsd QWORD PTR [rdi+128], xmm7
vmovsd QWORD PTR [rdi+256], xmm2
vmovapd ZMMWORD PTR [rdi+320], zmm1
vzeroupper
leave
ret
GCC respects the vector builtins and uses 512-bit ops for the arithmetic, but it splits the copies into 256-bit halves and spills them across function boundaries.
So, what I'm arguing is: while it would be great for GCC to respect
`-mprefer-vector-width=512`, it should ideally also respect vector
builtins/intrinsics on their own, so that one can use full-width vectors
without also having to set `-mprefer-vector-width=512
-mtune-ctrl=avx512_move_by_pieces`.
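The move-by-pieces half of that workaround is easy to see in isolation. A minimal sketch (my illustration, assuming a 64-byte aggregate and something like `-O3 -march=skylake-avx512`):

struct Buf { double d[8]; }; // 64 bytes, no explicit vector type

void copy_buf(Buf &dst, const Buf &src) {
    // A 64-byte block copy: with the default tuning this is emitted as
    // two 256-bit ymm moves; -mtune-ctrl=avx512_move_by_pieces turns it
    // into a single zmm load/store, matching the splits visible in the
    // dumps above.
    dst = src;
}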