https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |NEW
Last reconfirmed|2019-03-04 00:00:00 |2026-02-02
--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
I believe this is all about by-pieces move tuning. The problematic thing
we do is:
movq $0, 64(%rsp)
movq %rax, 72(%rsp)
movdqa 64(%rsp), %xmm1
movaps %xmm1, 48(%rsp)
...
movq $1, 80(%rsp)
movsd %xmm0, 88(%rsp)
movdqa 80(%rsp), %xmm2
movaps %xmm2, 48(%rsp)
The earlier scalar stores fail to forward to the wider XMM1/XMM2 loads, so
each load stalls until both stores have committed. With higher optimization
levels we simply elide some of the copies. The above will be bad on any
uarch.
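For illustration, here is a minimal C sketch of the hazard (my own
reduction, not the testcase from this PR): two 8-byte scalar stores
immediately followed by an aligned 16-byte load covering both of them.
The store buffer cannot combine two separate stores to satisfy one wider
load, so the load has to wait for both stores to commit.

#include <emmintrin.h>
#include <stdint.h>

/* Two narrow stores followed by one wide load over the same bytes;
   the load cannot be serviced by store-to-load forwarding.  */
__m128i
pack_pair (uint64_t a, uint64_t b)
{
  uint64_t buf[2] __attribute__ ((aligned (16)));
  buf[0] = a;                                    /* 8-byte scalar store */
  buf[1] = b;                                    /* 8-byte scalar store */
  return _mm_load_si128 ((const __m128i *) buf); /* 16-byte vector load */
}

Whether the memory round-trip actually survives depends on the optimization
level; as noted above, at higher levels the copies may simply be elided.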
I'll note that -O0 seems not to use XMM moves.
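For contrast (again just a sketch, not claiming this is what -O0 emits):
when every load has the same width and address as the store that produced
the data, store-to-load forwarding succeeds and there is no stall.

#include <stdint.h>

/* Each 8-byte load exactly matches a preceding 8-byte store, so the
   data can be forwarded directly from the store buffer.  */
void
copy_pair (uint64_t *dst, uint64_t a, uint64_t b)
{
  uint64_t buf[2];
  buf[0] = a;      /* 8-byte store */
  buf[1] = b;      /* 8-byte store */
  dst[0] = buf[0]; /* 8-byte load, same width and address: forwards */
  dst[1] = buf[1]; /* 8-byte load, same width and address: forwards */
}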
I'm not sure we can/should do much about this, as Jakub says.