https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979
--- Comment #27 from Paul Caprioli <paul at hpkfft dot com> ---
The motivation for this bug report was accuracy (one might even say correctness), not so much performance. Using FMA in a complex product gives lower maximum relative normwise error. An explanation is given in section 1.1 of https://inria.hal.science/hal-04714173 (and references are given to papers proving the theory).

The experimental results in that paper in sections 3 and 4 show that GCC is more accurate than clang for complex multiplication for the code that was tested. GCC (unlike clang) is using FMA for that code, which is great.

The "always" in the title of this bug expresses the desire to have FMA used regardless of whether a function is inlined, whether constant propagation allows compile-time computation of the product, whether the code is vectorized, and regardless of cost model or other optimization decisions. For scientific work, it's nice to have this robustness.

As an aside comment, in the code for "fast complex" at the bottom of comment 26, I'm not sure I understand:

    vmovshdup %xmm0, %xmm4
    vmovss    %xmm0, -8(%rsp)
    vmovss    %xmm4, -4(%rsp)
    vmovq     -8(%rsp), %xmm0

It seems %xmm0 is split into two scalars, which are each stored, and then %xmm0 is loaded to the same value it already has. (If %xmm0 needs to be stored on the stack, then one 8-byte store could be used instead of the shuffle (vmovshdup) and the two 4-byte stores.)
