https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979
--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Paul Caprioli from comment #33)
> > Despite the issue described above?
>
> Yes, despite the issue you raised, I would still advise GCC to use the
> mul_fma algorithm instead of the mul_mul_add algorithm. I agree, of course,
> that there are input values for which rounding errors cancel in the latter
> but not in the former. But the reverse is true as well. There are input
> values for which the mul_fma algorithm is the more accurate one.
> So, for some inputs, one algorithm is better, and for other inputs, the
> other algorithm is better.
>
> If error is measured norm-wise, then the mul_fma algorithm has lower maximum
> error. I think norm-wise error is a reasonable measure of goodness.
> Without application-specific knowledge of why complex numbers are being
> used, it's reasonable to consider a complex number as a single mathematical
> entity.
>
> Cartesian coordinates are a useful representation, but not the only one.
> Polar coordinates are a valid representation as well. For (a + bi) where a
> >> b, the complex number is close to the real axis. Its magnitude is close
> to |a| and its angle is close to zero (if a is positive). Is the relative
> error of the computed b important if b itself is very tiny? An error in b
> may be large relative to b, but not significantly change the magnitude or
> angle of the complex number.
>
> For component-wise relative accuracy, both mul_mul_add and mul_fma can be
> highly inaccurate due to possible catastrophic cancellation in the real or
> imaginary part of the computed product.
>
> The mul_fma algorithm has better throughput (performance) than mul_mul_add.
>
> For _Float16, x86 has standardized the complex multiply instruction as:
>
> tmp.fp16[0] := src1.fp16[0] * src2.fp16[0]
> tmp.fp16[1] := src1.fp16[1] * src2.fp16[0]
> dest.fp16[0] := tmp.fp16[0] - src1.fp16[1] * src2.fp16[1]
> dest.fp16[1] := tmp.fp16[1] + src1.fp16[0] * src2.fp16[1]
>
> so it's reasonable to use mul_fma also for single and double.
>
> > And note that since use of FMA is
> > asymmetric, complex multiplication becomes non-commutative (imag(c0*c1) !=
> > imag(c1*c0) if you expand both sides with the same use of FMA), which can
> > make the result unpredictable as long as GCC doesn't promise to always keep
> > operands of complex multiplication in the source order.
>
> Yes. It lacks beauty. I mean that sincerely. Maybe compilers (or even the
> C or C++ standards) should promise something (similar to how x86 documents
> VFMULCSH). But, even if they do, it's not commutative.
>
> I do like the idea of implementing a more accurate complex multiply.
> Probably, for performance reasons, many users would not want that to be the
> default. A highly accurate cmulha() could be provided. Then users could mix
> that with operator*(). Also, a compiler flag could cause cmulha() to be
> used anytime a developer writes *.
>
> I have seen applications that care about component-wise error of complex
> numbers. In some sense, they have (or want) a tuple of two real numbers,
> and a complex number is a convenient way to represent their data in a
> computer program. In some sense, it's not really a mathematical complex
> number.
I'd mention that we have -fcx-method={limited-range,fortran,stdc} which
could be "easily" extended to implement alternate inline lowering for
complex multiplication. Note the option also affects division; splitting
it, or allowing combined arguments like -fcx-method=stdc,mul-fma, might be
possible.
Note there's also libgcc's multiplication/division implementation, which to
date does not use IFUNCs to select advanced ISA variants but is a plain
C implementation.