https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116109

--- Comment #3 from mjr19 at cam dot ac.uk ---
It might be helpful if GCC considered this optimisation separately from
unrolling.

Traditional unrolling attempts to reduce the overhead of the (integer) loop
control instructions, but with floating point loops these integer instructions
usually get issued for free simultaneously with the FP ones, and the branch is
fully predicted. So since processor advances in the mid-1990s, the gains for
unrolling FP loops have become rather small. GCC does not unroll them unless
explicitly asked to.

However, unrolling a reduction operation in order to break a data dependency is
still very relevant, and can produce speed-ups of a factor of three or more.
One might well expect this sort of unrolling at -O3 or even -O2, whereas the
other sort might not be expected at -O3 (and currently is not enabled even at
-Ofast).
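
To make the distinction concrete, here is a rough sketch of the reduction case
written out by hand. The function names and the choice of four partial sums are
only illustrative assumptions, not what GCC emits or what the bug report
proposes. The point is that the second version carries four independent
dependency chains, so successive adds need not wait for each other.

```c
#include <stddef.h>

/* Plain reduction: every add depends on the result of the previous one,
   so the loop runs at the latency of one FP add per element. */
double sum_simple(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Same reduction with four independent partial sums (hypothetical factor).
   The four adds in the loop body do not depend on each other. */
double sum_split(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that the second version adds the elements in a different order, so in
general the rounding can differ from the first. That is why a compiler can only
do this transformation itself when FP reassociation is permitted, for example
with -ffast-math.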
