https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116109
--- Comment #3 from mjr19 at cam dot ac.uk ---
It might be helpful if GCC considered this optimisation separately from unrolling. Traditional unrolling attempts to reduce the overhead of the (integer) loop control instructions, but in floating point loops these integer instructions are usually issued for free alongside the FP ones, and the branch is fully predicted. So, since the processor advances of the mid-1990s, the gains from unrolling FP loops have become rather small, and GCC does not unroll them unless explicitly asked to. However, unrolling a reduction operation in order to break a data dependency is still very relevant, and can produce speed-ups of a factor of three or more. One might well expect this sort of unrolling at -O3 or even -O2, whereas the other sort might not be expected at -O3 (and currently is not enabled even at -Ofast).
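
To illustrate the distinction, here is a minimal hand-written sketch of the second kind of unrolling. It is not GCC output; the function names sum_simple and sum_split and the choice of four accumulators are assumptions made purely for illustration. In the simple loop every add has to wait for the previous one, whereas the split version keeps four independent dependency chains in flight.

```c
#include <stddef.h>

/* Plain reduction: each addition depends on the result of the previous
 * one, so the loop runs at the latency of one FP add per element. */
double sum_simple(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Reduction split across four accumulators (hypothetical unroll factor).
 * The four partial sums do not depend on each other, so their additions
 * can overlap in the FP pipeline. */
double sum_split(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder loop: handle the last n % 4 elements. */
    for (; i < n; i++)
        s0 += a[i];

    /* Combine the partial sums. */
    return (s0 + s1) + (s2 + s3);
}
```

Note that sum_split adds the elements in a different order from sum_simple, so the two can differ in rounding. That is why a compiler can only make this transformation itself when it is allowed to reassociate floating point arithmetic.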
