https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124434
--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Brian M. Sutin from comment #6)
> It's barfing up the pipeline on every loop iteration for only long double,
> and the -O1 optimizer knows how to fix the issue.
Yes.
For double we have at -O0:
```
.L3:
movsd -8(%rbp), %xmm0
mulsd -24(%rbp), %xmm0
movsd -32(%rbp), %xmm1
addsd %xmm1, %xmm0
movsd %xmm0, -8(%rbp)
addl $1, -12(%rbp)
.L2:
cmpl $999999999, -12(%rbp)
jle .L3
```
The SSE loads benefit from store-to-load forwarding (a load bypass), so the repeated load from `-8(%rbp)` on each iteration will do a decent job.