------- Comment #6 from ubizjak at gmail dot com 2008-11-17 18:11 -------
I think that
addps .LC10(%rip), %xmm0
mulps %xmm1, %xmm0
addps .LC11(%rip), %xmm0
mulps %xmm1, %xmm0
addps .LC12(%rip), %xmm0
mulps %xmm1, %xmm0
addps .LC13(%rip), %xmm0
mulps %xmm1, %xmm0
addps .LC14(%rip), %xmm0
mulps %xmm1, %xmm0
is the bottleneck. Perhaps we should split the implicit memory operands out of
the insn with some generic peephole (when a free register is available) and
schedule the loads appropriately.
OTOH, the loop optimizer should detect these invariant loads and move them out
of the loop.
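At the source level, the transformation amounts to keeping the invariant
constants in registers instead of re-reading them from memory on every
polynomial step. A minimal C sketch (the coefficients and function names are
hypothetical stand-ins for the .LC10-.LC14 constant pool entries, not taken
from the testcase):

```c
#include <assert.h>

/* Hypothetical coefficients standing in for .LC10 - .LC14. */
static const float c[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};

/* Memory-operand form: each step reloads c[i], like the
   addps .LCxx(%rip) memory operands in the insn sequence above. */
float horner_mem(float x)
{
    float r = 0.0f;
    for (int i = 0; i < 5; i++)
        r = (r + c[i]) * x;
    return r;
}

/* Hoisted form: the invariant loads are done once up front and the
   values kept in locals (i.e. registers), so the add/mul chain no
   longer touches memory. */
float horner_reg(float x)
{
    const float c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3], c4 = c[4];
    float r = 0.0f;
    r = (r + c0) * x;
    r = (r + c1) * x;
    r = (r + c2) * x;
    r = (r + c3) * x;
    r = (r + c4) * x;
    return r;
}
```

Both variants perform the identical sequence of float operations, so the
results match exactly; only the placement of the loads differs.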
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134