http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53340
--- Comment #2 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-05-14 09:49:22 UTC ---

If I understand the profiling correctly, the slowdown comes from the first
inlined function, minlst. The fast assembly is

L45:
	movss	(%r10), %xmm10
	leal	-1(%rsi), %edi
	movss	-4(%r10), %xmm11
	comiss	%xmm10, %xmm6
	movss	-8(%r10), %xmm12
	minss	%xmm10, %xmm6
	movss	-12(%r10), %xmm13
	cmova	%esi, %edx
	comiss	%xmm11, %xmm6
	minss	%xmm11, %xmm6
	cmova	%edi, %edx
	comiss	%xmm12, %xmm6
	minss	%xmm12, %xmm6
	leal	-2(%rsi), %edi
	cmova	%edi, %edx
	comiss	%xmm13, %xmm6
	leal	-3(%rsi), %edi
	minss	%xmm13, %xmm6
	cmova	%edi, %edx
	subl	$4, %esi
	subq	$16, %r10
	cmpl	%r8d, %esi
	jne	L45

while the slow one is

L39:
	movslq	%edx, %r9
	movss	-4(%rdi,%r9,4), %xmm9
	leal	-1(%r8), %r9d
	comiss	(%rbx), %xmm9
	cmova	%r8d, %edx
	movslq	%edx, %r14
	movss	-4(%rdi,%r14,4), %xmm10
	comiss	-4(%rbx), %xmm10
	cmova	%r9d, %edx
	leal	-2(%r8), %r9d
	movslq	%edx, %r14
	movss	-4(%rdi,%r14,4), %xmm11
	comiss	-8(%rbx), %xmm11
	cmova	%r9d, %edx
	leal	-3(%r8), %r9d
	movslq	%edx, %r14
	movss	-4(%rdi,%r14,4), %xmm12
	comiss	-12(%rbx), %xmm12
	cmova	%r9d, %edx
	subl	$4, %r8d
	subq	$16, %rbx
	cmpl	%r10d, %r8d
	jne	L39
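
For readers who prefer source to assembly, here is one possible reading of the two loops as a C sketch. It is not the actual minlst code from the test case: the function names minloc_cached and minloc_reload are made up for illustration, indices are 0-based rather than Fortran's 1-based, and the backward scan only mirrors the decrementing counters seen in both listings. The first variant keeps the running minimum in a local (compare the minss updates of %xmm6), while the second reloads the element at the current minimum index before every comparison (compare the movslq/movss pairs indexed through %edx), so each comparison may have to wait on the previous cmova.

```c
/* Hypothetical sketch, not the original Fortran source. */

/* Variant A: the running minimum lives in a local, so it can stay in a
   register; only the index is updated conditionally. */
int minloc_cached(const float *a, int n)
{
    int idx = n - 1;          /* index of the smallest element so far */
    float best = a[n - 1];    /* value of the smallest element so far */
    for (int i = n - 2; i >= 0; --i) {
        if (best > a[i]) {    /* strict comparison, as with comiss/cmova */
            idx = i;
            best = a[i];
        }
    }
    return idx;
}

/* Variant B: no cached value; a[idx] is loaded again on every iteration,
   so each load depends on the index chosen by the previous step. */
int minloc_reload(const float *a, int n)
{
    int idx = n - 1;
    for (int i = n - 2; i >= 0; --i) {
        if (a[idx] > a[i]) {
            idx = i;
        }
    }
    return idx;
}
```

Both variants return the same index for the same input (n must be at least 1); they differ only in whether the current minimum is carried in a variable or fetched from memory through the index.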