http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #7 from Uros Bizjak <ubizjak at gmail dot com> 2011-11-22 22:00:36 UTC ---
(In reply to comment #3)
> Your testcase doesn't resemble the original, the inner for cycles need
> clearing of the iteration variable.
Ah, indeed... fingers were too fast.
One additional data point with -O2 -ftree-vectorize -mfma4 -mavx with all
loops:
movslq %r8d, %rax
movl $C+32, %edx
xorl %esi, %esi
leaq B(,%rax,8), %rcx
movl $C, %eax
.L3:
>> vmovsd 80(%rcx), %xmm1
addl $2, %esi
vmovapd A(%rdi), %ymm0
>> vmovddup %xmm1, %xmm1
vbroadcastsd (%rcx), %ymm2
addq $160, %rcx
>> vinsertf128 $1, %xmm1, %ymm1, %ymm1
vfmaddpd (%rax), %ymm2, %ymm0, %ymm2
vmovapd %ymm2, (%rax)
addq $64, %rax
vfmaddpd (%rdx), %ymm1, %ymm0, %ymm0
vmovapd %ymm0, (%rdx)
addq $64, %rdx
cmpl $10, %esi
jne .L3
The three instructions marked with >> could be just a single
"vbroadcastsd 80(%rcx), %ymm1". For some reason the combine pass does not form
it.
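
To spell out why the marked sequence is equivalent to a single broadcast, here
is a rough scalar model in C of the lane semantics of the instructions involved.
It is only a sketch: the ymm_model type and the model_* helpers are hypothetical
names made up for illustration, not GCC internals or intrinsics, and the model
assumes the VEX-encoded forms (which zero the upper bits of the destination).

```c
#include <assert.h>

/* Scalar model of a 256-bit register holding four doubles.
   All names here are hypothetical helpers for illustration only. */
typedef struct { double lane[4]; } ymm_model;

/* vmovsd mem, %xmm: load one double into lane 0, zero the other lanes. */
static ymm_model model_vmovsd(const double *mem)
{
    ymm_model r = { { *mem, 0.0, 0.0, 0.0 } };
    return r;
}

/* vmovddup %xmm, %xmm: copy lane 0 into lanes 0 and 1; the VEX.128
   form zeroes the upper 128 bits. */
static ymm_model model_vmovddup(ymm_model src)
{
    ymm_model r = { { src.lane[0], src.lane[0], 0.0, 0.0 } };
    return r;
}

/* vinsertf128 $1, %xmm_lo, %ymm_base, %ymm_dst: keep the low 128 bits
   of base, replace the upper 128 bits with the low 128 bits of lo. */
static ymm_model model_vinsertf128_hi(ymm_model lo, ymm_model base)
{
    ymm_model r = { { base.lane[0], base.lane[1], lo.lane[0], lo.lane[1] } };
    return r;
}

/* vbroadcastsd mem, %ymm: copy one double into all four lanes. */
static ymm_model model_vbroadcastsd(const double *mem)
{
    ymm_model r = { { *mem, *mem, *mem, *mem } };
    return r;
}

/* The three instructions marked with >> in the listing, in order. */
static ymm_model three_insn_sequence(const double *mem)
{
    ymm_model x = model_vmovsd(mem);
    x = model_vmovddup(x);
    return model_vinsertf128_hi(x, x);
}

/* Returns 1 if both forms produce the same four lanes for this value. */
static int sequences_agree(double value)
{
    ymm_model a = three_insn_sequence(&value);
    ymm_model b = model_vbroadcastsd(&value);
    for (int i = 0; i < 4; i++)
        if (a.lane[i] != b.lane[i])
            return 0;
    return 1;
}

/* Small self-check over a few sample values. */
static void self_check(void)
{
    assert(sequences_agree(1.0));
    assert(sequences_agree(-2.5));
}
```

Calling self_check() from a test driver should pass silently. The model only
covers the data flow; it says nothing about why combine fails to merge the
patterns.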