For the following test case, prefetches are inserted for both the load and the
store of a[i] when the loop is vectorized, even though both access the same
address:
float a[1024], b[1024];

void foo(int beta)
{
  int i;
  for (i = 0; i < 1024; i++)
    a[i] = a[i] + beta * b[i];
}
Compiling with gcc -O3 -fprefetch-loop-arrays -march=amdfam10 -S gives the
following assembly (excerpt):
movaps (%rcx), %xmm0
addl $4, %edi
prefetcht0 (%rdx)
prefetcht0 240(%rcx)
prefetchw (%rdx)
leaq 64(%rax), %rsi
mulps %xmm1, %xmm0
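Here prefetcht0 (%rdx) and prefetchw (%rdx) target the same address, so one of
the two is redundant. If the loop is not vectorized, only one prefetch is
generated for a[i] (for the load), next to the one for b[i]: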
addl $16, %eax
salq $2, %rcx
mulss %xmm1, %xmm0
prefetcht0 a+92(%rcx)
prefetcht0 b+92(%rcx)
movl %esi, %ecx
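
For comparison, here is a hand-written sketch of the expected behaviour: one
prefetch per array reference, with the load and the store of a[i] sharing a
single write-intent prefetch. It uses GCC's __builtin_prefetch. The function
name foo_manual_prefetch, the prefetch distance AHEAD and the 64-byte cache
line size are assumptions made for illustration only; they are not the values
GCC's prefetching pass would choose.

```c
#define N 1024
/* Hypothetical prefetch distance in elements (assumption, not GCC's heuristic). */
#define AHEAD 64
/* Assumes a 64-byte cache line, i.e. 16 floats per line. */
#define FLOATS_PER_LINE 16

float a[N], b[N];

void foo_manual_prefetch(int beta)
{
  int i;
  for (i = 0; i < N; i++)
    {
      /* Issue prefetches once per cache line, and only while the
         prefetched element is still inside the arrays.  */
      if (i % FLOATS_PER_LINE == 0 && i + AHEAD < N)
        {
          /* a[i] is loaded and then stored: a single prefetch with
             write intent (rw = 1) covers both accesses.  */
          __builtin_prefetch (&a[i + AHEAD], 1, 3);
          /* b[i] is only loaded (rw = 0).  */
          __builtin_prefetch (&b[i + AHEAD], 0, 3);
        }
      a[i] = a[i] + beta * b[i];
    }
}
```

The point of the sketch is only that a[i] needs one prefetch, not two; the
computed result is the same as in foo.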
--
Summary: Redundant prefetches for the vectorized loop
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: changpeng dot fang at amd dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021