With

  gcc-4.6 -Ofast -funroll-all-loops -fno-tree-pre -mveclibabi=acml -m64 -march=amdfam10

sphinx3 runs 5% slower than with

  gcc-4.6 -Ofast -funroll-all-loops -fno-prefetch-loop-arrays -fno-tree-pre -mveclibabi=acml -m64 -march=amdfam10

The only difference between the two command lines is -fno-prefetch-loop-arrays.

Prefetching does not cause any slowdown if the vectorizer is turned off, or with -fno-fast-math. I believe the affected loops are those with reductions, whose vectorization was enabled by the following commit:

http://gcc.gnu.org/ml/gcc-cvs/2010-05/msg00277.html

--
           Summary: CPU2006 482.sphinx3: gcc4.6 5% regression from prefetching of vectorized loop
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: changpeng dot fang at amd dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391
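
For reference, a minimal sketch of the kind of loop meant above: a floating-point reduction, which the vectorizer can only handle when reassociation is allowed (as it is under -Ofast, which implies -ffast-math). The function name dist_sq and its arguments are hypothetical and are not taken from the sphinx3 sources; this illustrates the loop shape only and is not a confirmed reproducer for the regression.

```c
#include <stddef.h>

/* Hypothetical stand-in for a reduction loop of the kind discussed above.
 * Not taken from 482.sphinx3; the name and signature are assumptions. */
float dist_sq(const float *x, const float *mean, size_t n)
{
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        float d = x[i] - mean[i];
        /* Floating-point reduction: vectorizing this accumulation
         * reorders the additions, so it needs -ffast-math (or at
         * least -fassociative-math) in addition to the vectorizer. */
        sum += d * d;
    }
    return sum;
}
```

Compiling such a file with each of the two command lines above (plus -S) and comparing the assembly for prefetch instructions inside the vectorized loop would be one way to check whether this loop shape is affected.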