http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796
Yuri Rumyantsev <ysrumyan at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #2 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
This issue is not related to avx2 tuning but rather to the estimation of
vectorization profitability. Note that avx does not support "gathers", and so
the following innermost loop

    for (i=rowR; i<rowRp1; i++)
        sum += x[ col[i] ] * val[i];

is not vectorized. I did a simple experiment and found out that the iteration
count for it is 5 or 10 (for -large input), so vectorizing it for avx2 does not
look profitable, i.e. the scalar version should be faster.

If we slightly change this loop to

    int n = row[r+1] - row[r];
    int *col1 = col + row[r];
    for (i=0; i<n; i++)
        sum += x[ col1[i] ] * val[i];

i.e. set the lower bound to zero, the performance drop for avx2 disappears:

with avx
Sparse matmult  Mflops:  2135.59    (N=1000, nz=5000)

with avx2
Sparse matmult  Mflops:  2309.64    (N=1000, nz=5000)
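
For reference, the two loop shapes discussed above can be sketched as standalone C functions. This is only an illustration under stated assumptions: the function names row_dot_original and row_dot_rebased are hypothetical, the arrays are taken to be a CSR-style val/col pair indexed by row start and end, and the val1 pointer is an addition not present in the comment (there, val is indexed directly by i). Whether either form is vectorized, and whether that pays off, depends on the compiler version and flags.

```c
#include <stddef.h>

/* Original shape: the induction variable runs from rowR to rowRp1,
   so the loop has a nonzero lower bound. */
double row_dot_original(const double *val, const int *col, const double *x,
                        int rowR, int rowRp1)
{
    double sum = 0.0;
    for (int i = rowR; i < rowRp1; i++)
        sum += x[col[i]] * val[i];
    return sum;
}

/* Rebased shape: the base pointers are advanced by rowR and the loop
   counts from zero to n. */
double row_dot_rebased(const double *val, const int *col, const double *x,
                       int rowR, int rowRp1)
{
    int n = rowRp1 - rowR;
    const int *col1 = col + rowR;
    /* Assumption: val is rebased as well, so that both functions read
       the same elements; the comment above only rebases col. */
    const double *val1 = val + rowR;
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[col1[i]] * val1[i];
    return sum;
}
```

Both functions accumulate the same products in the same order, so they should return identical results for the same row; only the loop bounds seen by the vectorizer's cost model differ.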