https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85057
            Bug ID: 85057
           Summary: GCC fails to vectorize code unless dummy loop is added
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mokreutzer at gmail dot com
  Target Milestone: ---

I have a class which represents short vectors (1D, 2D, 3D) and does numeric
computations using the expression template engine PETE[1]. The attached example
is stripped down to support only 1D vectors, which is the simplest case but
still demonstrates the issue.

In my application, vector computations are executed in a loop which should be a
candidate for vectorization, as in:

    int const N = 100000;
    Vector<1, double> a[N];
    // initialize a
    for (int i=0; i<N; i++)
        a[i] = 0.5*a[i];

The PETE machinery causes each loop iteration to evaluate an expression in a
function evaluate(), which (for 1D vectors) looks like this:

    template <int N, typename T, typename Op, typename RHS>
    inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
    {
        op(lhs(0), forEach(rhs, EvalVectorLeaf<N>(0), OpCombine()));
    }

The issue is that GCC does not vectorize the loop above: the loop body compiles
to the scalar instruction "vmulsd xmm0, xmm1, QWORD PTR [rax]".

However, and this is the crux, GCC does vectorize the loop (emitting the packed
instruction "vmulpd ymm0, ymm1, YMMWORD PTR [rax]") if I add a seemingly
meaningless single-iteration dummy loop to the function body, as in:

    template <int N, typename T, typename Op, typename RHS>
    inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
    {
        for (int i=0; i<1; i++)
            op(lhs(i), forEach(rhs, EvalVectorLeaf<N>(i), OpCombine()));
    }

Attached is the code which does not vectorize. A vectorizing version can easily
be constructed by adding the loop as shown above.

g++ command line: g++ -O3 -mavx
System type: x86_64-pc-linux-gnu

[1]: The official website of PETE seems to be gone, but a mirror can be found
here: https://github.com/erdc/daetk/tree/master/pete/pete-2.1.0
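
For readers without the attachment, the shape of the two evaluate() variants can be sketched without PETE as follows. This is not the attached test case: Scaled, OpAssign, evaluate_direct, evaluate_looped, scale_direct, scale_looped and variants_agree are hypothetical names introduced only for illustration, and Scaled is a hand-written stand-in for the Expression<RHS> / forEach machinery. Whether this reduced version reproduces the vmulsd/vmulpd difference has not been verified; it only shows that the two variants are semantically identical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the reporter's Vector<N,T> (no PETE involved).
template <int N, typename T>
struct Vector {
    T data[N];
    T &operator()(int i) { return data[i]; }
    T const &operator()(int i) const { return data[i]; }
};

// Hypothetical assignment functor playing the role of Op.
struct OpAssign {
    template <typename T>
    void operator()(T &lhs, T const &rhs) const { lhs = rhs; }
};

// Hypothetical expression node for "factor * v", evaluated per component.
template <int N, typename T>
struct Scaled {
    T factor;
    Vector<N, T> const &v;
    T eval(int i) const { return factor * v(i); }
};

// Variant 1: component 0 is accessed directly (the non-vectorizing form in the report).
template <int N, typename T, typename Op>
inline void evaluate_direct(Vector<N, T> &lhs, Op const &op, Scaled<N, T> const &rhs) {
    op(lhs(0), rhs.eval(0));
}

// Variant 2: the same work wrapped in a single-iteration dummy loop.
template <int N, typename T, typename Op>
inline void evaluate_looped(Vector<N, T> &lhs, Op const &op, Scaled<N, T> const &rhs) {
    for (int i = 0; i < 1; i++)
        op(lhs(i), rhs.eval(i));
}

// Outer loops corresponding to "a[i] = 0.5*a[i]" in the report.
void scale_direct(Vector<1, double> *a, int n) {
    for (int i = 0; i < n; i++)
        evaluate_direct(a[i], OpAssign(), Scaled<1, double>{0.5, a[i]});
}

void scale_looped(Vector<1, double> *a, int n) {
    for (int i = 0; i < n; i++)
        evaluate_looped(a[i], OpAssign(), Scaled<1, double>{0.5, a[i]});
}

// Runs both variants on identical input and reports whether the results match.
bool variants_agree(int n) {
    assert(n >= 0);
    std::vector<Vector<1, double>> a(static_cast<std::size_t>(n));
    std::vector<Vector<1, double>> b(static_cast<std::size_t>(n));
    for (int i = 0; i < n; i++) {
        a[i](0) = i;
        b[i](0) = i;
    }
    scale_direct(a.data(), n);
    scale_looped(b.data(), n);
    for (int i = 0; i < n; i++) {
        if (a[i](0) != b[i](0) || a[i](0) != 0.5 * i)
            return false;
    }
    return true;
}
```

To compare code generation, the sketch can be compiled with "g++ -O3 -mavx -S" and the bodies of scale_direct and scale_looped inspected for vmulsd versus vmulpd; adding -fopt-info-vec-missed asks GCC to report loops it declined to vectorize.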