https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844
           Bug ID: 80844
          Summary: OpenMP SIMD doesn't know how to efficiently zero a
                   vector (it stores zeros and reloads)
          Product: gcc
          Version: 8.0
           Status: UNCONFIRMED
         Keywords: missed-optimization
         Severity: normal
         Priority: P3
        Component: target
         Assignee: unassigned at gcc dot gnu.org
         Reporter: peter at cordes dot ca
 Target Milestone: ---

float sumfloat_omp(const float arr[]) {
  float sum=0;
  #pragma omp simd reduction(+:sum) aligned(arr : 64)
  for (int i=0 ; i<1024 ; i++)
    sum = sum + arr[i];
  return sum;
}
// https://godbolt.org/g/6KnMXM

x86-64 gcc7.1 and gcc8-snapshot-20170520 with -mavx2 -ffast-math
-funroll-loops -fopenmp -O3 emit:

        # omitted integer code to align the stack by 32
        vpxor   %xmm0, %xmm0, %xmm0    # tmp119
        vmovaps %xmm0, -48(%rbp)       # tmp119, MEM[(void *)&D.2373]
        vmovaps %xmm0, -32(%rbp)       # tmp119, MEM[(void *)&D.2373]
        vmovaps -48(%rbp), %ymm8       # MEM[(float *)&D.2373], vect__23.20
        # then the loop

The store-forwarding-stall part of this is very similar to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

With gcc4/5/6, we get four integer stores like movq $0, -48(%rbp) before the
vector load.  Either way, this is ridiculous: vpxor already zeroed the whole
ymm register (a VEX-encoded write to an xmm register zero-extends into the
full ymm), so the store/reload is pure overhead, and reloading 256b from two
128b stores causes a store-forwarding stall.

It's also a missed optimization because -ffast-math allows 0.0+x to optimize
to x.  In this special case where the trip count is a compile-time constant,
the reduction could start off by simply loading the first vector instead of
adding it to zero.  (A hand-written sketch of both ideas is at the end of
this report.)

---

It's not just AVX 256b vectors, although it's far worse there.  With just
SSE2:

        pxor    %xmm0, %xmm0
        movaps  %xmm0, -24(%rsp)    # dead store
        pxor    %xmm0, %xmm0

Or, from older gcc versions, the same store-forwarding-stall-inducing
integer-store then vector-reload code.

-----

Even though -funroll-loops unrolls, it doesn't use multiple accumulators to
run separate dependency chains in parallel, so the loop still bottlenecks on
the 4-cycle latency of VADDPS instead of its 0.5c throughput.  (Those are
Intel Skylake numbers, and yes, the results are the same with
-march=skylake.)  clang uses 4 vector accumulators, so it runs 4x faster
when data is hot in L2 (up to 8x is possible with data hot in L1).  A sketch
of the multi-accumulator pattern is also at the end of this report.

Intel pre-Skylake has VADDPS latency=3c, throughput=1c, so there's still a
factor of three to be had there.  But Haswell's FMA has latency=5c,
throughput=0.5c, so you need 10 accumulators to saturate its max throughput
of 2 loads + 2 FMAs per clock.

Is there already an open bug for the missing multiple-accumulator unrolling?
I know it's totally separate from this issue.
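
For reference, here is a minimal hand-written AVX intrinsics sketch of what
the scalar-free setup could look like (the function name and the particular
horizontal-sum sequence are just illustrative, not proposed gcc output):

#include <immintrin.h>

// Sketch: no store/reload of the accumulator at all.  vxorps/vpxor on an
// xmm register already zero-extends into the full ymm, and since
// -ffast-math allows 0.0+x => x and the trip count is constant, the
// accumulator can simply start life as the first loaded vector.
float sumfloat_manual(const float *arr)   // assumes 32-byte alignment
{
    __m256 sum = _mm256_load_ps(arr);     // first vector, not 0.0 + arr[0..7]
    for (int i = 8; i < 1024; i += 8)
        sum = _mm256_add_ps(sum, _mm256_load_ps(arr + i));

    // any standard horizontal sum works here; this one needs SSE3
    __m128 v = _mm_add_ps(_mm256_castps256_ps128(sum),
                          _mm256_extractf128_ps(sum, 1));
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));
    v = _mm_add_ss(v, _mm_movehdup_ps(v));
    return _mm_cvtss_f32(v);
}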
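
And a sketch of the multi-accumulator pattern (roughly what clang does; the
function name and the choice of four accumulators are illustrative).  Four
independent dependency chains let four vaddps results be in flight at once
instead of serializing on a single accumulator:

// (same #include <immintrin.h> as above)
// Sketch only.  On Skylake (lat=4c, tput=0.5c) eight accumulators would be
// needed to fully saturate the FP add units; four already removes most of
// the latency bottleneck and matches clang's choice.
float sumfloat_4accs(const float *arr)    // assumes 32-byte alignment
{
    __m256 s0 = _mm256_load_ps(arr +  0);
    __m256 s1 = _mm256_load_ps(arr +  8);
    __m256 s2 = _mm256_load_ps(arr + 16);
    __m256 s3 = _mm256_load_ps(arr + 24);
    for (int i = 32; i < 1024; i += 32) {
        s0 = _mm256_add_ps(s0, _mm256_load_ps(arr + i +  0));
        s1 = _mm256_add_ps(s1, _mm256_load_ps(arr + i +  8));
        s2 = _mm256_add_ps(s2, _mm256_load_ps(arr + i + 16));
        s3 = _mm256_add_ps(s3, _mm256_load_ps(arr + i + 24));
    }
    __m256 sum = _mm256_add_ps(_mm256_add_ps(s0, s1),
                               _mm256_add_ps(s2, s3));
    __m128 v = _mm_add_ps(_mm256_castps256_ps128(sum),
                          _mm256_extractf128_ps(sum, 1));
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));
    v = _mm_add_ss(v, _mm_movehdup_ps(v));
    return _mm_cvtss_f32(v);
}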