https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844

            Bug ID: 80844
           Summary: OpenMP SIMD doesn't know how to efficiently zero a
                    vector (it stores zeros and reloads)
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

float sumfloat_omp(const float arr[]) {
  float sum=0;
  #pragma omp simd reduction(+:sum) aligned(arr : 64)
  for (int i=0 ; i<1024 ; i++)
    sum = sum + arr[i];
  return sum;
}
// https://godbolt.org/g/6KnMXM

x86-64 gcc7.1 and gcc8-snapshot-20170520 with -mavx2 -ffast-math -funroll-loops
-fopenmp -O3 emit:

        # omitted integer code to align the stack by 32
        vpxor   %xmm0, %xmm0, %xmm0      # tmp119
        vmovaps %xmm0, -48(%rbp)         # tmp119, MEM[(void *)&D.2373]
        vmovaps %xmm0, -32(%rbp)         # tmp119, MEM[(void *)&D.2373]
        vmovaps -48(%rbp), %ymm8         # MEM[(float *)&D.2373], vect__23.20
        # then the loop

The store-forwarding stall part of this is very similar to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

With gcc4/5/6, we instead get four integer stores like movq $0, -48(%rbp) before
the vector load.  Either way this is ridiculous: vpxor already zeroes the whole
ymm register (the VEX encoding zero-extends into the upper lane), so the
accumulator could simply be zeroed in a register, and reloading 32 bytes that
were written by narrower stores causes a store-forwarding stall.

It's also silly because -ffast-math allows 0.0+x to optimize to x.  The compiler
could start off by simply loading the first vector of data instead of adding it
to a zeroed accumulator, at least in this special case where the loop count is a
compile-time constant.
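
Roughly what one could do by hand with intrinsics (my own sketch, not gcc
output; the function name and the assumption of a 1024-element, 64-byte-aligned
array are just for illustration): seed the accumulator from the first vector of
data, so no zero vector is ever materialized and nothing is stored and reloaded:

#include <immintrin.h>

// Hand-written sketch: with a known trip count and -ffast-math semantics,
// the reduction can start from the first 8 floats instead of from 0.0.
float sumfloat_seeded(const float *arr)  // assumed 64-byte aligned, 1024 elements
{
  __m256 sum = _mm256_load_ps(arr);      // first vector seeds the accumulator
  for (int i = 8; i < 1024; i += 8)
    sum = _mm256_add_ps(sum, _mm256_load_ps(arr + i));

  float lane[8];                         // simple scalar horizontal sum, for clarity
  _mm256_storeu_ps(lane, sum);
  float total = 0.0f;
  for (int i = 0; i < 8; i++)
    total += lane[i];
  return total;
}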

---

It's not just AVX 256b vectors, although it's far worse there.

With just SSE2:

        pxor    %xmm0, %xmm0
        movaps  %xmm0, -24(%rsp)    # dead store
        pxor    %xmm0, %xmm0


Older gcc versions instead emit the same integer-store -> vector-reload code
that induces a store-forwarding stall.

-----


Even though -funroll-loops unrolls, it doesn't use multiple accumulators to run
separate dependency chains in parallel.  So it still bottlenecks on the 4-cycle
latency of VADDPS instead of its 0.5c throughput (example numbers for Intel
Skylake; the results are the same with -march=skylake).  clang uses 4 vector
accumulators, so it runs 4x faster when data is hot in L2 (up to 8x is possible
with data hot in L1).  Intel pre-Skylake has VADDPS latency=3c, throughput=1c,
so there's still a factor of three to be had.  But Haswell has FMA lat=5c,
tput=0.5c, so you need 10 accumulators to saturate the max throughput of 2
loads + 2 FMAs per clock.
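
For reference, a hand-unrolled version with four accumulators (again my own
sketch, not clang's or gcc's actual output; names and the fixed 1024-element
count are illustrative) looks like this.  The four vaddps dependency chains are
independent, so the loop can approach load/add throughput instead of being
serialized on vaddps latency:

#include <immintrin.h>

// Four independent accumulators hide the VADDPS latency; assumes the
// element count (1024) is a multiple of 32 and arr is 64-byte aligned.
float sumfloat_4acc(const float *arr)
{
  __m256 s0 = _mm256_setzero_ps();
  __m256 s1 = _mm256_setzero_ps();
  __m256 s2 = _mm256_setzero_ps();
  __m256 s3 = _mm256_setzero_ps();

  for (int i = 0; i < 1024; i += 32) {
    s0 = _mm256_add_ps(s0, _mm256_load_ps(arr + i));
    s1 = _mm256_add_ps(s1, _mm256_load_ps(arr + i + 8));
    s2 = _mm256_add_ps(s2, _mm256_load_ps(arr + i + 16));
    s3 = _mm256_add_ps(s3, _mm256_load_ps(arr + i + 24));
  }

  __m256 sum = _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3));

  // combine the 8 lanes
  __m128 lo = _mm_add_ps(_mm256_castps256_ps128(sum),
                         _mm256_extractf128_ps(sum, 1));
  lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));
  lo = _mm_add_ss(lo, _mm_movehdup_ps(lo));
  return _mm_cvtss_f32(lo);
}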

Is there already an open bug for this?  I know it's totally separate from this
issue.
