https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

            Bug ID: 109587
           Summary: Deeply nested loop unrolling overwhelms register
                    allocator
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

On matrix multiplication routines such as 

#include <arm_neon.h>

template<int N, int M, int K>
void f(const float32_t *__restrict a, const float32_t *__restrict b, float32_t
*c) {
    for (int i = 0; i < N; ++i) {
        for (int j=0; j < M; ++j) {
            for (int k=0; k < K; ++k) {
                c[i*M + j] += a[i*K + k] * b[k*M + j];
            }
        }
    }
}

template void f<16, 16, 16>(const float32_t *__restrict a, const float32_t
*__restrict b, float32_t *c);

the loop unroller fully unrolls the inner loop because the iteration count of
16 is below the threshold.  But especially because the accumulation into
c[i*M + j] is a read-modify-write (RMW) operation, the fully unrolled body
runs out of registers and we spill heavily.

The inner loop can be split in two, but this requires manual work on each GEMM
kernel.
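
For illustration, a minimal sketch of what the manual split could look like
(assuming "split in two" means halving the k loop, each half accumulating into
its own local; f_split is a hypothetical name, and float32_t comes from the
same <arm_neon.h> include as above):

template<int N, int M, int K>
void f_split(const float32_t *__restrict a, const float32_t *__restrict b,
             float32_t *c) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < M; ++j) {
            /* One local accumulator per half keeps the RMW on c down to a
               single load/store per element and halves each unrolled body. */
            float32_t acc0 = 0.0f, acc1 = 0.0f;
            for (int k = 0; k < K / 2; ++k)
                acc0 += a[i*K + k] * b[k*M + j];
            for (int k = K / 2; k < K; ++k)
                acc1 += a[i*K + k] * b[k*M + j];
            c[i*M + j] += acc0 + acc1;
        }
    }
}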

Perhaps the loop unroller can use a better heuristic here?
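
For experimenting with that heuristic, the existing --param knob that caps
complete peeling can be tightened so the early unroller leaves the
16-iteration body alone, e.g. (the value 100 is arbitrary, for illustration
only):

  g++ -O3 --param max-completely-peeled-insns=100 gemm.cc

but that is a global workaround rather than a fix of the heuristic itself.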

It also looks like adding a pragma

template<int N, int M, int K>
void f(const float32_t *__restrict a, const float32_t *__restrict b, float32_t
*c) {
    for (int i = 0; i < N; ++i) {
        for (int j=0; j < M; ++j) {
#pragma GCC unroll 8
            for (int k=0; k < K; ++k) {
                c[i*M + j] += a[i*K + k] * b[k*M + j];
            }
        }
    }
}

helps, but because this blocks cunrolli and the unrolling instead happens in
RTL, we lose scheduling and end up with less efficient code.

So can we do better here with early unrolling?
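
For reference, the pass dumps should show where the copies are made (standard
GCC dump flags; gemm.cc stands in for the testcase above):

  g++ -O3 -fdump-tree-cunrolli-details -fdump-rtl-loop2_unroll gemm.cc

If the pragma indeed pushes the unrolling to RTL, the copies show up in the
loop2_unroll dump rather than in cunrolli, matching the scheduling loss
described above.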
