https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

Avinash Jayakar <avinashd at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |avinashd at gcc dot gnu.org

--- Comment #40 from Avinash Jayakar <avinashd at gcc dot gnu.org> ---
I'm interested in solving this bug.

Here are my observations on matrix multiplication in GCC with the -O3
flag, for the following loop nest (which has delinearized array
accesses):

#define N 1024                    /* size and declarations assumed; */
double A[N][N], B[N][N], C[N][N]; /* elided in the original snippet */

void matmul()
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i][j] += A[i][k] * B[k][j];
}

With -O3, the most important transformation GCC performs here is loop
interchange, which makes the final code faster than Clang's (without
Polly), but still nowhere near BLAS performance.
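For reference, a minimal sketch of what the interchange gives here
(assuming the usual ijk -> ikj exchange, so the two innermost accesses
become unit-stride):

void matmul_interchanged()
{
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)     /* j and k exchanged */
      for (int j = 0; j < N; j++)   /* C[i][j], B[k][j] now stride-1 */
        C[i][j] += A[i][k] * B[k][j];
}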

But in recent years there has been work on reaching close to BLAS-like
performance purely through compilation:
1.
https://github.com/bondhugula/llvm-project/blob/hop/mlir/docs/HighPerfCodeGen.md
2. https://arxiv.org/abs/2305.18236

Also, there are now specialized matrix units in CPUs, like Arm's SME,
Intel's AMX and PowerPC's MMA, which can be used once a loop is
identified as a matrix multiply. So would adding a matmul operator in
GIMPLE/RTL make sense, like the llvm.matrix.multiply intrinsic, where
each backend could have its own lowering path to implement the
micro-kernel in an optimized way?
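For comparison, a rough sketch of how that intrinsic is exposed on the
Clang side through its matrix-types extension (compiled with
-fenable-matrix; the * operator on matrix types is lowered to
llvm.matrix.multiply):

typedef float m4x4_t __attribute__((matrix_type(4, 4)));

m4x4_t mul4x4(m4x4_t a, m4x4_t b)
{
  /* lowered by Clang to the llvm.matrix.multiply intrinsic */
  return a * b;
}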

I believe that with Graphite it is possible to achieve performance
close to BLAS, as described in those articles. The only missing part is
packing after tiling is done, and I am not sure if GCC supports that.
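To make "packing" concrete, here is a minimal sketch (the tile sizes
KC/NC and the helper name are illustrative, not anything GCC emits):
after tiling, a KC x NC tile of B is copied into a contiguous buffer so
the micro-kernel reads it with unit stride and keeps it hot in cache:

enum { KC = 64, NC = 64 };     /* illustrative tile sizes */
static double Bpack[KC * NC];  /* contiguous packing buffer */

static void pack_B(int k0, int j0)
{
  /* copy tile B[k0..k0+KC)[j0..j0+NC) into row-major Bpack;
     assumes the tile lies fully inside B */
  for (int k = 0; k < KC; k++)
    for (int j = 0; j < NC; j++)
      Bpack[k * NC + j] = B[k0 + k][j0 + j];
}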

Anyone more knowledgeable about the Graphite code could help clarify
this.
