> Can you please elaborate on that? I looked into your code and since you're
> using intrinsics you're dependent on the mercy of the compiler to schedule
> everything right, is the assembler code you're talking about worse than what
> the compiler achieves? Or do you use some faster algorithm to perform the
> matrix multiplication?
To answer all three questions:
> since you're using intrinsics you're dependent on the mercy of the compiler
> to schedule everything right
I use intrinsics precisely because the compiler cannot derive the best schedule
on its own: explicit intrinsics are what avoid leaving you at the mercy of the
compiler. At least for SIMD intrinsics, compilers do respect what is written in
C/C++/Nim and there is no need to go down to assembly (which is not true for
BigInt or crypto intrinsics, however). You do need the `restrict`/`{.noalias.}`
pragma and prefetch intrinsics as well to reach state-of-the-art performance.
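To illustrate, here is a minimal C sketch (not taken from Laser) of what `restrict` plus prefetch hints look like in practice. The AXPY kernel, the prefetch distance, and the hint frequency are all arbitrary choices for the example; `__builtin_prefetch` is the GCC/Clang builtin:

```c
#include <stddef.h>

/* Hypothetical AXPY kernel. `restrict` promises the compiler that x and y
   never alias, so it can vectorize without emitting runtime overlap checks.
   __builtin_prefetch (GCC/Clang) hints that a future cache line will be
   needed; the distance of 64 floats ahead is an arbitrary example value. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i) {
        if ((i & 15) == 0) {              /* one hint per 64-byte line of floats */
            __builtin_prefetch(&x[i + 64]);
            __builtin_prefetch(&y[i + 64]);
        }
        y[i] = a * x[i] + y[i];           /* maps to an FMA when available */
    }
}
```

Without `restrict`, the compiler must assume `y` may overlap `x` and either emit runtime checks or give up on vectorizing.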
> is the assembler code you're talking about worse than what the compiler
> achieves?
No, the assembler code I'm talking about is the fruit of dozens of PhD theses
and thousands of man-hours spent optimizing matrix multiplication.
Compilers are still playing catch-up at automatically generating efficient matrix
multiplication kernels; nowadays they even do pattern matching and swap in the
manually optimized assembly.[1]
References
(<https://github.com/numforge/laser/blob/d1e6ae6/laser/primitives/matrix_multiplication/gemm_tiling.nim#L12-L30>)
# Papers:
# [1] Anatomy of High-Performance Matrix Multiplication (Revised)
# Kazushige Goto, Robert A. Van de Geijn
# - http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
#
# [2] Anatomy of High-Performance Many-Threaded Matrix Multiplication
# Smith et al
# - http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
#
# [3] Automating the Last-Mile for High Performance Dense Linear Algebra
# Veras et al
# - https://arxiv.org/pdf/1611.08035.pdf
#
# [4] GEMM: From Pure C to SSE Optimized Micro Kernels
# Michael Lehn
# - http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html
#
# Laser wiki - GEMM optimization resources
# - https://github.com/numforge/laser/wiki/GEMM-optimization-resources
That file has a lot of details on how to schedule high-performance compute
kernels that compilers can't match at the moment. Note that those sizes assume
no hyperthreading, because with hyperthreading a physical core has only half the
L1 and L2 cache available per panel, leading to many more flushes/reloads.
# We have to take into account vectorisation
# so that the microkernel can be processed with vectorized intrinsics.
#
# Caux [mr, nr] must stay in register.
# - mr ~= nr is optimal to amortize register load cost
#   - some registers must be left to prefetch Ã and B̃ (PackedA and PackedB)
# - nr >= (flops/cycle) / (bytes/cycle) * sizeof(element)
#
# For example Haswell is capable of
# - 32 single-precision FLOPs/cycle
# - 32 bytes/cycle store and 64 bytes/cycle load (store C, load A and B)
#
# so nr >= 32/32 * 4
# For that number of FLOPs it must schedule
# 2xFMAs, consuming 16 single-precision floats,
# so mr*nr >= 16
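As a sanity check, the arithmetic above can be spelled out in plain C. The Haswell figures are the ones quoted in the comment, not measured here:

```c
/* Haswell figures from the comment above:
   32 FP32 FLOPs/cycle (2 FMA ports x 8 lanes x 2 flops per FMA)
   and 32 bytes/cycle of store bandwidth. */
int nr_lower_bound(void) {
    int flops_per_cycle = 32;
    int store_bytes_per_cycle = 32;
    return flops_per_cycle / store_bytes_per_cycle * (int)sizeof(float); /* 4 */
}

int mr_nr_lower_bound(void) {
    int fmas_per_cycle = 2;
    int fp32_per_simd = 8;                 /* AVX/AVX2: 8 x FP32 per register */
    return fmas_per_cycle * fp32_per_simd; /* 16 accumulators to keep busy */
}
```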
# Registers constraints and micro-kernel tuning
# - To issue 2xFMAs in parallel we need to use 2x SIMD registers
# - We want to hold C of size MR * NR completely in SIMD registers as well,
#   as each value is reused k times during accumulation C[i, j] += A[i, k] * B[k, j]
# - We should have enough SIMD registers left to hold
#   the corresponding sections of A and B (at least 4: 2xA and 2xB for FMAs)
#
# On x86-64 with X SIMD registers that can issue 2xFMAs per cycle:
# - NbVecs is 2 minimum
# - RegsPerVec = 2 * NbVecs => 4 minimum (for A and for B)
# - NR = NbVecs * NbScalarsPerSIMD
# - C: MR*NR and uses MR*NbVecs SIMD registers
# - MR*NbVecs + RegsPerVec <= X
# -> MR*NbVecs + 2 * NbVecs <= X
# -> (MR+2) * NbVecs <= X
#
# Some solutions:
# - AVX with 16 registers:
# - MR = 6, NbVecs = 2
# FP32: 8xFP32 per SIMD --> NR = 2x8
# ukernel = 6x16
# FP64, ukernel = 6x8
# - MR = 2, NbVecs = 4
# FP32: 8xFP32 per SIMD --> NR = 4x8
# ukernel = 2x32
# FP64, ukernel = 2x16
# - AVX512 with 32 registers
# - MR = 6, NbVecs = 4
# FP32 ukernel = 6x64
# FP64 ukernel = 6x32
# - MR = 2, NbVecs = 8
# FP32 ukernel = 2x128
# FP64 ukernel = 2x64
# - MR = 14, NbVecs = 2
# FP32 ukernel = 14x32
# FP64 ukernel = 14x16
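The register-budget inequality can be checked mechanically against the solutions listed above. A small helper of my own (not Laser's) for the (MR+2) * NbVecs <= X constraint:

```c
#include <stdbool.h>

/* True if a micro-kernel shape fits in X SIMD registers:
   MR*NbVecs registers for the C accumulator tile
   + 2*NbVecs registers for the A and B operands of the 2 in-flight FMAs. */
bool ukernel_fits(int mr, int nb_vecs, int x_registers) {
    return (mr + 2) * nb_vecs <= x_registers;
}

/* NR in scalars, e.g. scalars_per_simd = 8 for FP32 with AVX. */
int ukernel_nr(int nb_vecs, int scalars_per_simd) {
    return nb_vecs * scalars_per_simd;
}
```

For AVX (X = 16) this accepts MR = 6, NbVecs = 2 and MR = 2, NbVecs = 4, exactly the shapes listed above, and rejects anything larger.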
# The inner microkernel loop does:
# AB[m][n] = A[m] * B[n]
# so n should be the vector dimension
# if most matrices are row-major.
# This avoids dealing with a transpose
# in the inner loop and an un-transpose in the epilogue
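A scalar sketch of that inner loop (illustrative C, not the actual Nim kernel), with packed A laid out [kc][MR] and packed B laid out [kc][NR] so that n walks contiguous memory:

```c
enum { MR = 2, NR = 4 };   /* tiny illustrative tile, not a tuned shape */

/* AB[m][n] += A[m] * B[n], accumulated over the kc dimension.
   The innermost n loop touches contiguous elements of b_packed and ab,
   which is what lets the compiler (or explicit intrinsics) vectorize it. */
void microkernel(int kc, const float *a_packed, const float *b_packed,
                 float ab[MR][NR]) {
    for (int k = 0; k < kc; ++k)
        for (int m = 0; m < MR; ++m)
            for (int n = 0; n < NR; ++n)
                ab[m][n] += a_packed[k * MR + m] * b_packed[k * NR + n];
}
```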
# ## Panel sizes
# - TLB constraint
#   TÃ + 2(TBj + TCj) ≤ T
#   Note: optimizing the similar problem mc*kc/(2mc + 2kc)
#   under the constraint that mc*kc ≤ K is the problem
#   of maximizing the area of a rectangle
#   while minimizing the perimeter.
#
# Goto paper [1] section 6.3: choosing kc
# - kc should be as large as possible to amortize the mr*nr updates of Cj
# - Elements from Bj [kc, nr] must remain in L1 cache.
# - kc * nr should occupy less than half the L1 cache
#   so that Ã and Caux do not evict elements of Bj
# - Ã [kc, mc] should occupy
#   a considerable fraction of the L2 cache
# In our experience the optimal choice is for "kc" float64 elements to occupy half a page
# -> a page is 4096 bytes = 512 float64 -> half a page = 256
#
# Goto paper [1] section 6.3: choosing mc
# - mc*kc should fill a considerable part of (1) the memory addressable
#   by the TLB and (2) the L2 cache
# In practice mc is chosen so that A occupies about half the smaller of (1) and (2)
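Those heuristics reduce to a few lines of arithmetic. A sketch with assumed sizes (the 4096-byte page matches the comment; the 256 KiB L2 in the test is an assumption for the example):

```c
enum { PAGE_BYTES = 4096 };

/* kc: half a page worth of float64 elements, per the heuristic above. */
int choose_kc(void) {
    return PAGE_BYTES / (int)sizeof(double) / 2;   /* 512 / 2 = 256 */
}

/* mc: Ã [mc, kc] should fill about half of min(TLB-reachable, L2).
   limit_bytes is the smaller of those two, assumed known or queried. */
int choose_mc(int kc, int limit_bytes) {
    return (limit_bytes / 2) / (kc * (int)sizeof(double));
}
```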
> Or do you use some faster algorithm to perform the matrix multiplication?
No, I use the default O(n^3) algorithm. Asymptotically faster algorithms like
Strassen's may be faster in theory, but implementation "details" already make a
well-tuned default algorithm 100x~1000x faster than a naive implementation.
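For reference, the naive baseline that the tuned O(n^3) kernels beat by two to three orders of magnitude is just the textbook triple loop:

```c
/* Textbook O(n^3) matrix multiplication, row-major, C = A * B.
   Same asymptotic complexity as the tuned kernels; the 100x~1000x gap
   comes purely from packing, tiling, and register blocking. */
void naive_gemm(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}
```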
I've collated further research into high performance computing optimization in:
* Convolution Optimisation Resources:
<https://github.com/numforge/laser/wiki/Convolution-optimisation-resources> or
<https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md>
* Nested loop automated scheduling:
<https://github.com/numforge/laser/blob/master/research/automatic_loop_nest_scheduling.md>
Includes:
* Expert heuristics driven scheduling
* Machine Learning approaches
* Polyhedral compilation
* Brute-Force search
[1] Regarding pattern matching matrix multiplication: Intel ICC detects triple
nested for-loops and swaps them with calls to its assembly library, MKL. LLVM
Polly (the polyhedral optimizer for nested loops) detects triple nested
for-loops and applies the BLIS transformation instead of automated polyhedral
scheduling (<https://reviews.llvm.org/D21491>). Laser's code architecture is
based on BLIS at a high level as well.