The following figure is part of the results of profiling a performance-critical piece of code:
<https://lh6.googleusercontent.com/-3dfRngVQZGo/VK3BTDrwPaI/AAAAAAAABuc/2zMSWI12e9w/s1600/profile.png>

* The pink section corresponds to a single line, `G = - G * C`, where `G` and `C` are moderate-size N x N matrices; this line is called O(N) times.
* The brown block is `gemm_wrapper!`.
* The green block is the `-` operation in array.jl.

QUESTIONS:

- Is the `-` operation really more costly than the matrix multiplication? Why?
- What is happening in the part of the pink block that has no block above it?
- Closely related: from which matrix size onwards are matrix-vector multiplications best performed in BLAS rather than with for-loops?

Many thanks,
Christoph
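To make the profiled line concrete: in Julia, `- G * C` parses as `(-G) * C`, so each call first allocates a negated copy of `G` (the `-` from array.jl) and then a second array for the product. The sketch below contrasts that with a preallocated variant that folds the sign into a single multiply. It assumes Julia 1.3 or later for the 5-argument `mul!(out, A, B, α, β)`; `iterate_alloc` and `iterate_inplace` are hypothetical helper names, and the iteration count `n` is illustrative. It is not meant as a diagnosis of the profile, only as a way to separate the allocation from the multiplication.

```julia
using LinearAlgebra

# Mirrors the original line: each iteration allocates -G, then allocates -G*C.
function iterate_alloc(G0, C, n)
    G = copy(G0)
    for _ in 1:n
        G = - G * C
    end
    return G
end

# Hypothetical preallocated variant (assumes Julia >= 1.3):
# mul!(buf, G, C, -1.0, 0.0) computes buf = -1.0 * G * C + 0.0 * buf
# in one call, with no temporary for the negation.
function iterate_inplace(G0, C, n)
    G = copy(G0)
    buf = zeros(eltype(G0), size(G0))
    for _ in 1:n
        mul!(buf, G, C, -1.0, 0.0)  # buf must not alias G or C
        G, buf = buf, G             # swap buffers instead of copying
    end
    return G
end

G0 = rand(100, 100)
C = rand(100, 100) ./ 100           # scaled so repeated products stay bounded
G_alloc = iterate_alloc(G0, C, 10)
G_inplace = iterate_inplace(G0, C, 10)
```

Swapping the two buffers keeps the loop allocation-free after setup; timing both functions (or profiling them again) should show whether the `-` block disappears from the flame graph.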

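For the last question, the crossover size depends on the CPU, the BLAS build and its thread count, so it is probably easiest to measure on the target machine. The sketch below compares a hand-written column-major loop with `mul!` (which dispatches to BLAS `gemv` for `Float64` arrays). `matvec_loop!` and `besttime` are hypothetical helpers, the sizes are arbitrary, and `@elapsed` is a crude timer for very small operations; a package such as BenchmarkTools.jl would give more reliable numbers.

```julia
using LinearAlgebra

# Naive y = A*x; the outer loop runs over columns so the inner loop
# walks A contiguously (Julia arrays are column-major).
function matvec_loop!(y, A, x)
    m, n = size(A)
    fill!(y, zero(eltype(y)))
    @inbounds for j in 1:n
        xj = x[j]
        for i in 1:m
            y[i] += A[i, j] * xj
        end
    end
    return y
end

# Hypothetical timing helper: best of `reps` runs of f(), after one
# warm-up call so compilation time is excluded.
function besttime(f, reps)
    f()
    return minimum(@elapsed(f()) for _ in 1:reps)
end

for N in (4, 16, 64, 256)
    A = rand(N, N); x = rand(N)
    y1 = zeros(N); y2 = zeros(N)
    t_loop = besttime(() -> matvec_loop!(y1, A, x), 1000)
    t_blas = besttime(() -> mul!(y2, A, x), 1000)
    println("N = $N: loop $(t_loop) s, BLAS $(t_blas) s")
end
```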