Short answer: as John said, BLAS does the right thing internally for different array layouts, whether you're transposing one input, the other, or both.
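For instance, in the 0.3-era Julia of this thread, transposed products never materialize a transposed copy; the parser lowers them to special methods that just pass a transpose flag down to BLAS (a sketch; the `.'` syntax and the `Ac_mul_B`/`At_mul_B` lowering are specific to pre-1.0 Julia):

```julia
# Sketch, assuming 0.3-era Julia: A'*x is lowered by the parser to
# Ac_mul_B(A, x), which calls BLAS gemv with a transpose flag instead
# of building a transposed copy of A.
A = rand(1000, 1000)
x = rand(1000)

y1 = A * x     # gemv with trans = 'N'
y2 = A' * x    # lowered to Ac_mul_B(A, x): gemv with trans = 'T' (real A)
y3 = A.' * x   # lowered to At_mul_B(A, x): gemv with trans = 'T'
```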
Long answer: download this cheat sheet: http://www.netlib.org/blas/blasqr.ps Print it out and pin it to your office wall. This is BLAS; mathematically and as an API, it fits on one page. (LAPACK is much bigger; that one can be a messy pile on your desk.) You don't need to use this Fortran-style API if you're used to working at a high level in NumPy, Matlab, or Julia, but it's there for you in Base.LinAlg.BLAS once you learn how it works.

I'm going to assume in the following that you are asking about dense matrices with homogeneous element types (Float32, Float64, Complex64, or Complex128), and that your matrices are large enough that they don't completely fit in your processor's cache.

There are 3 levels of BLAS operations:

BLAS1 means O(n) work on O(n) data: things like scaling a vector or adding two vectors together. Mostly just a for loop, not that interesting.

BLAS2 means O(n^2) work on O(n^2) data, like multiplying a matrix by a vector, or back-substituting A x = b when A is a triangular matrix. You're doing more work here, and you want to be careful to loop through the array in the most efficient order, since accessing adjacent indices is better for the cache. There's not much more optimization you can do, though, because you have to go through the whole matrix and do only a small amount of work on each element.

BLAS3 means O(n^3) work on O(n^2) data, like matrix-matrix multiplication. This is where the interesting work happens in an optimized BLAS implementation. Julia uses OpenBLAS by default, except in the Mac binaries, where Apple's Accelerate framework is used. On modern processors, a floating-point operation is actually very cheap; the dominant cost is moving data between memory and cache, and between cache and the processor. In BLAS3 operations, since you're doing multiple calculations per entry of the input matrices, there is major potential for speedup by carefully arranging the calculation to make the most effective use of the cache.
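To make the three levels concrete, here's a sketch using the raw wrappers directly (assuming 0.3-era Julia, where they live in Base.LinAlg.BLAS; in later versions the module is LinearAlgebra.BLAS, and the sizes here are just illustrative):

```julia
import Base.LinAlg.BLAS

n = 2000
x = rand(n); y = rand(n)
A = rand(n, n); B = rand(n, n)

BLAS.axpy!(2.0, x, y)               # BLAS1: y := 2.0*x + y, O(n) work
z = BLAS.gemv('N', 1.0, A, x)       # BLAS2: z := A*x, O(n^2) work
C = BLAS.gemm('N', 'N', 1.0, A, B)  # BLAS3: C := A*B, O(n^3) work
```

The high-level operators `*`, `'`, etc. end up in these same routines; calling them directly mostly matters when you want the in-place `!` variants to avoid allocating the result.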
You want to do as much work as possible on the data already in cache without having to go fetch some other entry of the matrix. This is done with careful tiling, blocking, and copying one piece of the matrix at a time, with liberal use of SIMD instructions for the best possible performance on any given type of processor. The result ends up looking nothing like the naive for loop that the basic mathematical statement of matrix multiplication suggests. It calculates the same answer in the end (subject to roundoff error), but gets there by a very roundabout way for the sake of speed. It's a hard problem, but one that a good BLAS implementation solves well.

On Monday, July 14, 2014 3:55:07 PM UTC-7, Andrei Zh wrote:
>
> Linear transformation is a super-frequent operation in scientific
> computations, so I'd like to know how to perform it in the most efficient
> way. Often a linear transformation is written as:
>
>     r = A * x
>
> where A is the transformation matrix and x is a data vector. I don't know
> how exactly matrix-by-vector multiplication is implemented in Julia, but I
> believe for vectors it reduces to something like this (please correct me
> if this is algorithmically wrong):
>
>     r = zeros(size(x))
>     for i = 1:length(x)
>         r[i] = dot(A[i, :], x)
>     end
>
> The important part here is "A[i, :]", that is, the transformation matrix
> is accessed row by row. Since matrices in Julia are stored column-wise,
> row-wise access is pretty inefficient.
>
> I thought about a trick with multiplication order. Instead of calculating
> "A * x" I can store a transposed version of A, say B = A', and compute the
> transformation as follows:
>
>     r = (x' * B)'
>
> B here is accessed column by column, so it should work well.
>
> But what if instead of a single vector "x" we have a data matrix "X"? In
> this case both "A * X" and "X * A" are inefficient, since one of these
> matrices will be accessed row by row.
> In NumPy I would just set C order (row-wise) of elements for the first
> matrix and Fortran order (column-wise) for the second one, but AFAIK
> there's no such option in Julia (is there?). So what's the most efficient
> way to perform the linear transformation defined by matrix A on a data
> matrix X?
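To answer the question at the end directly (a sketch, again assuming 0.3-era Julia): `A * X` already dispatches to BLAS gemm, which does its own tiling and copying internally, so neither operand ends up accessed in a cache-hostile order; and if one operand is naturally stored transposed, the lowered transpose methods pass a flag to gemm rather than building a copy:

```julia
A = rand(500, 500)
X = rand(500, 2000)   # each column is one data vector

R1 = A * X    # gemm('N', 'N', ...): cache-blocked internally by BLAS
R2 = A' * X   # lowered to Ac_mul_B(A, X): gemm('T', 'N', ...), no copy of A
```

So there's no need for the `(x' * B)'` trick, and no need for a per-matrix storage-order option: the transpose flags on gemv/gemm cover the same ground.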
