Wow, excellent explanation! Thanks for all your answers! And yes, I'm definitely going to print out the BLAS cheat sheet :)
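In case it helps anyone who finds this thread later, here is roughly how I understand the direct calls into Base.LinAlg.BLAS that you describe below would look. Just a sketch, untested, and the sizes are made up:

    A = rand(1000, 1000)
    x = rand(1000)
    X = rand(1000, 500)

    # y = A * x via the BLAS2 routine gemv:  y <- alpha*A*x + beta*y
    y = zeros(1000)
    Base.LinAlg.BLAS.gemv!('N', 1.0, A, x, 0.0, y)

    # R = A * X via the BLAS3 routine gemm:  R <- alpha*A*X + beta*R
    R = zeros(1000, 500)
    Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, X, 0.0, R)

If I read the cheat sheet right, passing 'T' instead of 'N' for either argument makes BLAS work with the transpose of that matrix internally, without ever building a transposed copy.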
(By the way, I'm really impressed by how friendly and helpful the Julia community is. I hope it stays that way when it grows N times larger!)

On Tue, Jul 15, 2014 at 4:09 AM, Tony Kelman <[email protected]> wrote:
> Short answer is, like John said, BLAS does the right thing internally for different array layouts, whether you're transposing one or the other or both inputs, etc.
>
> Long answer:
>
> Download this cheat sheet:
>
> http://www.netlib.org/blas/blasqr.ps
>
> Print it out, pin it to your office wall. This is BLAS; mathematically and as an API it fits on one page. (LAPACK is much bigger; that one can be a messy pile on your desk.) You don't need to use this Fortran-style API if you're used to doing things at a high level in NumPy or Matlab or Julia, but it's there for you to access in Base.LinAlg.BLAS once you learn how it works.
>
> I'm going to assume in the following that you are asking about dense matrices with homogeneous element types, either Float32, Float64, Complex64, or Complex128, and that your matrices are large enough that they don't completely fit in your processor's cache.
>
> There are 3 levels of BLAS operations. BLAS1 means O(n) work on O(n) data, things like scaling and adding vectors together. Mostly just a for loop, not that interesting.
>
> BLAS2 means O(n^2) work on O(n^2) data, like multiplying a matrix by a vector, or back-substituting A x = b when A is a triangular matrix. You're doing more work here, and you want to be careful to loop through the array in the most efficient order, since accessing adjacent indices is better for the cache. There's not too much optimization you can do here, though, since you need to go through the whole matrix and do only a small amount of work on each element.
>
> BLAS3 means O(n^3) work on O(n^2) data, like matrix-matrix multiplication. This is where the interesting work happens in making an optimized BLAS implementation - Julia uses OpenBLAS by default, except in the Mac binaries, where Apple's Accelerate is used. On modern processors, doing a floating-point operation is actually very cheap. The dominant cost is moving the data between memory and cache, and between cache and the processor. In BLAS3 operations, since you're doing multiple calculations per entry of the input matrices, you have major potential for speeding things up by carefully arranging the calculation to make the most effective use of the cache. You want to do as much work as possible on the same set of data without having to go fetch some other entry of the matrix.
>
> This is done with careful tiling and blocking and copying one piece of the matrix at a time, with liberal use of SIMD instructions for the best possible performance on any given type of processor. It ends up looking nothing like the naive for loop that the basic mathematical statement of matrix multiplication suggests. It calculates the same answer in the end (subject to roundoff error), but gets there by a very roundabout way for the sake of speed. It's a hard problem, but one that is well solved by a good BLAS implementation.
>
>
> On Monday, July 14, 2014 3:55:07 PM UTC-7, Andrei Zh wrote:
>>
>> A linear transformation is a super-frequent operation in scientific computing, so I'd like to know how to perform it in the most efficient way. Often a linear transformation is written as:
>>
>> r = A * x
>>
>> where A is the transformation matrix and x is a data vector.
>> I don't know exactly how matrix-by-vector multiplication is implemented in Julia, but I believe for vectors it reduces to something like this (please correct me if it is algorithmically wrong):
>>
>> r = zeros(size(x))
>> for i = 1:length(x)
>>     r[i] = dot(A[i, :], x)
>> end
>>
>> The important part here is "A[i, :]", that is, the transformation matrix is accessed row by row. Since matrices in Julia are stored column-wise, row-wise access is pretty inefficient.
>>
>> I thought about a trick with the multiplication order. Instead of calculating "A * x" I can store a transposed version of A, say B = A', and compute the transformation as follows:
>>
>> r = (x' * B)'
>>
>> B here is accessed column by column, so it should work well.
>>
>> But what if instead of a single vector "x" we have a data matrix "X"? In this case both "A * X" and "X * A" are inefficient, since one of these matrices will be accessed row by row. In NumPy I would just set C order (row-wise) of elements for the first matrix and Fortran order (column-wise) for the second one, but AFAIK there's no such option in Julia (is there?). So what's the most efficient way to perform a linear transformation defined by matrix A on a data matrix X?
>>
>
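
So, if I've understood the answers correctly, for my original question the plain high-level code already does the right thing, and there are also special methods for transposed products that avoid making a copy. A rough sketch of what I'll use (untested, and the dimensions are invented for illustration):

    A = rand(200, 200)    # transformation matrix
    X = rand(200, 10000)  # data matrix, one observation per column

    # Plain multiplication already calls the optimized BLAS3 gemm,
    # which deals with the column-major layout internally.
    R = A * X

    # If the data were stored one observation per row instead, these
    # methods multiply by the transpose without actually building a
    # transposed copy of the array:
    Xrows = rand(10000, 200)
    R2 = A_mul_Bt(A, Xrows)   # A * Xrows'
    R3 = At_mul_B(A, X)       # A' * X

Please correct me if I've misread how these methods work.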

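And just to convince myself about the column-major storage point, I'm planning to time something like the two loops below (my own toy example, not from this thread); the column-order version should be noticeably faster once the matrix no longer fits in cache:

    function sum_by_columns(A)
        s = 0.0
        # inner loop walks down a column: adjacent memory, cache-friendly
        for j = 1:size(A, 2), i = 1:size(A, 1)
            s += A[i, j]
        end
        s
    end

    function sum_by_rows(A)
        s = 0.0
        # inner loop walks along a row: strided, cache-unfriendly access
        for i = 1:size(A, 1), j = 1:size(A, 2)
            s += A[i, j]
        end
        s
    end

    A = rand(2000, 2000)
    @time sum_by_columns(A)
    @time sum_by_rows(A)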