Short answer: as John said, BLAS does the right thing internally for 
different array layouts, whether you're transposing one input, the other, 
or both.
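For example, multiplying by a transpose never requires physically rearranging the data: the gemm routine takes per-argument transpose flags and just reads the original storage in the order needed. A quick sketch with Julia's BLAS bindings (the matrix sizes here are arbitrary):

```julia
A = rand(4, 3)
X = rand(4, 5)

# gemm('T', 'N', alpha, A, X) computes alpha * A' * X without ever
# forming A' in memory; BLAS traverses A in the other order instead
R = Base.LinAlg.BLAS.gemm('T', 'N', 1.0, A, X)

# the high-level equivalent; same 3x5 result up to roundoff
R2 = A' * X
```

The same 'N'/'T' flags show up on gemv and most of the other routines.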

Long answer:

Download this cheat sheet: http://www.netlib.org/blas/blasqr.ps Print it 
out and pin it to your office wall. This is BLAS; mathematically and as an 
API, it fits on one page. (LAPACK is much bigger; that one can be a messy 
pile on your desk.) You don't need to use this Fortran-style API if you're 
used to doing things at a high level in NumPy or Matlab or Julia, but it's 
there for you to access in Base.LinAlg.BLAS once you learn how it works.
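As a taste of that API, here's the gemv routine (matrix-vector multiply) called directly; for dense floating-point arrays the high-level `A * x` ends up in the same place, so this is only worth doing once you want the extra control:

```julia
A = rand(3, 3)
x = rand(3)

# gemv computes alpha * op(A) * x, where the first argument picks
# op = 'N' (A as stored) or 'T' (A transposed)
y = Base.LinAlg.BLAS.gemv('N', 1.0, A, x)

# agrees with A * x up to roundoff
```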

I'm going to assume in the following that you are asking about dense 
matrices with homogenous element types, either Float32, Float64, Complex64, 
or Complex128. And that your matrices are large enough that they don't 
completely fit in your processor's cache.

There are three levels of BLAS operations. BLAS1 means O(n) work on O(n) 
data: things like scaling a vector or adding two vectors together. That's 
mostly just a for loop, not that interesting.
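A representative BLAS1 routine is axpy, which computes y <- a*x + y in place; it really is just the obvious loop:

```julia
a = 2.0
x = rand(5)
y = rand(5)

yb = copy(y)
Base.LinAlg.BLAS.axpy!(a, x, yb)   # yb <- a*x + yb, one pass, in place

# the equivalent hand-written loop: O(n) work on O(n) data
yl = copy(y)
for i = 1:length(yl)
    yl[i] += a * x[i]
end
# yb and yl agree up to roundoff
```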

BLAS2 means O(n^2) work on O(n^2) data, like multiplying a matrix by a 
vector, or back-substituting A x = b when A is a triangular matrix. You're 
doing more work here, and you want to be careful to loop through the array 
in the most efficient order, since accessing adjacent indices is friendlier 
to the cache. There's not much optimization you can do beyond that, though, 
since you need to go through the whole matrix and do only a small amount of 
work on each element.
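In Julia's column-major layout, the efficient order for a matrix-vector product is column by column: instead of taking a dot product along each row, accumulate each column's contribution into the result. A sketch:

```julia
A = rand(100, 100)
x = rand(100)

r = zeros(size(A, 1))
# outer loop over columns; the inner loop walks down one column of A,
# so consecutive iterations touch adjacent memory locations
for j = 1:size(A, 2)
    for i = 1:size(A, 1)
        r[i] += A[i, j] * x[j]
    end
end
# matches A * x (which calls gemv for you) up to roundoff
```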

BLAS3 means O(n^3) work on O(n^2) data, like matrix-matrix multiplication. 
This is where the interesting work happens in making an optimized BLAS 
implementation. (Julia uses OpenBLAS by default, except in the Mac 
binaries, which use Apple's Accelerate.) On modern processors, doing a 
floating-point operation is actually very cheap. The dominant cost is 
moving the data between memory and cache, and between cache and the 
processor. In BLAS3 operations, since you're doing multiple calculations 
per entry of the input matrices, you have major potential for speeding 
things up by carefully arranging the calculation to make most effective use 
of the cache. You want to do as much work as possible on that same set of 
data without having to go fetch some other entry of the matrix.

This is done with careful tiling and blocking and copying one piece of the 
matrix at a time, with liberal use of SIMD instructions for the best 
possible performance on any given type of processor. It ends up looking 
nothing like the naive for loop that the basic mathematical statement of 
matrix multiplication says it should be. It calculates the same answer in 
the end (subject to roundoff error), but gets there by a very roundabout 
way for the sake of speed. It's a hard problem, but well-solved by a good 
BLAS implementation.
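To make the contrast concrete, here's a toy blocked multiply. A real BLAS tiles for multiple cache levels and registers and layers SIMD on top of that, so treat this purely as a sketch of the idea (square matrices assumed; the block size bs is a made-up tuning knob):

```julia
function matmul_blocked!(C, A, B, bs)
    n = size(A, 1)
    fill!(C, zero(eltype(C)))
    # visit B one bs-by-bs tile at a time, so the tile stays resident
    # in cache while we reuse every entry of it
    for jj = 1:bs:n, kk = 1:bs:n
        for j = jj:min(jj + bs - 1, n)
            for k = kk:min(kk + bs - 1, n)
                bkj = B[k, j]
                # innermost loop runs down a column of A and C:
                # contiguous memory in Julia's column-major layout
                for i = 1:n
                    C[i, j] += A[i, k] * bkj
                end
            end
        end
    end
    return C
end
```

It computes the same C = A*B as the naive triple loop (up to roundoff), just in an order that reuses each loaded piece of data many times before moving on.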


On Monday, July 14, 2014 3:55:07 PM UTC-7, Andrei Zh wrote:
>
> Linear transformation is super-frequent operation in scientific 
> computations, so I'd like to know how to perform it in a most efficient 
> way. Often linear transformation is written as: 
>
>     r = A * x
>
> where A is transformation matrix and x is a data vector. I don't know how 
> exactly matrix-by-vector multiplication is implemented in Julia, but I 
> believe for vectors it reduces to something like this (please correct me if 
> this is algorithmically wrong): 
>     
>     r = zeros(size(x))
>     for i=1:length(x)
>        r[i] = dot(A[i, :], x)
>     end
>
> Important part here is "A[i, :]", that is, transformation matrix is 
> accessed row by row. Since matrices in Julia are stored column-wise, 
> row-wise access is pretty inefficient. 
>
> I thought about a trick with multiplication order. Instead of calculating 
> "A * x" I can store transformed version of A, say, B = A', and compute 
> transformation as follows: 
>
>     r = (x' * B)'
>
> B here is accessed column by column, so it should work well. 
>
> But what if instead of single vector "x" we have data matrix "X"? In this 
> case both - "A * X" and "X * A" are inefficient, since one of these 
> matrices will be accessed row by row. In NumPy I would just set C order 
> (row-wise) of elements to the first matrix and Fortran order (column-wise) 
> for the second one, but AFAIK there's no such option in Julia (is it?). So 
> what's the most efficient way to perform linear transformation defined by 
> matrix A on data matrix X? 
>
