Wow, excellent explanation! Thanks for all your answers! And yes, I'm
definitely going to print out the BLAS cheat sheet :)

(By the way, I'm really impressed by how friendly and helpful the Julia
community is. I hope it stays the same when it grows N times larger!)


On Tue, Jul 15, 2014 at 4:09 AM, Tony Kelman <[email protected]> wrote:

> The short answer is, like John said, that BLAS does the right thing
> internally for different array layouts, whether you're transposing one or
> the other or both inputs, etc.
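>
> For example, here's a quick sketch (array sizes made up, using the gemm!
> wrapper from Base.LinAlg.BLAS) of how a transposed product maps to a
> transpose flag rather than an explicit copy:
>
>     A = rand(1000, 1000); X = rand(1000, 200)
>     R1 = A' * X                        # high-level version, dispatches to BLAS
>     R2 = zeros(1000, 200)
>     # gemm!(tA, tB, alpha, A, B, beta, C) computes C = alpha*op(A)*op(B) + beta*C,
>     # where 'T' means "read this argument transposed"
>     Base.LinAlg.BLAS.gemm!('T', 'N', 1.0, A, X, 0.0, R2)
>     # R1 and R2 agree up to roundoff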
>
> Long answer:
>
> Download this cheat sheet: http://www.netlib.org/blas/blasqr.ps Print it
> out and pin it to your office wall. This is BLAS; mathematically and as an
> API, it fits on one page. (LAPACK is much bigger; that can be a messy pile
> on your desk.) You don't need to use this Fortran-style API if you're used
> to doing things at a high level in NumPy or Matlab or Julia, but it's there
> for you to access in Base.LinAlg.BLAS once you learn how it works.
>
> I'm going to assume in the following that you are asking about dense
> matrices with homogeneous element types, either Float32, Float64, Complex64,
> or Complex128. And that your matrices are large enough that they don't
> completely fit in your processor's cache.
>
> There are 3 levels of BLAS operations: BLAS1 means O(n) work on O(n) data,
> things like scaling and adding vectors together. Mostly just a for loop,
> not that interesting.
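>
> For instance (just a sketch, using the axpy! wrapper, one of the BLAS1
> routines exposed in Base.LinAlg.BLAS):
>
>     x = rand(10^6); y = rand(10^6)
>     # y = 2.5*x + y, updated in place: O(n) work on O(n) data
>     Base.LinAlg.BLAS.axpy!(2.5, x, y)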
>
> BLAS2 means O(n^2) work on O(n^2) data, like multiplying a matrix by a
> vector, or back-substituting A x = b when A is a triangular matrix. You're
> doing more work here and you want to be careful that you loop through the
> array in the most efficient order; accessing adjacent indices is better for
> the cache. There's not too much optimization you can do here though, since
> you need to go through the whole matrix and do only a small amount of work
> on each element.
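>
> To illustrate what "efficient order" means for a matrix-vector product
> (the function name below is made up, and gemv already does this for you;
> this is only to show the access pattern):
>
>     # r = A*x with the innermost loop walking straight down a column of A
>     function matvec_bycols(A, x)
>         m, n = size(A)
>         r = zeros(m)
>         for j = 1:n            # one column of A per outer iteration
>             xj = x[j]
>             for i = 1:m        # contiguous, column-major access
>                 r[i] += A[i, j] * xj
>             end
>         end
>         return r
>     end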
>
> BLAS3 means O(n^3) work on O(n^2) data, like matrix-matrix multiplication.
> This is where the interesting work happens in making an optimized BLAS
> implementation - Julia uses OpenBLAS by default, except in the Mac binaries
> where Apple's Accelerate is used. On modern processors, doing a
> floating-point operation is actually very cheap. The dominant cost is
> moving the data between memory and cache, and between cache and the
> processor. In BLAS3 operations, since you're doing multiple calculations
> per entry of the input matrices, you have major potential for speeding
> things up by carefully arranging the calculation to make most effective use
> of the cache. You want to do as much work as possible on that same set of
> data without having to go fetch some other entry of the matrix.
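>
> To put rough numbers on that (assuming two 1000x1000 Float64 matrices):
> each input is about 8 MB, but the multiplication performs roughly 2*10^9
> floating-point operations, so every entry gets reused on the order of a
> thousand times. That reuse is exactly what an optimized BLAS3 kernel is
> built around.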
>
> This is done with careful tiling and blocking and copying one piece of the
> matrix at a time, with liberal use of SIMD instructions for the best
> possible performance on any given type of processor. It ends up looking
> nothing like the naive for loop that the basic mathematical statement of
> matrix multiplication says it should be. It calculates the same answer in
> the end (subject to roundoff error), but gets there by a very roundabout
> way for the sake of speed. It's a hard problem, but well-solved by a good
> BLAS implementation.
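>
> Purely to illustrate the blocking idea (this is nowhere near what OpenBLAS
> really does, and the block size and function name below are invented for
> the example):
>
>     # toy blocked C = A*B, assuming square matrices with n divisible by bs
>     function blocked_mul!(C, A, B, bs)
>         n = size(A, 1)
>         fill!(C, 0.0)
>         for jj = 1:bs:n, kk = 1:bs:n, ii = 1:bs:n
>             # work on one bs-by-bs tile at a time so it stays in cache
>             for j = jj:jj+bs-1, k = kk:kk+bs-1
>                 bkj = B[k, j]
>                 for i = ii:ii+bs-1        # contiguous column access
>                     C[i, j] += A[i, k] * bkj
>                 end
>             end
>         end
>         return C
>     end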
>
>
> On Monday, July 14, 2014 3:55:07 PM UTC-7, Andrei Zh wrote:
>>
>> Linear transformation is a super-frequent operation in scientific
>> computing, so I'd like to know how to perform it in the most efficient
>> way. Often a linear transformation is written as:
>>
>>     r = A * x
>>
>> where A is the transformation matrix and x is a data vector. I don't know
>> exactly how matrix-by-vector multiplication is implemented in Julia, but I
>> believe for vectors it reduces to something like this (please correct me if
>> this is algorithmically wrong):
>>
>>     r = zeros(size(A, 1))
>>     for i = 1:size(A, 1)
>>         r[i] = dot(vec(A[i, :]), x)   # row i of A dotted with x
>>     end
>>
>> The important part here is "A[i, :]", that is, the transformation matrix is
>> accessed row by row. Since matrices in Julia are stored column-wise,
>> row-wise access is pretty inefficient.
>>
>> I thought about a trick with multiplication order. Instead of calculating
>> "A * x" I can store a transposed version of A, say B = A', and compute the
>> transformation as follows:
>>
>>     r = (x' * B)'
>>
>> B here is accessed column by column, so it should work well.
>>
>> But what if instead of a single vector "x" we have a data matrix "X"? In
>> this case both "A * X" and "X * A" are inefficient, since one of these
>> matrices will be accessed row by row. In NumPy I would just set C order
>> (row-wise) of elements for the first matrix and Fortran order (column-wise)
>> for the second one, but AFAIK there's no such option in Julia (is there?). So
>> what's the most efficient way to perform linear transformation defined by
>> matrix A on data matrix X?
>>
>
