Unless I am completely mistaken, you could also rearrange the terms in the matrix multiply to do the multiplication completely column-wise for both (there's no reason you must compute r[i] all at once, and you already have a perfect location for storing a running sum if needed: the output array itself).
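To make that concrete, here is a minimal sketch of the column-wise rearrangement (the function name `colwise_matvec` is just illustrative, not an actual Base function):

```julia
# Column-wise matrix-vector product: instead of r[i] = dot(A[i, :], x),
# accumulate each column of A into r. The output array itself holds the
# partial sums, so no extra temporary is needed.
function colwise_matvec(A, x)
    m, n = size(A)
    r = zeros(eltype(A), m)
    for j in 1:n              # walk A column by column (its storage order)
        xj = x[j]
        for i in 1:m
            r[i] += A[i, j] * xj   # contiguous, cache-friendly access to A
        end
    end
    return r
end

A = [1.0 2.0; 3.0 4.0]
x = [5.0, 6.0]
colwise_matvec(A, x) == A * x   # same answer, different order of operations
```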
However, that would ignore the effect of the CPU's memory caches on this operation. Pictorially, for the 2x2 A*x case, we can compute using any combination of the following options:

|A B|   |X|   |A*X|   |B*Y|
|C D| * |Y| = |C*X| + |D*Y|

            = |A*X+B*Y|
              |C*X+D*Y|

#= or a crazy option: =#
            = |A*X+B*Y| + |  0|
              |    D*Y|   |C*X|

All of these are simple term rearrangements (unless I made a typo), but they are intended to show various orders of operations which achieve the same answer. If I have understood correctly, an optimized BLAS actually ends up "blocking" the matrix and doing the matrix multiply in small chunks (rather than strictly column-wise or row-wise) to achieve maximal efficiency. These sorts of tricks are why a BLAS implementation is highly specific to a CPU, and why you end up with a massive OpenBLAS shared library when it is compiled with dynamic architecture support (the Julia default), as compared to without that support.

Looking at the technical side, to figure out the optimal combination you have to understand the impact of very low-level hardware details, such as the size of your CPU's cache line (usually 64 bytes on modern hardware). When your computer requests a byte of memory, what it actually gets is an entire cache line. Accessing the data in the right order is faster simply because it uses all of the data in that cache line.

On Mon, Jul 14, 2014 at 6:56 PM, John Myles White <[email protected]> wrote:

> Basic linear algebra is done by BLAS and LAPACK, so I don't think these
> issues cause performance problems.
>
> -- John
>
> On Jul 14, 2014, at 6:55 PM, Andrei Zh <[email protected]> wrote:
>
> Linear transformation is a super-frequent operation in scientific
> computations, so I'd like to know how to perform it in the most efficient
> way. Often a linear transformation is written as:
>
>     r = A * x
>
> where A is the transformation matrix and x is a data vector.
> I don't know
> how exactly matrix-by-vector multiplication is implemented in Julia, but I
> believe for vectors it reduces to something like this (please correct me if
> this is algorithmically wrong):
>
>     r = zeros(size(x))
>     for i = 1:length(x)
>         r[i] = dot(A[i, :], x)
>     end
>
> The important part here is "A[i, :]", that is, the transformation matrix is
> accessed row by row. Since matrices in Julia are stored column-wise,
> row-wise access is pretty inefficient.
>
> I thought about a trick with multiplication order. Instead of
> calculating "A * x" I can store a transposed version of A, say, B = A', and
> compute the transformation as follows:
>
>     r = (x' * B)'
>
> B here is accessed column by column, so it should work well.
>
> But what if instead of a single vector "x" we have a data matrix "X"? In
> this case both "A * X" and "X * A" are inefficient, since one of these
> matrices will be accessed row by row. In NumPy I would just set C order
> (row-wise) of elements for the first matrix and Fortran order (column-wise)
> for the second one, but AFAIK there's no such option in Julia (is it?). So
> what's the most efficient way to perform the linear transformation defined
> by matrix A on a data matrix X?
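For the data-matrix question above, here is a rough, hand-rolled illustration of the blocking idea. To be clear, the block size and loop order here are guesses for illustration, not what OpenBLAS actually does; the point is only that the product can be computed in cache-sized chunks while the innermost loop still runs down a column of both A and R:

```julia
# Naive cache-blocked R = A * X. An optimized BLAS would choose the block
# size `bs` (and the loop nesting) to match the CPU's actual cache sizes;
# bs here is an arbitrary illustrative value.
function blocked_matmul(A, X; bs = 8)
    m, k = size(A)
    k2, n = size(X)
    k == k2 || error("dimension mismatch")
    R = zeros(eltype(A), m, n)
    for jj in 1:bs:n, pp in 1:bs:k, ii in 1:bs:m   # iterate over blocks
        for j in jj:min(jj + bs - 1, n), p in pp:min(pp + bs - 1, k)
            xpj = X[p, j]
            for i in ii:min(ii + bs - 1, m)
                R[i, j] += A[i, p] * xpj   # innermost loop walks down a column
            end
        end
    end
    return R
end

A = reshape(collect(1.0:12.0), 3, 4)
X = reshape(collect(1.0:8.0), 4, 2)
blocked_matmul(A, X; bs = 2) == A * X   # same result as the built-in product
```

In practice, of course, the answer to the original question is simply to call `A * X` and let BLAS do this tuning for you.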
