Jutho, you may want to check out generic_matmatmul deep inside Base; it implements a cache-friendly multiplication algorithm. It's possible you'll find it faster still.
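In case it helps build intuition, here is a minimal sketch of the blocking idea (just an illustration, not Base's actual generic_matmatmul; the block size of 64 is an arbitrary guess that would need tuning per machine):

    # Minimal sketch of a cache-blocked (tiled) multiply: C = A*B.
    # Not Base's generic_matmatmul; bs = 64 is an untuned guess.
    function blocked_mul!(C, A, B, bs=64)
        n, m = size(C)
        k = size(A, 2)
        fill!(C, zero(eltype(C)))
        for jj in 1:bs:m, ll in 1:bs:k, ii in 1:bs:n
            # Work on one small tile at a time, so that the three tiles
            # involved stay resident in cache while they are reused.
            for j in jj:min(jj+bs-1, m), l in ll:min(ll+bs-1, k), i in ii:min(ii+bs-1, n)
                C[i, j] += A[i, l] * B[l, j]
            end
        end
        return C
    end

Each tile of C is updated while the corresponding tiles of A and B are still hot in cache, which is what keeps the cost per element roughly flat as n grows.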
--Tim

On Tuesday, February 11, 2014 08:43:30 AM Jiahao Chen wrote:
> What you are seeing in going from n<500 to n>500 is likely that the
> problem is too large to fit in (L3?) cache: a single 500x500 Float64
> matrix takes up 500*500*8 bytes ~ 2 MB, so three matrices occupy
> about 6 MB and start to fill up a 6-8 MB cache. Writing matrix
> multiplies that fit nicely into cache is the kind of kernel people
> spend a lot of time optimizing. The ATLAS library in particular made
> explicit use of the observation that cache locality is important for
> performance to generate a lot of specialized linear algebra kernels:
>
> http://web.eecs.utk.edu/~rwhaley/ATL/INDEX.HTM
>
> Thanks,
>
> Jiahao Chen, PhD
> Staff Research Scientist
> MIT Computer Science and Artificial Intelligence Laboratory
>
> On Tue, Feb 11, 2014 at 8:09 AM, Jutho <[email protected]> wrote:
> > Dear Julia users/developers,
> >
> > For reasons that I will not go into right now, I wanted to
> > investigate how much one loses by implementing matrix
> > multiplication in native Julia rather than calling BLAS. For C, I
> > found the following reference:
> > http://www.stanford.edu/~jacobm/matrixmultiply.html
> > which shows that the speedup of BLAS levels off at about a factor
> > of 4 or 5 over a plain C implementation without any special tricks
> > (except perhaps the -O2 -funroll-loops passed to the compiler).
> > While that page gives few details, I assume the BLAS matrix
> > multiplication was run single-threaded. I don't know whether the
> > factor of 4 is a coincidence, i.e. whether it has anything to do
> > with BLAS's use of SIMD instructions.
> >
> > I wanted to make the same comparison in native Julia, so I created
> > the following IJulia notebook to show the results:
> > http://nbviewer.ipython.org/gist/Jutho/8934314
> >
> > I took 5 timings for each of a range of matrix dimensions (all
> > square matrices). The first two plots show the Julia (respectively
> > BLAS) timings divided by the cube of the matrix dimension, which
> > should approach a constant. This is the case for BLAS, and also
> > for Julia, but the Julia curve shows an initial plateau for matrix
> > dimensions around 50-400 and then a sudden jump to a much larger
> > value in the range 600-2000.
> >
> > The third plot shows the speedup of BLAS (single-threaded) over
> > the Julia algorithm, which also reproduces the big jump around
> > matrix dimension 500. For smaller matrices the speedup is on the
> > order of 10 to 15, larger than, but within an order of magnitude
> > of, the C comparison on the cited webpage. For matrix dimensions
> > above 500, however, the Julia version becomes about a factor of
> > 100 slower than BLAS. Is this to be expected?
> >
> > Best regards,
> >
> > Jutho
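For reference, the naive kernel being timed in the notebook is presumably something like the triple loop below (a sketch under that assumption, not the notebook's actual code); note that loop order matters a lot in Julia, since arrays are stored column-major:

    # Naive triple-loop multiply, presumably similar to what the
    # notebook benchmarks (an assumption; not the actual code).
    # The j-l-i order keeps the innermost index i running down
    # columns, the friendly direction for column-major arrays.
    function naive_mul!(C, A, B)
        n, m = size(C)
        k = size(A, 2)
        fill!(C, zero(eltype(C)))
        for j in 1:m, l in 1:k, i in 1:n
            C[i, j] += A[i, l] * B[l, j]
        end
        return C
    end

    # Jiahao's cache arithmetic: one 500x500 Float64 matrix is
    500 * 500 * 8 / 2^20    # ~1.9 MiB, so three matrices ~5.7 MiB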
