What you are seeing in going from n < 500 to n > 500 is most likely the problem becoming too large to fit in (L3?) cache: a single 500x500 Float64 matrix takes up ~500*500*8 bytes ~ 2 MB, so three such matrices occupy about 6 MB and start to overflow a typical 6-8 MB cache. Matrix multiplies that tile nicely into cache are exactly the kind of kernel people spend a lot of time optimizing. The ATLAS library in particular made explicit use of the observation that cache locality is critical for performance to generate many specialized linear algebra kernels:
http://web.eecs.utk.edu/~rwhaley/ATL/INDEX.HTM
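
For concreteness, here is a rough sketch of the kind of comparison involved (my own illustrative code, not taken from the notebook; the function names and the block size of 64 are arbitrary choices). The j-k-i loop order keeps the innermost loop running down columns, matching Julia's column-major storage, and the blocked variant shows the sort of tiling ATLAS automates so that each working block fits in cache:

function naive_mul!(C, A, B)
    n, m = size(A)
    p = size(B, 2)
    fill!(C, zero(eltype(C)))
    @inbounds for j in 1:p, k in 1:m, i in 1:n
        C[i, j] += A[i, k] * B[k, j]   # innermost loop walks down columns of A and C
    end
    return C
end

function blocked_mul!(C, A, B; bs = 64)
    n = size(A, 1)                     # square matrices assumed, for brevity
    fill!(C, zero(eltype(C)))
    @inbounds for jj in 1:bs:n, kk in 1:bs:n, ii in 1:bs:n
        # multiply one bs-by-bs tile at a time so the A, B, C blocks stay in cache
        for j in jj:min(jj+bs-1, n), k in kk:min(kk+bs-1, n), i in ii:min(ii+bs-1, n)
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    return C
end

n = 600                                # just past the point where three matrices overflow a ~6 MB cache
A = rand(n, n); B = rand(n, n); C = zeros(n, n)
@time naive_mul!(C, A, B)              # plain Julia loops
@time blocked_mul!(C, A, B)            # cache-blocked Julia loops
@time A * B                            # dispatches to BLAS (dgemm)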

Thanks,

Jiahao Chen, PhD
Staff Research Scientist
MIT Computer Science and Artificial Intelligence Laboratory

On Tue, Feb 11, 2014 at 8:09 AM, Jutho <[email protected]> wrote:
> Dear Julia users/developers,
>
> for reasons that I will not go into right now, I wanted to investigate how
> much one loses by defining matrix multiplication within Julia over using
> the BLAS functionality. Doing this in C, I found the following reference:
> http://www.stanford.edu/~jacobm/matrixmultiply.html
> which shows that the speed gain of BLAS stagnates at about a factor of 4 or 5
> over the native C algorithm without any special tricks (except for maybe the
> -O2 -funroll-loops specified to the compiler). While this page is very brief
> on detail, I assume that the BLAS matrix multiplication was run
> single-threaded. I don't know if there is any coincidence in the factor 4,
> i.e. whether this has anything to do with the use of SIMD instructions in
> BLAS.
>
> I wanted to do the same comparison using native Julia, so I created the
> following IJulia notebook to show the results:
> http://nbviewer.ipython.org/gist/Jutho/8934314
>
> I repeated 5 timings for a range of matrix dimensions (all square matrices).
> The first two plots show the Julia timings (respectively BLAS timings)
> divided by the matrix dimension to the power 3, which should stagnate to a
> constant. This is the case for BLAS, and also for Julia, but there seems to
> be a first stagnation around matrix dimension 50-400, and then a sudden jump
> to a much larger value in the range 600-2000.
>
> The third plot shows the speedup of BLAS (single-threaded) over the Julia
> algorithm, which also reproduces the big jump around matrix dimension 500.
> For smaller matrix dimensions, the speedup is on the order of 10 to 15, which
> is larger than, but at least in the same order of magnitude as, the C
> comparison on the cited webpage. But then, for matrix dimensions larger than
> 500, the Julia version seems to become a factor of 100 slower than BLAS. Is
> this to be expected?
>
> Best regards,
>
> Jutho
