What you are seeing in going from n<500 to n>500 is most likely that the
problem becomes too large to fit in (L3?) cache: a single 500x500 Float64
matrix takes up ~500*500*8 bytes ~ 2MB, so three such matrices occupy
~6MB and start to fill up a 6-8 MB cache. Writing matrix multiplies that
fit into cache nicely is exactly the kind of kernel people spend a lot of
time optimizing. The ATLAS library in particular made explicit use of the
observation that cache locality is important for performance to generate
a lot of specialized linear algebra kernels:

http://web.eecs.utk.edu/~rwhaley/ATL/INDEX.HTM
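
The usual fix in a hand-written kernel is to block (tile) the loops so
that the working set of each tile stays resident in cache. As a rough
illustration only (this is not ATLAS's code, nor the code from your
notebook; the function names and the block size 64 are my own choices,
picked so that three 64x64 Float64 tiles of ~100KB stay well inside a
typical L2 cache), a naive triple loop versus a simply blocked version
might look like:

# naive C = A*B: three nested loops, i innermost so that columns of
# C and A are traversed contiguously (Julia arrays are column-major)
function matmul_naive!(C, A, B)
    n, m = size(C)
    k = size(A, 2)
    fill!(C, zero(eltype(C)))
    @inbounds for j in 1:m, p in 1:k, i in 1:n
        C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

# same arithmetic, but processed one bs-by-bs tile at a time so that the
# tiles of A, B and C being touched fit in cache together
function matmul_blocked!(C, A, B; bs = 64)
    n, m = size(C)
    k = size(A, 2)
    fill!(C, zero(eltype(C)))
    @inbounds for jj in 1:bs:m, pp in 1:bs:k, ii in 1:bs:n
        for j in jj:min(jj+bs-1, m), p in pp:min(pp+bs-1, k), i in ii:min(ii+bs-1, n)
            C[i, j] += A[i, p] * B[p, j]
        end
    end
    return C
end

The blocked version does exactly the same floating-point work, but it
revisits each tile of A and B many times while it is still in cache,
which is what keeps the timing-per-n^3 curve flat for large n. Real
libraries like ATLAS tune the block sizes (and much more) per machine.
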
Thanks,

Jiahao Chen, PhD
Staff Research Scientist
MIT Computer Science and Artificial Intelligence Laboratory


On Tue, Feb 11, 2014 at 8:09 AM, Jutho <[email protected]> wrote:
> Dear Julia users/developers,
>
> For reasons that I will not go into right now, I wanted to investigate
> how much one loses by writing matrix multiplication in pure Julia rather
> than calling the BLAS functionality. Doing this in C, I found the
> following reference:
> http://www.stanford.edu/~jacobm/matrixmultiply.html
> which shows that the speedup of BLAS levels off at about a factor of 4
> or 5 over the native C algorithm without any special tricks (except
> perhaps the -O2 -funroll-loops flags passed to the compiler). While that
> page gives few details, I assume the BLAS matrix multiplication was run
> single-threaded. I don't know whether the factor of 4 is a coincidence,
> i.e. whether it has anything to do with the use of SIMD instructions in
> BLAS.
>
> I wanted to do the same comparison using native Julia, so I created the
> following IJulia notebook to show the results.
> http://nbviewer.ipython.org/gist/Jutho/8934314
>
> I repeated the timing 5 times for a range of matrix dimensions (all
> square matrices). The first two plots show the Julia timings
> (respectively the BLAS timings) divided by the matrix dimension cubed,
> which should converge to a constant. This is the case for BLAS, and also
> for Julia, but for Julia there is a first plateau around matrix
> dimensions 50-400 and then a sudden jump to a much larger value in the
> range 600-2000.
>
> The third plot shows the speedup of (single-threaded) BLAS over the
> Julia algorithm, which also shows the big jump around matrix dimension
> 500. For smaller matrix dimensions the speedup is on the order of 10 to
> 15, which is larger than, but at least of the same order of magnitude
> as, the C comparison on the cited webpage. But for matrix dimensions
> larger than 500 the Julia version seems to become a factor of 100 slower
> than BLAS. Is this to be expected?
>
> Best regards,
>
> Jutho
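
For reference, a minimal sketch of the kind of comparison described in
the quoted message (this is not the code from the linked notebook; the
function name, the set of sizes, and the use of mul! and the BLAS
wrappers from the LinearAlgebra standard library all assume a current
Julia and are illustrative only):

using LinearAlgebra   # provides mul! and the BLAS module on current Julia

# naive O(n^3) triple-loop multiply, i innermost for column-major access
function naive_mul!(C, A, B)
    n = size(C, 1)
    fill!(C, 0.0)
    @inbounds for j in 1:n, p in 1:n, i in 1:n
        C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

BLAS.set_num_threads(1)                          # single-threaded BLAS, as in the comparison
naive_mul!(zeros(2, 2), rand(2, 2), rand(2, 2))  # warm up so compilation is not timed

for n in (100, 200, 400, 600, 800, 1000)
    A, B, C = rand(n, n), rand(n, n), zeros(n, n)
    tj = @elapsed naive_mul!(C, A, B)
    tb = @elapsed mul!(C, A, B)                  # dispatches to BLAS dgemm for Float64
    println("n = $n: julia t/n^3 = $(tj / n^3), blas t/n^3 = $(tb / n^3), ratio = $(tj / tb)")
end

With only the naive loop, the ratio typically jumps once the three
matrices no longer fit in cache, which is the effect discussed above.
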
