Jutho, you may want to check out generic_matmatmul deep inside Base; it 
implements a cache-friendly multiplication algorithm. It's possible that 
you'll find it's faster yet.
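
For reference, the blocking idea behind such a cache-friendly multiply looks 
roughly like this (a minimal sketch of the technique, not Base's actual 
generic_matmatmul, whose internals are considerably more involved):

    # Multiply square matrices tile by tile so that each bs-by-bs tile of
    # A, B and C stays resident in cache while it is being worked on.
    function blocked_matmul(A::Matrix{Float64}, B::Matrix{Float64}, bs::Int=64)
        n = size(A, 1)
        C = zeros(Float64, n, n)
        for jj in 1:bs:n, kk in 1:bs:n, ii in 1:bs:n
            for j in jj:min(jj+bs-1, n), k in kk:min(kk+bs-1, n)
                b = B[k, j]
                @inbounds for i in ii:min(ii+bs-1, n)  # runs down a column of C
                    C[i, j] += A[i, k] * b
                end
            end
        end
        return C
    end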

--Tim

On Tuesday, February 11, 2014 08:43:30 AM Jiahao Chen wrote:
> What you are seeing in going from n<500 to n>500 is likely that the
> problem is too large to fit in (L3?) cache: an individual Float64
> matrix takes up ~500*500*8 bytes ~ 2 MB, so three matrices take about
> 6 MB of memory and start to fill up a 6-8 MB cache. Writing matrix
> multiplies that fit nicely into cache is the kind of kernel that
> people spend a lot of time optimizing. The ATLAS library in
> particular made explicit use of the observation that cache locality
> is important for performance to generate a lot of specialized linear
> algebra kernels:
> 
> http://web.eecs.utk.edu/~rwhaley/ATL/INDEX.HTM
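> 
> A quick back-of-the-envelope check of the cache arithmetic (a sketch;
> the 6-8 MB cache size is the figure assumed above):
> 
>     n = 500
>     bytes = 3 * n^2 * sizeof(Float64)   # A, B and C together
>     println(bytes / 2^20, " MiB")       # ~5.7 MiB, about an L3's worth
> 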
> Thanks,
> 
> Jiahao Chen, PhD
> Staff Research Scientist
> MIT Computer Science and Artificial Intelligence Laboratory
> 
> On Tue, Feb 11, 2014 at 8:09 AM, Jutho <[email protected]> wrote:
> > Dear Julia users/developers,
> > 
> > For reasons that I will not go into right now, I wanted to investigate how
> > much one loses by implementing matrix multiplication in Julia itself
> > instead of using the BLAS functionality. For the equivalent comparison in
> > C, I found the following reference:
> > http://www.stanford.edu/~jacobm/matrixmultiply.html
> > which shows that the speedup of BLAS saturates at about a factor of 4 or 5
> > over a plain C implementation without any special tricks (except perhaps
> > the -O2 -funroll-loops flags passed to the compiler). While the page gives
> > few details, I assume the BLAS matrix multiplication was run
> > single-threaded. I don't know whether the factor of 4 is a coincidence,
> > i.e. whether it has anything to do with the use of SIMD instructions in
> > BLAS.
> > 
> > I wanted to do the same comparison using native Julia, so I created the
> > following IJulia notebook to show the results.
> > http://nbviewer.ipython.org/gist/Jutho/8934314
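> > 
> > For concreteness, the kind of straightforward kernel being compared looks
> > roughly like this (a sketch; the code in the notebook may differ in its
> > details):
> > 
> >     function naive_matmul(A::Matrix{Float64}, B::Matrix{Float64})
> >         n = size(A, 1)
> >         C = zeros(Float64, n, n)
> >         for j in 1:n, k in 1:n
> >             b = B[k, j]
> >             @inbounds for i in 1:n   # innermost loop runs down columns
> >                 C[i, j] += A[i, k] * b
> >             end
> >         end
> >         return C
> >     end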
> > 
> > I took 5 repeated timings for a range of matrix dimensions (all square
> > matrices). The first two plots show the Julia timings (respectively the
> > BLAS timings) divided by the matrix dimension cubed, which should level
> > off at a constant. This is the case for BLAS, and also for Julia, but the
> > Julia curve shows a first plateau around matrix dimensions 50-400 and
> > then jumps suddenly to a much larger value in the range 600-2000.
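> > 
> > Roughly, the timings were collected like this (a sketch; the repetition
> > count and dimensions are illustrative, naive_matmul is the kernel
> > sketched above, and the call that pins BLAS to one thread depends on the
> > Julia version):
> > 
> >     using LinearAlgebra
> >     BLAS.set_num_threads(1)   # single-threaded BLAS for a fair comparison
> > 
> >     for n in (100, 200, 500, 1000, 2000)
> >         A = rand(n, n); B = rand(n, n)
> >         tjl   = minimum([@elapsed(naive_matmul(A, B)) for _ in 1:5])
> >         tblas = minimum([@elapsed(A * B) for _ in 1:5])
> >         println(n, "\t", tjl / n^3, "\t", tblas / n^3)
> >     end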
> > 
> > The third plot shows the speedup of BLAS (single-threaded) over the Julia
> > algorithm, which also reproduces the big jump around matrix dimension 500.
> > For smaller matrix dimensions the speedup is of the order of 10 to 15,
> > which is larger than, but still of the same order of magnitude as, the C
> > comparison on the cited webpage. For matrix dimensions larger than 500,
> > however, the Julia version becomes about a factor of 100 slower than
> > BLAS. Is this to be expected?
> > 
> > Best regards,
> > 
> > Jutho
