Dear Julia users/developers,

For reasons that I will not go into right now, I wanted to investigate how 
much one loses by writing matrix multiplication in native Julia rather than 
using the BLAS functionality. Doing this in C, I found the following reference:
http://www.stanford.edu/~jacobm/matrixmultiply.html
which shows that the speedup of BLAS settles at roughly a factor of 4 or 5 
over a naive C algorithm without any special tricks (except perhaps the 
-O2 -funroll-loops flags passed to the compiler). Since that page gives few 
details, I assume that the BLAS matrix multiplication was run single-threaded. 
I don't know whether the factor of 4 is a coincidence, i.e. whether it has 
anything to do with the use of SIMD instructions in BLAS.

I wanted to do the same comparison using native Julia, so I created the 
following IJulia notebook to show the results. 
http://nbviewer.ipython.org/gist/Jutho/8934314
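
By a native Julia implementation I essentially mean the textbook triple loop, 
without blocking or other tricks; a minimal sketch of what I mean (the function 
name and details are only illustrative, the notebook code may differ slightly):

    # Naive O(n^3) matrix multiplication, no blocking, no SIMD tricks.
    function matmul_naive!(C, A, B)
        m, k = size(A)
        k2, n = size(B)
        @assert k == k2 && size(C) == (m, n)
        fill!(C, zero(eltype(C)))
        for j in 1:n, l in 1:k, i in 1:m   # innermost loop runs down columns (column-major friendly)
            C[i, j] += A[i, l] * B[l, j]
        end
        return C
    end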

I repeated each timing 5 times for a range of matrix dimensions (all square 
matrices). The first two plots show the Julia timings (respectively the BLAS 
timings) divided by the matrix dimension cubed, which should converge to a 
constant. That is indeed the case for BLAS, and also for Julia, except that 
the Julia curve first levels off for dimensions around 50-400 and then 
suddenly jumps to a much larger value in the range 600-2000.
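
For concreteness, the measurement is along these lines (a sketch only: the 
names, the chosen dimensions, and the choice of taking the minimum over the 
repetitions are mine, and the notebook was written for the Julia version of 
the time, so the actual calls differ):

    using LinearAlgebra
    BLAS.set_num_threads(1)   # force single-threaded BLAS for a fair comparison

    dims = [50, 100, 200, 400, 600, 1000, 2000]   # illustrative set of dimensions
    tjulia = Float64[]
    tblas  = Float64[]
    for n in dims
        A = rand(n, n); B = rand(n, n); C = zeros(n, n)
        # best of 5 repetitions; the minimum also hides JIT compilation of the first call
        push!(tjulia, minimum([@elapsed(matmul_naive!(C, A, B)) for _ in 1:5]))
        push!(tblas,  minimum([@elapsed(mul!(C, A, B)) for _ in 1:5]))
    end
    # time / n^3 should converge to a constant for an O(n^3) algorithm,
    # and the ratio gives the BLAS speedup over the naive Julia loop.
    norm_julia = tjulia ./ dims .^ 3
    norm_blas  = tblas  ./ dims .^ 3
    speedup    = tjulia ./ tblas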

The third plot shows the speedup of BLAS (single-threaded) over the Julia 
algorithm, which again shows the big jump around matrix dimension 500. For 
smaller matrices the speedup is of the order of 10 to 15, which is larger 
than, but still within an order of magnitude of, the factor found in the C 
comparison on the cited webpage. For matrix dimensions above 500, however, 
the Julia version seems to become about a factor of 100 slower than BLAS. 
Is this to be expected?

Best regards,

Jutho
