Dear Julia users/developers,

For reasons that I will not go into right now, I wanted to investigate how much one loses by writing matrix multiplication in native Julia instead of using the BLAS functionality. For the same question in C, I found the following reference: http://www.stanford.edu/~jacobm/matrixmultiply.html which shows that the speed gain of BLAS levels off at about a factor of 4 or 5 over the naive C algorithm without any special tricks (except perhaps the -O2 -funroll-loops flags passed to the compiler). While that page is short on detail, I assume the BLAS matrix multiplication was run single-threaded. I don't know whether the factor of 4 is a coincidence, i.e. whether it has anything to do with the use of SIMD instructions in BLAS.
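By the naive algorithm "without any special tricks" I mean a plain triple-loop multiplication, i.e. something along the following lines (a minimal sketch written here in Julia rather than C; the name matmul_naive is just for illustration and is not necessarily what the notebook below uses):

    # Naive triple-loop matrix multiplication: no blocking, no SIMD tricks.
    function matmul_naive(A::Matrix{Float64}, B::Matrix{Float64})
        m, n = size(A)
        n2, p = size(B)
        n == n2 || error("dimension mismatch")
        C = zeros(m, p)
        for i in 1:m, j in 1:p
            s = 0.0
            for k in 1:n
                s += A[i, k] * B[k, j]  # inner product of row i of A and column j of B
            end
            C[i, j] = s
        end
        return C
    end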
I wanted to do the same comparison in native Julia, so I created the following IJulia notebook to show the results: http://nbviewer.ipython.org/gist/Jutho/8934314

I repeated the timings 5 times for a range of matrix dimensions (all square matrices). The first two plots show the Julia timings (respectively the BLAS timings) divided by the matrix dimension cubed, which should level off at a constant. This is the case for BLAS, and also for Julia, but the Julia curve seems to plateau first for matrix dimensions around 50-400 and then jumps suddenly to a much larger value in the range 600-2000. The third plot shows the speedup of BLAS (single-threaded) over the Julia algorithm, which reproduces the big jump around matrix dimension 500. For smaller matrix dimensions, the speedup is on the order of 10 to 15, which is larger than, but still of the same order of magnitude as, the C comparison on the cited webpage. For matrix dimensions larger than 500, however, the Julia version becomes about a factor of 100 slower than BLAS. Is this to be expected?
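For reference, the measurement setup was along these lines (a minimal sketch in present-day Julia syntax; the notebook itself was written for the Julia version of the time and differs in detail, and matmul_naive is the illustrative function from above):

    using LinearAlgebra            # provides the BLAS module in current Julia
    BLAS.set_num_threads(1)        # force single-threaded BLAS for a fair comparison

    matmul_naive(rand(2, 2), rand(2, 2))  # warm up: compile before timing

    for n in (100, 200, 400, 800, 1600)
        A = rand(n, n)
        B = rand(n, n)
        t_julia = @elapsed matmul_naive(A, B)
        t_blas  = @elapsed A * B   # A * B dispatches to BLAS dgemm for Float64 matrices
        println("n = $n: julia/n^3 = $(t_julia / n^3), speedup = $(t_julia / t_blas)")
    end

Best regards,

Jutho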
