Thanks for the quick response. If I understand correctly, this is similar to the first plateau in http://www.stanford.edu/~jacobm/matrixmultiply.html for values in the range 50-200, at a factor of roughly 1.3 times the BLAS run time. I completely overlooked this before.
So to make a fair comparison to that C implementation, I have to compare the Julia run time (10-15 times the BLAS run time) with the C run time (1.3 times the BLAS run time) in the first regime, and the Julia run time (100 times the BLAS run time) with the C run time (4 to 5 times the BLAS run time) in the second regime. Any idea where the big difference between Julia and C is coming from?

Best regards,
Jutho
