On Tuesday, February 11, 2014 9:18:16 AM UTC-5, Jutho wrote:
>
> So to make a fair comparison to that c implementation, I have to compare
> the Julia speed (10-15 times BLAS speed) with the C speed (1.3 times BLAS
> speed) in the first regime, and the Julia speed (100 times BLAS speed) with
> the C speed (4 to 5 times BLAS speed) in the second regime. Any idea on
> where the big difference between Julia and C is coming from?
>
I would do your own C benchmark rather than trusting the one on that web page. For example, it's not clear which BLAS implementation they are using there, and the choice of BLAS makes a huge difference. Also, that benchmark was run on a fairly old machine, and the gap between optimized BLAS performance and naive 3-loop performance has only increased over time. It may not be particularly meaningful to compare (your Julia)/(your BLAS) to (their C)/(their BLAS).

Also, for small sizes, you may want to replace e.g.

    t1 = @elapsed mygemm!(1.,A,B,0.,C)

with something like

    t1 = (@elapsed for i=1:100; mygemm!(1.,A,B,0.,C); end)/100

and similarly for the BLAS benchmark, so that each measurement runs long enough to be timed accurately.
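To make the averaged timing concrete, here is a minimal sketch of how the two measurements could be set up side by side. Note the assumptions: the mygemm! below is only a hypothetical naive stand-in (not the implementation from this thread), avgtime and compare are helper names made up for this example, and the code uses current Julia 1.x syntax, where BLAS.gemm! lives in the LinearAlgebra standard library.

```julia
using LinearAlgebra  # provides BLAS.gemm! in Julia 1.x

# Hypothetical naive stand-in for mygemm!: computes C = alpha*A*B + beta*C.
function mygemm!(alpha, A, B, beta, C)
    m, k = size(A)
    n = size(B, 2)
    for j in 1:n
        # Scale the existing column of C by beta.
        for i in 1:m
            C[i, j] *= beta
        end
        # Accumulate alpha*A*B column by column (column-major friendly order).
        for l in 1:k
            b = alpha * B[l, j]
            for i in 1:m
                C[i, j] += A[i, l] * b
            end
        end
    end
    return C
end

# Average the wall-clock time of f() over `reps` calls.
function avgtime(f, reps)
    f()  # warm-up call so that compilation is not included in the timing
    t = @elapsed for i in 1:reps
        f()
    end
    return t / reps
end

# Time the naive kernel and BLAS on the same n-by-n inputs.
function compare(n; reps=100)
    A = rand(n, n); B = rand(n, n)
    C = zeros(n, n); D = zeros(n, n)
    t1 = avgtime(() -> mygemm!(1.0, A, B, 0.0, C), reps)
    t2 = avgtime(() -> BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, D), reps)
    return t1, t2, C, D
end

t1, t2, C, D = compare(16)
println("naive/BLAS time ratio for n=16: ", t1 / t2)
```

The warm-up call matters as much as the averaging for small sizes, since otherwise the first call includes compilation time. Wrapping the setup in a function also keeps the timed calls from going through untyped global variables.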
