I can't reproduce this at the REPL or if I put your loop in a function. (I can replicate your result---in my case a factor of 3---if I run your scripts.) But since you're also measuring JIT-compiling time, I'm not sure how seriously to take that (and it's kinda irrelevant anyway, since you never, ever do anything performance sensitive outside a function).
Check out the performance tips page: http://docs.julialang.org/en/stable/manual/performance-tips/ --Tim On Monday, January 18, 2016 10:17:26 AM Franco Venturi wrote: > I wrote the Julia scripts below that just test the performance of the > matrix multiplication in four cases (A' stands for A transpose, and B' > stands for B transpose): > > 1. A x B > 2. A' x B > 3. A x B' > 4. A' x B' > > I ran these scripts multiplying two 1,000 x 1,000 matrices 20 times, i.e.: > > > 1. axb.jl -n 1000 -k 1000 -m 1000 -t 20 > 2. atxb.jl -n 1000 -k 1000 -m 1000 -t 20 > 3. axbt.jl -n 1000 -k 1000 -m 1000 -t 20 > 4. atxbt.jl -n 1000 -k 1000 -m 1000 -t 20 > > Cases 1, 2, 3 run in about the same time with a performance of about 70 > GFLOPS on my desktop, however case 4 is about 10 times slower. > I run Linux Fedora 23, julia version 0.4.3, and openblas version 0.2.15. > I also compiled Julia directly from the latest version on GitHub, used > Intel MKL, and the results are similar (the absolute times are different, > but the A' x B' case is still about 10 times slower than the other three > cases). > > Since Julia uses openblas/MKL function 'dgemm' (or 'sgemm' in the single > precision case), I also wrote the same examples in C directly invoking > openblas/MKL 'cblas_dgemm' function and there's no difference in these four > cases using the native calls (the arguments for the function 'cblas_dgemm' > have transpose flags for both A and B). > > I am not sure why Julia is so much slower multiplying A' x B', so I thought > I would bring it to your attention. > > Regards, > Franco Venturi > > #!/bin/julia -q > > using ArgParse > > s = ArgParseSettings() > > @add_arg_table s begin > "-m" > "-k" > "-n" > "-t" > end > > parsed_args = parse_args(s) > m = parse(Int, parsed_args["m"]) > k = parse(Int, parsed_args["k"]) > n = parse(Int, parsed_args["n"]) > t = parse(Int, parsed_args["t"]) > > a = 2 * rand(m, k) - 1 > b = 2 * rand(k, n) - 1 > > tstart = time() > for i = 1:t > c = a * b > end > tend = time() > duration = (tend - tstart) / t > mflops = (2 * m * n * k) / duration * 1.0e-6 > @printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops) > > > #!/bin/julia -q > > using ArgParse > > s = ArgParseSettings() > > @add_arg_table s begin > "-m" > "-k" > "-n" > "-t" > end > > parsed_args = parse_args(s) > m = parse(Int, parsed_args["m"]) > k = parse(Int, parsed_args["k"]) > n = parse(Int, parsed_args["n"]) > t = parse(Int, parsed_args["t"]) > > a = (2 * rand(m, k) - 1)' > b = 2 * rand(k, n) - 1 > > tstart = time() > for i = 1:t > c = a' * b > end > tend = time() > duration = (tend - tstart) / t > mflops = (2 * m * n * k) / duration * 1.0e-6 > @printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops) > > > > #!/bin/julia -q > > using ArgParse > > s = ArgParseSettings() > > @add_arg_table s begin > "-m" > "-k" > "-n" > "-t" > end > > parsed_args = parse_args(s) > m = parse(Int, parsed_args["m"]) > k = parse(Int, parsed_args["k"]) > n = parse(Int, parsed_args["n"]) > t = parse(Int, parsed_args["t"]) > > a = 2 * rand(m, k) - 1 > b = (2 * rand(k, n) - 1)' > > tstart = time() > for i = 1:t > c = a * b' > end > tend = time() > duration = (tend - tstart) / t > mflops = (2 * m * n * k) / duration * 1.0e-6 > @printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops) > > > > #!/bin/julia -q > > using ArgParse > > s = ArgParseSettings() > > @add_arg_table s begin > "-m" > "-k" > "-n" > "-t" > end > > parsed_args = parse_args(s) > m = parse(Int, parsed_args["m"]) > k = parse(Int, parsed_args["k"]) > n = parse(Int, parsed_args["n"]) > t = parse(Int, parsed_args["t"]) > > a = (2 * rand(m, k) - 1)' > b = (2 * rand(k, n) - 1)' > > tstart = time() > for i = 1:t > c = a' * b' > end > tend = time() > duration = (tend - tstart) / t > mflops = (2 * m * n * k) / duration * 1.0e-6 > @printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops)
