I wrote the Julia scripts below that just test the performance of the
matrix multiplication in four cases (A' stands for A transpose, and B'
stands for B transpose):
1. A x B
2. A' x B
3. A x B'
4. A' x B'
I ran these scripts multiplying two 1,000 x 1,000 matrices 20 times, i.e.:
1. axb.jl -n 1000 -k 1000 -m 1000 -t 20
2. atxb.jl -n 1000 -k 1000 -m 1000 -t 20
3. axbt.jl -n 1000 -k 1000 -m 1000 -t 20
4. atxbt.jl -n 1000 -k 1000 -m 1000 -t 20
Cases 1, 2, 3 run in about the same time with a performance of about 70
GFLOPS on my desktop, however case 4 is about 10 times slower.
I run Linux Fedora 23, julia version 0.4.3, and openblas version 0.2.15.
I also compiled Julia directly from the latest version on GitHub, used
Intel MKL, and the results are similar (the absolute times are different,
but the A' x B' case is still about 10 times slower than the other three
cases).
Since Julia uses openblas/MKL function 'dgemm' (or 'sgemm' in the single
precision case), I also wrote the same examples in C directly invoking
openblas/MKL 'cblas_dgemm' function and there's no difference in these four
cases using the native calls (the arguments for the function 'cblas_dgemm'
have transpose flags for both A and B).
I am not sure why Julia is so much slower multiplying A' x B', so I thought
I would bring it to your attention.
Regards,
Franco Venturi
#!/bin/julia -q
using ArgParse
s = ArgParseSettings()
@add_arg_table s begin
"-m"
"-k"
"-n"
"-t"
end
parsed_args = parse_args(s)
m = parse(Int, parsed_args["m"])
k = parse(Int, parsed_args["k"])
n = parse(Int, parsed_args["n"])
t = parse(Int, parsed_args["t"])
a = 2 * rand(m, k) - 1
b = 2 * rand(k, n) - 1
tstart = time()
for i = 1:t
c = a * b
end
tend = time()
duration = (tend - tstart) / t
mflops = (2 * m * n * k) / duration * 1.0e-6
@printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops)
#!/bin/julia -q
using ArgParse
s = ArgParseSettings()
@add_arg_table s begin
"-m"
"-k"
"-n"
"-t"
end
parsed_args = parse_args(s)
m = parse(Int, parsed_args["m"])
k = parse(Int, parsed_args["k"])
n = parse(Int, parsed_args["n"])
t = parse(Int, parsed_args["t"])
a = (2 * rand(m, k) - 1)'
b = 2 * rand(k, n) - 1
tstart = time()
for i = 1:t
c = a' * b
end
tend = time()
duration = (tend - tstart) / t
mflops = (2 * m * n * k) / duration * 1.0e-6
@printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops)
#!/bin/julia -q
using ArgParse
s = ArgParseSettings()
@add_arg_table s begin
"-m"
"-k"
"-n"
"-t"
end
parsed_args = parse_args(s)
m = parse(Int, parsed_args["m"])
k = parse(Int, parsed_args["k"])
n = parse(Int, parsed_args["n"])
t = parse(Int, parsed_args["t"])
a = 2 * rand(m, k) - 1
b = (2 * rand(k, n) - 1)'
tstart = time()
for i = 1:t
c = a * b'
end
tend = time()
duration = (tend - tstart) / t
mflops = (2 * m * n * k) / duration * 1.0e-6
@printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops)
#!/bin/julia -q
using ArgParse
s = ArgParseSettings()
@add_arg_table s begin
"-m"
"-k"
"-n"
"-t"
end
parsed_args = parse_args(s)
m = parse(Int, parsed_args["m"])
k = parse(Int, parsed_args["k"])
n = parse(Int, parsed_args["n"])
t = parse(Int, parsed_args["t"])
a = (2 * rand(m, k) - 1)'
b = (2 * rand(k, n) - 1)'
tstart = time()
for i = 1:t
c = a' * b'
end
tend = time()
duration = (tend - tstart) / t
mflops = (2 * m * n * k) / duration * 1.0e-6
@printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops)