I wrote the Julia scripts below that just test the performance of the 
matrix multiplication in four cases (A' stands for A transpose, and B' 
stands for B transpose):

   1. A x B
   2. A' x B
   3. A x B'
   4. A' x B'

I ran these scripts multiplying two 1,000 x 1,000 matrices 20 times, i.e.:


   1. axb.jl -n 1000 -k 1000 -m 1000 -t 20
   2. atxb.jl -n 1000 -k 1000 -m 1000 -t 20
   3. axbt.jl -n 1000 -k 1000 -m 1000 -t 20
   4. atxbt.jl -n 1000 -k 1000 -m 1000 -t 20

Cases 1, 2, 3 run in about the same time with a performance of about 70 
GFLOPS on my desktop, however case 4 is about 10 times slower.
I run Linux Fedora 23, julia version 0.4.3, and openblas version 0.2.15.
I also compiled Julia directly from the latest version on GitHub, used 
Intel MKL, and the results are similar (the absolute times are different, 
but the A' x B' case is still about 10 times slower than the other three 
cases).

Since Julia uses openblas/MKL function 'dgemm' (or 'sgemm' in the single 
precision case), I also wrote the same examples in C directly invoking 
openblas/MKL 'cblas_dgemm' function and there's no difference in these four 
cases using the native calls (the arguments for the function 'cblas_dgemm' 
have transpose flags for both A and B).

I am not sure why Julia is so much slower multiplying A' x B', so I thought 
I would bring it to your attention.

Regards,
Franco Venturi

#!/bin/julia -q

using ArgParse

s = ArgParseSettings()

@add_arg_table s begin
    "-m"
    "-k"
    "-n"
    "-t"
end

parsed_args = parse_args(s)
m = parse(Int, parsed_args["m"])
k = parse(Int, parsed_args["k"])
n = parse(Int, parsed_args["n"])
t = parse(Int, parsed_args["t"])

a = 2 * rand(m, k) - 1
b = 2 * rand(k, n) - 1

tstart = time()
for i = 1:t
    c = a * b
end
tend = time()
duration = (tend - tstart) / t
mflops = (2 * m * n * k) / duration * 1.0e-6
@printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops)


#!/bin/julia -q

using ArgParse

s = ArgParseSettings()

@add_arg_table s begin
    "-m"
    "-k"
    "-n"
    "-t"
end

parsed_args = parse_args(s)
m = parse(Int, parsed_args["m"])
k = parse(Int, parsed_args["k"])
n = parse(Int, parsed_args["n"])
t = parse(Int, parsed_args["t"])

a = (2 * rand(m, k) - 1)'
b = 2 * rand(k, n) - 1

tstart = time()
for i = 1:t
    c = a' * b
end
tend = time()
duration = (tend - tstart) / t
mflops = (2 * m * n * k) / duration * 1.0e-6
@printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops)



#!/bin/julia -q

using ArgParse

s = ArgParseSettings()

@add_arg_table s begin
    "-m"
    "-k"
    "-n"
    "-t"
end

parsed_args = parse_args(s)
m = parse(Int, parsed_args["m"])
k = parse(Int, parsed_args["k"])
n = parse(Int, parsed_args["n"])
t = parse(Int, parsed_args["t"])

a = 2 * rand(m, k) - 1
b = (2 * rand(k, n) - 1)'

tstart = time()
for i = 1:t
    c = a * b'
end
tend = time()
duration = (tend - tstart) / t
mflops = (2 * m * n * k) / duration * 1.0e-6
@printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops)



#!/bin/julia -q

using ArgParse

s = ArgParseSettings()

@add_arg_table s begin
    "-m"
    "-k"
    "-n"
    "-t"
end

parsed_args = parse_args(s)
m = parse(Int, parsed_args["m"])
k = parse(Int, parsed_args["k"])
n = parse(Int, parsed_args["n"])
t = parse(Int, parsed_args["t"])

a = (2 * rand(m, k) - 1)'
b = (2 * rand(k, n) - 1)'

tstart = time()
for i = 1:t
    c = a' * b'
end
tend = time()
duration = (tend - tstart) / t
mflops = (2 * m * n * k) / duration * 1.0e-6
@printf("%dx%dx%d\t%f s\t%f MFLOPS\n", m, n, k, duration, mflops)




Reply via email to