I've been advised to write loops in pure Julia for speed, contrary to the
strategy of using library functions in Python. After spending a couple
hundred hours learning Julia, I am having trouble matching Python's speed
with Julia. Here is code for a benchmark in Python:
import numpy as np
import numexpr as ne
from scipy.linalg.blas import zdotu
from timeit import default_timer as timer
n = 10000000
reps = 50
x = -1J * np.ones(n)
y = 1J * np.ones(n)
res = np.empty(reps, dtype=np.complex128)
start = timer()
for k in range(reps):
scaler = float(k)
cexp = ne.evaluate('exp(scaler * x)')
res[k] = zdotu(cexp, y)
exec_time=(timer() - start)
print
print("Execution took", str(round(exec_time, 1)), "seconds")
On my Intel Core-i7 Windows PC with 64-bit Python 3.4, this took 4.5
seconds to execute. I note that Python's numexpr library function uses all
8 CPU threads, while BLAS' zdotu() is single-threaded.
Here is my single-threaded Julia code for the same benchmark:
function innerloop(x::Array{Complex{Float64}}, y::Array{Complex{Float64}},
scaler::Float64)
n = length(x)
assert(length(y) == n)
s = 0.0 + 0.0im
@simd for i in 1:n
@inbounds s += y[i] * exp(scaler * x[i])
end
s
end
function timeit(n::Int, reps::Int)
x = fill(0.0-1.0im, n)
y = fill(0.0+1.0im, n)
res = Array(Complex{Float64}, reps)
@time for k in 1:reps
scaler = convert(Float64, k)
res[k] = innerloop(x, y, scaler)
end
end
timeit(100, 1)
timeit(10000000, 50)
On the same PC with 64-bit Julia 0.3.7, the second call to timeit() took
17.6 seconds and allocated 0 bytes. That is 3.9 times slower than Python.
I note that the @simd decorator didn't speed up the execution.
Python performance seems in reach if I can get a parallelized Julia
implementation. Here is my attempt:
function innerloop(x::Array{Complex{Float64}}, y::Array{Complex{Float64}},
scaler::Float64)
n = length(x)
assert(length(y) == n)
s = @parallel (+) for i in 1:n
y[i] * exp(scaler * x[i])
end
s
end
The timeit() function remains the same. Running Julia with the "-p 8" on
the command line, the second call to timeit() took 233 seconds and
allocated 303GB. The huge allocation is my clue that something is really
wrong here. All my attempts at adding @inbounds created errors.
I should say that this benchmark is representative of a real problem that I
am trying to solve. My actual Python code takes 10 minutes to execute
because n and reps are larger. However, the types of the vectors match my
actual code, and scaler changes with each rep.
Please show me by example how to modify innerloop() to be faster than the
single-threaded version. My preference is for a fix with Julia 0.3.7, but
I am also interested if someone provides Julia 0.4-pre code including a
speed comparison with the single-threaded version on their machine. If
possible, I would like to avoid SharedArrays because (I believe) they can't
be used with BLAS calls in other functions.