I think I've taken steps to minimize parallel overhead by issuing only one
function call per worker process and passing minimal arguments to those calls.
But the gains in speed don't seem commensurate with the number of processors.
I know that pure linear speedup is too much to hope for, but I suspect I'm
doing something wrong, for example that large data is getting passed around
despite my efforts.
All my code is defined inside a module, though I exercise it from the main REPL.
Single processor (times are representative of multiple samples and exclude
burn-in runs):
julia> @time h=Calculus.hessian(RB.mylike, RB.troubleParamsSim1.raw)
206.422562 seconds (2.43 G allocations: 83.867 GB, 10.14% gc time)
#with 3 workers
julia> myf = RB.makeParallelLL() # code below
julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
182.567647 seconds (1.48 M allocations: 111.622 MB, 0.02% gc time)
#with 7 workers
julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
82.453033 seconds (3.43 M allocations: 259.838 MB, 0.08% gc time)
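To separate the overhead of the parallel reduction itself from the many
evaluations Calculus.hessian makes for finite differencing, I could also time
single evaluations directly (a sketch; I haven't collected these numbers):
julia> @time RB.mylike(RB.troubleParamsSim1.raw)  # one serial -LL evaluation
julia> @time myf(RB.troubleParamsSim1.raw)        # one parallel -LL evaluation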
Any suggestions? This is on an 8-CPU VM; the underlying hardware has well over
8 processors.
Here's some of the code:
mylike(v) = -logitlite(RB.obs, v, 7)
"""
return a matrix whose columns are the start and the end of range of
observations, inclusive.
these chunks are roughly equal sizes, and no id is split between chunks.
n target number of chunks, and thus of rows returned
id id for each row; generally one id may appear in several rows.
all rows with the same id must be contiguous.
"""
function chunks(id, n::Int)
    nTot = length(id)
    s = nTot/n
    cuts = Vector{Int}(n)
    cuts[1] = 1
    for i = 1:(n-1)
        cut::Int = floor(s*i)
        while id[cut] == id[cut-1]
            cut += 1
        end
        cuts[i+1] = cut
    end
    stops = vcat(cuts[2:n]-1, nTot)
    hcat(cuts, stops)
end
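To illustrate the intended behavior, here is a tiny hypothetical id vector
(not my real data): the six rows split so that no id straddles a chunk
boundary.
julia> chunks([1, 1, 2, 2, 2, 3], 2)  # hypothetical ids
# returns [1 2; 3 6]: chunk 1 covers rows 1:2 (id 1), chunk 2 covers rows 3:6 (ids 2 and 3)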
"""
Parallel helper: evaluate likelihood over limited range
i0..i1, inclusive, is the range.
This is an effort to avoid passing obs between processes.
"""
function cutLL(i0::Int, i1::Int, x)
    # obs is a 6413x8 DataFrame. Maybe this slice is doing an extra copy?
    logitlite(obs[i0:i1,:], x, 7)
end
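As a rough check on whether that row slice is copying, I could time the slice
by itself and look at the reported allocations (sketch only; i0 and i1 here are
arbitrary test values):
julia> i0, i1 = 1, 1000
julia> @time RB.obs[i0:i1, :]  # substantial allocations here would suggest the slice copies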
"""
Return a function that takes one argument, the optimization parameters.
It evaluates the -LL in parallel
"""
function makeParallelLL()
    np = nworkers()
    cuts = chunks(obs[:id], np)
    function myLL(x)
        ll = @parallel (+) for i = 1:np
            cutLL(cuts[i,1], cuts[i,2], x)
        end
        -ll
    end
end
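If it would help to confirm that each worker really only touches its own range,
a throwaway variant of cutLL (not part of the real code, and assuming worker
output is forwarded to the master console) could print the assignment:
# throwaway debugging variant of cutLL
function cutLLdebug(i0::Int, i1::Int, x)
    println("worker ", myid(), " handling rows ", i0, ":", i1)
    logitlite(obs[i0:i1,:], x, 7)
end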
It might be relevant that logitlite uses by() to process groups of
observations.
So I think the only things I'm passing to the workers are a call to cutLL, two
integers, and a 10-element Vector{Float64}, and the only thing coming back from
each worker is a single Float64.
Or perhaps the fact that I'm using an inner function definition (myLL in the
last function above) is causing something like the transmission of all
variables in its lexical scope?
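One way I might check that is to serialize the closure to a buffer and look at
how many bytes it produces; if myLL were dragging obs along, the size should be
large. This is just a sketch of the check, not something I've run yet:
julia> io = IOBuffer()
julia> serialize(io, myf)
julia> position(io)  # bytes written; a few KB suggests a small closure, many MB suggests it captured big data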