I think I've taken steps to minimize parallel overhead by issuing only one
function call per worker process and passing minimal arguments to those calls.
But the gains in speed don't seem commensurate with the number of processors.
I know that pure linear speedup is too much to hope for, but I suspect I'm
doing something wrong, for example that large data is getting passed around
despite my efforts.
All my code is defined inside a module, though I exercise it from the main REPL.
Single processor (times are representative of multiple samples and exclude
burn-in runs):
julia> @time h=Calculus.hessian(RB.mylike, RB.troubleParamsSim1.raw)
206.422562 seconds (2.43 G allocations: 83.867 GB, 10.14% gc time)
#with 3 workers
julia> myf = RB.makeParallelLL() # code below
julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
182.567647 seconds (1.48 M allocations: 111.622 MB, 0.02% gc time)
#with 7 workers
julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
82.453033 seconds (3.43 M allocations: 259.838 MB, 0.08% gc time)
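To separate the overhead of the parallel reduction itself from the many
evaluations Calculus.hessian makes for finite differencing, I could also time
single evaluations directly (a sketch; I haven't collected these numbers):
julia> @time RB.mylike(RB.troubleParamsSim1.raw)  # one serial -LL evaluation
julia> @time myf(RB.troubleParamsSim1.raw)        # one parallel -LL evaluation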
Any suggestions? This is on an 8-CPU VM; the underlying hardware has well over
8 processors.
Here's some of the code:
mylike(v) = -logitlite(RB.obs, v, 7)
"""
return a matrix whose columns are the start and the end of range of
observations, inclusive.
these chunks are roughly equal sizes, and no id is split between chunks.
n target number of chunks, and thus of rows returned
id id for each row; generally one id may appear in several rows.
all rows with the same id must be contiguous.
"""
function chunks(id, n::Int)
    nTot = length(id)
    s = nTot/n
    cuts = Vector{Int}(n)
    cuts[1] = 1
    for i = 1:(n-1)
        cut::Int = floor(s*i)
        while id[cut] == id[cut-1]
            cut += 1
        end
        cuts[i+1] = cut
    end
    stops = vcat(cuts[2:n]-1, nTot)
    hcat(cuts, stops)
end
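To illustrate the intended behavior, here is a tiny hypothetical id vector
(not my real data): the six rows split so that no id straddles a chunk
boundary.
julia> chunks([1, 1, 2, 2, 2, 3], 2)  # hypothetical ids
# returns [1 2; 3 6]: chunk 1 covers rows 1:2 (id 1), chunk 2 covers rows 3:6 (ids 2 and 3)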
"""
Parallel helper: evaluate likelihood over limited range
i0..i1, inclusive, is the range.
This is an effort to avoid passing obs between processes.
"""
function cutLL(i0::Int, i1::Int, x)
    # obs is a 6413x8 DataFrame. Maybe this slice is doing an extra copy?
    logitlite(obs[i0:i1,:], x, 7)
end
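As a rough check on whether that row slice is copying, I could time the slice
by itself and look at the reported allocations (sketch only; i0 and i1 here are
arbitrary test values):
julia> i0, i1 = 1, 1000
julia> @time RB.obs[i0:i1, :]  # substantial allocations here would suggest the slice copies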
"""
Return a function that takes one argument, the optimization parameters.
It evaluates the -LL in parallel
"""
function makeParallelLL()
    np = nworkers()
    cuts = chunks(obs[:id], np)
    function myLL(x)
        ll = @parallel (+) for i = 1:np
            cutLL(cuts[i,1], cuts[i,2], x)
        end
        -ll
    end
end
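If it would help to confirm that each worker really only touches its own range,
a throwaway variant of cutLL (not part of the real code, and assuming worker
output is forwarded to the master console) could print the assignment:
# throwaway debugging variant of cutLL
function cutLLdebug(i0::Int, i1::Int, x)
    println("worker ", myid(), " handling rows ", i0, ":", i1)
    logitlite(obs[i0:i1,:], x, 7)
end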
It might be relevant that logitlite uses by() to process groups of
observations.
So I think the only things I'm passing to the workers are a call to cutLL, two
integers, and a 10-element Vector{Float64}, and the only thing coming back from
each worker is a single Float64.
Or perhaps the fact that I'm using an inner function definition (myLL in the
last function above) is causing something like the transmission of all
variables in its lexical scope?
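One way I might check that is to serialize the closure to a buffer and look at
how many bytes it produces; if myLL were dragging obs along, the size should be
large. This is just a sketch of the check, not something I've run yet:
julia> io = IOBuffer()
julia> serialize(io, myf)
julia> position(io)  # bytes written; a few KB suggests a small closure, many MB suggests it captured big data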