Would pmap() be suitable for your application? When I use it, I usually
get slightly more than a 2x speedup with 4 cores.
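
For what it's worth, here is a minimal sketch of the pmap() pattern I have in mind. It is not your actual code: toy_cutLL, parallel_negll, and the ranges are made-up stand-ins for your cutLL and chunk boundaries, and I am assuming each chunk can be evaluated independently and then summed.

```julia
# Needed on Julia 0.7 and later, where pmap lives in the Distributed
# standard library. On 0.4, pmap is in Base, so drop this line.
using Distributed

# Hypothetical stand-in for cutLL: one call handles one inclusive row range.
# @everywhere defines it on every process, so only the small range tuple and
# the parameter vector need to be sent for each call.
@everywhere toy_cutLL(i0, i1, x) = sum(x) * (i1 - i0 + 1)

# Evaluate every range with pmap and return the negated sum.
# x is captured as a local by the closure, so it is shipped along with it.
function parallel_negll(ranges, x)
    parts = pmap(r -> toy_cutLL(r[1], r[2], x), ranges)
    -sum(parts)
end

# Illustrative (start, stop) pairs, playing the role of the rows of `cuts`.
ranges = [(1, 100), (101, 250), (251, 400)]
x = [0.5, 1.5]
result = parallel_negll(ranges, x)
```

Unlike @parallel, pmap hands out one item at a time to whichever worker is free, so it tends to pay off when each call is expensive and the chunks take uneven amounts of time. With no workers added, it simply runs on the master process.
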
On Monday, June 20, 2016 at 2:27:41 AM UTC-4, Boylan, Ross wrote:
>
> I think I've taken steps to minimize parallel overhead by providing only
> one function call per process and passing really minimal arguments to the
> functions. But the gains in speed don't seem commensurate with the number
> of processors. I know that pure linear speedup is too much to hope for,
> but I suspect I'm doing something wrong--for example that large data is
> getting passed around despite my efforts.
>
> All my code is defined inside a module, though I exercise it from the main
> REPL.
>
> Single processor (times are representative of multiple samples and exclude
> burn-in runs):
> julia> @time h=Calculus.hessian(RB.mylike, RB.troubleParamsSim1.raw)
> 206.422562 seconds (2.43 G allocations: 83.867 GB, 10.14% gc time)
>
>
> #with 3 workers
> julia> myf = RB.makeParallelLL() # code below
> julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
> 182.567647 seconds (1.48 M allocations: 111.622 MB, 0.02% gc time)
>
> #with 7 workers
> julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
> 82.453033 seconds (3.43 M allocations: 259.838 MB, 0.08% gc time)
>
> Any suggestions? This is on an 8-CPU VM; the underlying hardware has >>8
> processors.
>
> Here's some of the code:
>
> mylike(v) = -logitlite(RB.obs, v, 7)
>
> """
> return a matrix whose columns are the start and the end of range of
> observations, inclusive.
>
> these chunks are roughly equal sizes, and no id is split between chunks.
>
> n   target number of chunks, and thus of rows returned
>
> id  id for each row; generally one id may appear in several rows.
>     all rows with the same id must be contiguous.
> """
> function chunks(id, n::Int)
>     nTot = length(id)
>     s = nTot/n
>     cuts = Vector{Int}(n)
>     cuts[1] = 1
>     for i = 1:(n-1)
>         cut::Int = floor(s*i)
>         while id[cut] == id[cut-1]
>             cut += 1
>         end
>         cuts[i+1] = cut
>     end
>     stops = vcat(cuts[2:n]-1, nTot)
>     hcat(cuts, stops)
> end
>
> """
> Parallel helper: evaluate likelihood over limited range
>
> i0..i1, inclusive, is the range.
>
> This is an effort to avoid passing obs between processes.
> """
> function cutLL(i0::Int, i1::Int, x)
>     # obs is a 6413x8 data.frame. Maybe this is doing an extra copy?
>     logitlite(obs[i0:i1,:], x, 7)
> end
>
> """
> Return a function that takes one argument, the optimization parameters.
>
> It evaluates the -LL in parallel
> """
> function makeParallelLL()
>     np = nworkers()
>     cuts = chunks(obs[:id], np)
>     function myLL(x)
>         ll = @parallel (+) for i=1:np
>             cutLL(cuts[i,1], cuts[i,2], x)
>         end
>         -ll
>     end
> end
>
>
> It might be relevant that logitlite uses a by() to process groups of
> observations.
>
> So I think the only thing I'm passing into the workers is a function
> call, cutLL, 2 integers, and a 10-element Vector{Float64}, and the only
> thing going back is 1 Float64.
>
> Or perhaps the fact that I'm using an inner function def (myLL in the last
> function definition) is doing something like causing transmission of all
> variables in lexical scope?
>
>