I tried after factoring out the cost of sending A and v over, but no dice. See https://gist.github.com/amitmurthy/3206a4f61cf6cd6000ee
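For context, one way to factor the transfer cost out of the measurement is to pre-place the data on each worker and time only the compute. A minimal sketch of that idea, assuming the 2014-era (v0.3) Julia API; the zero-arg closure and the loop bound 2500 are just illustrative:

```julia
# Sketch only: give every worker its own copy of A and v up front,
# so timing do_stuff no longer includes serializing/shipping the inputs.
addprocs(4)

@everywhere const A = randn(1000, 1000)
@everywhere const numx = 10_000
# Each worker generates its own random v; the data differs across
# workers, but that is fine when we only care about timing.
@everywhere const v = randn(1000, numx)

@everywhere function do_stuff(mm::Int64)
    for x in 1:4
        sum(A * v[:, mm])
    end
end

# Time only the remote compute; no A/v is shipped per call.
@elapsed remotecall_fetch(2, () -> (for mm in 1:2500; do_stuff(mm); end))
```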
Even with a loop of 4 within do_stuff, same behavior. I think I found the reason in an old thread - https://groups.google.com/d/msg/julia-users/jlKoEtErRL4/UTN7FSlZDgoJ

To confirm the same, save

----------------------------
const A = randn(1000, 1000);
const numx = 10_000;
const v = randn(1000, numx);

do_stuff(r::UnitRange) = ([do_stuff(mm) for mm in r]; nothing)

function do_stuff(mm::Int64)
    for x in 1:4
        sum( A * v[:, mm] )
    end
end

chunks = Base.splitrange(numx, 4)
do_stuff(chunks[1]);
----------------------------

in a new file p.jl, and from the command line launch it four times in quick succession so the copies run in parallel:

julia p.jl &

The performance is heavily degraded compared to the serial version.

On Thu, Nov 27, 2014 at 10:35 PM, Erik Schnetter <[email protected]> wrote:
> You need to use addprocs before the first @everywhere. I assume you
> actually did this, since otherwise you'd have received an error.
>
> It seems that your variables A and v are stored on the master, not on
> the workers. Since they are inputs to do_stuff, they need to be copied
> there every time. Note that the whole array v is copied although only
> part of it is accessed. Maybe sending data to 4 processes
> simultaneously has an overhead, and is somehow much slower than
> sending the data one at a time.
>
> To check whether this is true, you can add a loop within do_stuff to
> execute the routine multiple times. This would increase the workload
> but keep the communication overhead the same.
>
> If this is the case, then to remedy it, you would store A and v
> (better: only the necessary part of v) on all processes instead of
> copying them.
>
> -erik
>
> On Thu, Nov 27, 2014 at 10:20 AM, <[email protected]> wrote:
> > Hey everyone,
> >
> > I've been looking at parallel programming in Julia and was getting some
> > very unexpected results and rather bad performance because of this.
> > Sadly I ran out of ideas of what could be going on, having disproved all
> > the ideas I had.
> > Hence this post :)
> >
> > I was able to construct a similar (simpler) example which exhibits the
> > same behavior (attached file).
> > The example is a very naive and suboptimal implementation in many ways
> > (the actual code is much more optimized), but that's not the issue.
> >
> > The issue I'm trying to investigate is the big difference in worker time
> > when a single worker is active and when multiple are active.
> >
> > Ideas I disproved:
> > - julia processes pinned to a single core
> > - julia process uses multiple threads to do the work, and processes are
> >   fighting for the cores
> > - not enough cores on the machine (there are plenty)
> > - htop nicely shows 4 julia processes working on different cores
> > - there is no communication at the application level stalling anyone
> >
> > All I'm left with now is that julia is doing some hidden synchronization
> > somewhere.
> > Any input is appreciated. Thanks in advance.
> >
> > Kind regards,
> > Tom
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
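For reference, Erik's proposed remedy - keeping A and only each worker's slice of v resident on the workers instead of copying them with every call - might be sketched as follows. This assumes the Julia v0.3-era primitives; the names my_v, my_range, and do_stuff_local are hypothetical, and generating the slice locally with randn stands in for fetching the real slice from the master:

```julia
addprocs(4)

const numx = 10_000
chunks = Base.splitrange(numx, nworkers())

# Every worker gets its own copy of A.
@everywhere const A = randn(1000, 1000)

# Hand each worker only its range and its slice of v, stored as
# globals on that worker so later calls find the data locally.
@sync for (i, p) in enumerate(workers())
    r = chunks[i]
    @async remotecall_wait(p, rng -> begin
        global my_range = rng
        global my_v = randn(1000, length(rng))  # stand-in for the real slice
    end, r)
end

@everywhere function do_stuff_local()
    for j in 1:size(my_v, 2)
        sum(A * my_v[:, j])
    end
end

# Each worker now operates purely on local data; no per-call copies of A or v.
@sync for p in workers()
    @async remotecall_wait(p, do_stuff_local)
end
```

The point of the design is that communication happens once, at setup time, rather than on every invocation of the work function.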
