I tried after factoring out the cost of sending A and v over, but no dice.

See https://gist.github.com/amitmurthy/3206a4f61cf6cd6000ee

Even with a loop of 4 iterations within do_stuff, the behavior is the same.

I think I found the reason in an old thread -
https://groups.google.com/d/msg/julia-users/jlKoEtErRL4/UTN7FSlZDgoJ

To confirm this, save the following

----------------------------
# Shared test data: a 1000x1000 matrix and 10_000 column vectors.
const A = randn(1000, 1000);
const numx = 10_000;
const v = randn(1000, numx);

# Run do_stuff over every index in a range, discarding the results.
do_stuff(r::UnitRange) = ([do_stuff(mm) for mm in r]; nothing)

# Repeat the matrix-vector product a few times to raise the
# compute-to-communication ratio.
function do_stuff(mm::Int64)
    for x in 1:4
        sum( A * v[:, mm] )
    end
end

# Split the indices into 4 chunks; each process works on only one.
chunks = Base.splitrange(numx, 4)

do_stuff(chunks[1]);
----------------------------

in a new file p.jl. Then, from the command line, run four instances in
parallel by launching

julia p.jl &

four times in quick succession.

The performance is heavily degraded compared to the serial version.
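For reference, here is a rough sketch of the suggested fix below (storing A and v on every worker with @everywhere instead of shipping them with each call). This is not the original setup: it uses the current Distributed standard library rather than Base.splitrange, hypothetical smaller sizes, and each worker draws its own random data, so only the timing behavior, not the numbers, is comparable:

```julia
using Distributed
addprocs(4)

# Define the data once on every worker instead of serializing it
# with each remote call. (Sizes here are made up and smaller than
# in the original example, just to keep the sketch quick to run.)
@everywhere begin
    const A = randn(100, 100)
    const numx = 400
    const v = randn(100, numx)
    do_stuff(mm::Int) = sum(A * v[:, mm])
end

# Each worker reads its own local copy of A and v; only the index
# range travels over the wire.
chunks = [1:100, 101:200, 201:300, 301:400]
results = pmap(r -> sum(do_stuff(mm) for mm in r), chunks)
```

To share the same part of v per worker rather than all of it, one would instead construct and send just that worker's slice, as suggested below.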



On Thu, Nov 27, 2014 at 10:35 PM, Erik Schnetter <[email protected]>
wrote:

> You need to use addprocs before the first @everywhere. I assume you
> actually did this, since otherwise you'd have received an error.
>
> It seems that your variables A and v are stored on the master, not on
> the workers. Since they are inputs to do_stuff, they need to be copied
> there every time. Note that the whole array v is copied although only
> part of it is accessed. Maybe sending data to 4 processes
> simultaneously has an overhead, and is somehow much slower than
> sending the data one at a time.
>
> To check whether this is true, you can add a loop within do_stuff to
> execute the routine multiple times. This would increase the workload,
> but keep the communication overhead the same.
>
> If this is the case, then to remedy this, you would store A and v
> (better: only the necessary part of v) on all processes instead of
> copying them.
>
> -erik
>
>
> On Thu, Nov 27, 2014 at 10:20 AM,  <[email protected]> wrote:
> > Hey everyone,
> >
> > I've been looking at parallel programming in Julia and was getting some
> > very unexpected results and rather bad performance because of this.
> > Sadly I ran out of ideas of what could be going on, disproving all ideas
> > I had. Hence this post :)
> >
> > I was able to construct a similar (simpler) example which exhibits the
> > same behavior (attached file).
> > The example is a very naive and suboptimal implementation in many ways
> > (the actual code is much better optimized), but that's not the issue.
> >
> > The issue I'm trying to investigate is the big difference in worker time
> > when a single worker is active and when multiple are active.
> >
> > Ideas I disproved:
> >   - julia processes pinned to a single core
> >   - julia process uses multiple threads to do the work, and processes are
> > fighting for the cores
> >   - not enough cores on the machine (there are plenty)
> >   - htop nicely shows 4 julia processes working on different cores
> >   - there is no communication at the application level stalling anyone
> >
> > All I'm left with now is that julia is doing some hidden synchronization
> > somewhere.
> > Any input is appreciated. Thanks in advance.
> >
> > Kind regards,
> > Tom
>
>
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>