A typical cache is much smaller than these matrices. And the L3 cache is probably shared between all cores, so it would have to hold four copies of the data (one per worker process).

-erik
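A rough back-of-the-envelope check, assuming Float64 elements and the 1000x1000 A and 1000x10000 v from the test script quoted below (variable names here are just for illustration):

----------------------------
# Approximate footprint of the test data (Float64 = 8 bytes per element).
a_bytes = 8 * 1000 * 1000            # A: 8 MB
v_bytes = 8 * 1000 * 10_000          # v: 80 MB
per_process = a_bytes + v_bytes      # ~88 MB held by each julia process
four_processes = 4 * per_process
println(four_processes / 1e6, " MB") # 352 MB across the four processes
----------------------------

Even a single copy is far bigger than an L3 cache of a few tens of MB, so with four processes running, the working set lives in main memory and the processes end up competing for memory bandwidth.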
On Fri, Nov 28, 2014 at 5:22 PM, <[email protected]> wrote:
> I'd have to investigate this further, but it seems quite odd that there
> would be a memory bandwidth problem, since the matrices are quite small and
> should even fit in cache.
> I don't know how julia is using so much memory. 166 MB seems really steep.
>
> On Thursday, November 27, 2014 6:47:34 PM UTC+1, Amit Murthy wrote:
>>
>> I tried after factoring out the cost of sending A and v over, but no dice.
>>
>> See https://gist.github.com/amitmurthy/3206a4f61cf6cd6000ee
>>
>> Even with a loop of 4 within do_stuff, same behavior.
>>
>> I think I found the reason in an old thread -
>> https://groups.google.com/d/msg/julia-users/jlKoEtErRL4/UTN7FSlZDgoJ
>>
>> To confirm the same, save
>>
>> ----------------------------
>> const A = randn(1000, 1000);
>> const numx = 10_000;
>> const v = randn(1000, numx);
>>
>> do_stuff(r::UnitRange) = ([do_stuff(mm) for mm in r]; nothing)
>> function do_stuff(mm::Int64)
>>     for x in 1:4
>>         sum( A * v[:, mm] )
>>     end
>> end
>>
>> chunks = Base.splitrange(numx, 4)
>>
>> do_stuff(chunks[1]);
>> ----------------------------
>>
>> in a new file p.jl, and from the command line run it in parallel like this
>>
>> julia p.jl &
>>
>> four times in quick succession.
>>
>> The performance is heavily degraded compared to the serial version.
>>
>> On Thu, Nov 27, 2014 at 10:35 PM, Erik Schnetter <[email protected]> wrote:
>>>
>>> You need to use addprocs before the first @everywhere. I assume you
>>> actually did this, since otherwise you'd have received an error.
>>>
>>> It seems that your variables A and v are stored on the master, not on
>>> the workers. Since they are inputs to do_stuff, they need to be copied
>>> there every time. Note that the whole array v is copied although only
>>> part of it is accessed. Maybe sending data to 4 processes
>>> simultaneously has an overhead, and is somehow much slower than
>>> sending the data one at a time.
>>>
>>> To check whether this is true, you can add a loop within do_stuff to
>>> execute the routine multiple times. This would increase the workload,
>>> but keep the communication overhead the same.
>>>
>>> If this is the case, then to remedy it, you would store A and v
>>> (better: only the necessary part of v) on all processes instead of
>>> copying them.
>>>
>>> -erik
>>>
>>> On Thu, Nov 27, 2014 at 10:20 AM, <[email protected]> wrote:
>>> > Hey everyone,
>>> >
>>> > I've been looking at parallel programming in julia and was getting some
>>> > very unexpected results and rather bad performance because of this.
>>> > Sadly, I ran out of ideas of what could be going on, having disproved
>>> > all the ideas I had. Hence this post :)
>>> >
>>> > I was able to construct a similar (simpler) example which exhibits the
>>> > same behavior (attached file).
>>> > The example is very naive and suboptimal in many ways (the actual code
>>> > is much better optimized), but that's not the issue.
>>> >
>>> > The issue I'm trying to investigate is the big difference in worker
>>> > time when a single worker is active and when multiple are active.
>>> >
>>> > Ideas I disproved:
>>> > - julia processes pinned to a single core
>>> > - julia process uses multiple threads to do the work, and processes
>>> >   are fighting for the cores
>>> > - not enough cores on the machine (there are plenty)
>>> > - htop nicely shows 4 julia processes working on different cores
>>> > - there is no communication at the application level stalling anyone
>>> >
>>> > All I'm left with now is that julia is doing some hidden
>>> > synchronization somewhere.
>>> > Any input is appreciated. Thanks in advance.
>>> >
>>> > Kind regards,
>>> > Tom
>>>
>>> --
>>> Erik Schnetter <[email protected]>
>>> http://www.perimeterinstitute.ca/personal/eschnetter/

--
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
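For reference, a minimal sketch of the remedy Erik describes above: keep A and v resident on every worker so that only a small UnitRange crosses the wire per task. It assumes 4 workers, the 0.3-era parallel primitives used elsewhere in this thread, and the Base.splitrange helper from Amit's script; each process also generates its own random copy of the data, which is fine for a timing comparison but not for real work.

----------------------------
addprocs(4)

# Define the data and the kernel once on every process, instead of
# shipping A and the whole of v from the master with each remote call.
@everywhere begin
    const A = randn(1000, 1000)
    const numx = 10_000
    const v = randn(1000, numx)

    function do_stuff(r::UnitRange)
        for mm in r
            sum(A * v[:, mm])
        end
    end
end

# Hand each worker a disjoint range of columns; only the range itself
# is sent over the wire.
chunks = Base.splitrange(numx, nworkers())
refs = []
for (i, w) in enumerate(workers())
    r = @spawnat w do_stuff(chunks[i])
    push!(refs, r)
end
for r in refs
    fetch(r)    # block until every chunk is done
end
----------------------------

Compared with passing A and v as arguments to do_stuff, each task now ships a few bytes instead of something on the order of the 88 MB of A plus v, which removes the serialization and copying overhead from the communication path (the pressure on the shared L3 from four resident copies of the data remains, of course).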
