I'd have to investigate this further, but it seems quite odd that there
would be a memory bandwidth problem, since the matrices are quite small and
should even fit in cache.
I also don't know how julia ends up using so much memory; 166MB seems really steep.
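
For reference, a rough back-of-the-envelope check of the array sizes in the example quoted below (just illustrative arithmetic; Float64 is 8 bytes):

1000 * 1000 * sizeof(Float64)      # A: 8_000_000 bytes, ~7.6 MiB -- fits in a large shared L3 cache
1000 * 10_000 * sizeof(Float64)    # v: 80_000_000 bytes, ~76 MiB -- well beyond any cache

On top of that there is the runtime's own baseline footprint, which I haven't measured.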

On Thursday, November 27, 2014 6:47:34 PM UTC+1, Amit Murthy wrote:
>
> I tried after factoring out the cost of sending A and v over, but no dice.
>
> See https://gist.github.com/amitmurthy/3206a4f61cf6cd6000ee
>
> Even with a loop of 4 within do_stuff, same behavior.
>
> I think I found the reason in an old thread - 
> https://groups.google.com/d/msg/julia-users/jlKoEtErRL4/UTN7FSlZDgoJ
>
> To confirm the same, save
>
> ----------------------------
> const A = randn(1000, 1000);
> const numx = 10_000;
> const v = randn(1000, numx);
>
> do_stuff(r::UnitRange) = ([do_stuff(mm) for mm in r]; nothing)
> function do_stuff(mm::Int64)
>     for x in 1:4
>         sum( A * v[:, mm] )
>     end
> end
>
> chunks = Base.splitrange(numx, 4)
>
> do_stuff(chunks[1]);
> ----------------------------
>
> in a new file p.jl, and from the command line run it in parallel like this
>
> julia p.jl &
>
> four times in quick succession.
>
> The performance is heavily degraded compared to the serial version
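>
> For a serial baseline to compare against, something like this should work
> (assuming p.jl is in the current directory; the numbers will of course be
> machine-dependent):
>
> include("p.jl")              # defines everything and runs do_stuff(chunks[1]) once, compiling it
> @time do_stuff(chunks[1])    # time the warm serial run and compare against the four parallel runs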
>
>  
>
> On Thu, Nov 27, 2014 at 10:35 PM, Erik Schnetter <[email protected]> wrote:
>
>> You need to use addprocs before the first @everywhere. I assume you
>> actually did this, since otherwise you'd have received an error.
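>>
>> As a minimal illustration of the ordering (since @everywhere only reaches the
>> worker processes that exist at the moment it runs):
>>
>> addprocs(4)               # start the workers first
>> @everywhere f() = myid()  # now defined on the master and on all 4 workers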
>>
>> It seems that your variables A and v are stored on the master, not on
>> the workers. Since they are inputs to do_stuff, they need to be copied
>> there every time. Note that the whole array v is copied although only
>> part of it is accessed. Maybe sending data to 4 processes
>> simultaneously has an overhead, and is somehow much slower than
>> sending the data one at a time.
>>
>> To check whether this is true, you can add a loop within do_stuff to
>> execute the routine multiple times. This would increase the workload,
>> but keep the communication overhead the same.
>>
>> If this is the case, then to remedy this, you would store A and v
>> (better: only the necessary part of v) on all processes instead of
>> copying them.
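>>
>> A rough sketch of that idea (Julia 0.3-era syntax, illustrative names; here
>> each worker simply generates its own data locally, so only a UnitRange is
>> shipped with each remote call):
>>
>> addprocs(4)
>>
>> @everywhere const A = randn(1000, 1000)   # every worker holds its own copy of A
>> @everywhere const numx = 10_000
>> @everywhere const v = randn(1000, numx)   # better still: give each worker only its columns
>>
>> @everywhere function do_chunk(r::UnitRange)
>>     for mm in r
>>         sum( A * v[:, mm] )               # touches only data local to this worker
>>     end
>> end
>>
>> chunks = Base.splitrange(numx, nworkers())
>> refs = [remotecall(w, do_chunk, c) for (w, c) in zip(workers(), chunks)]
>> map(fetch, refs)                          # wait for all workers to finish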
>>
>> -erik
>>
>>
>> On Thu, Nov 27, 2014 at 10:20 AM, <[email protected]> wrote:
>> > Hey everyone,
>> >
>> > I've been looking at parallel programming in julia and was getting some
>> > very unexpected results, and rather bad performance because of them.
>> > Sadly I've run out of ideas about what could be going on, having disproved
>> > every idea I had. Hence this post :)
>> >
>> > I was able to construct a similar (simpler) example which exhibits the
>> > same behavior (attached file).
>> > The example is a very naive and suboptimal implementation in many ways
>> > (the actual code is much better optimized), but that's not the issue.
>> >
>> > The issue I'm trying to investigate is the big difference in worker time
>> > when a single worker is active and when multiple are active.
>> >
>> > Ideas I disproved:
>> >   - julia processes pinned to a single core
>> >   - julia process uses multiple threads to do the work, and processes are
>> >     fighting for the cores
>> >   - not enough cores on the machine (there are plenty)
>> >   - htop nicely shows 4 julia processes working on different cores
>> >   - there is no communication at the application level stalling anyone
>> >
>> > All I'm left with now is that julia is doing some hidden synchronization
>> > somewhere.
>> > Any input is appreciated. Thanks in advance.
>> >
>> > Kind regards,
>> > Tom
>>
>>
>>
>> --
>> Erik Schnetter <[email protected]>
>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>
>
>
