I'd have to investigate this further, but it seems quite odd that there would be a memory bandwidth problem, since the matrices are quite small and should even fit in cache. I don't know how Julia is using so much memory; 166 MB seems really steep.
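If that 166 MB is the allocation reported for one chunk, the temporaries alone would roughly account for it: every A * v[:, mm] allocates a copy of the column (v[:, mm], ~8 KB) plus the result vector (~8 KB), times 4 repetitions, times 2500 columns per chunk, which is about 160 MB of garbage. Here is a minimal sketch of an allocation-free variant (untested against this exact setup; mul! is the in-place multiply on current Julia, A_mul_B! back on 0.3, and do_stuff_noalloc is just an illustrative name):

----------------------------
using LinearAlgebra   # for mul!

const A = randn(1000, 1000)
const v = randn(1000, 10_000)

function do_stuff_noalloc(r::UnitRange)
    y = Vector{Float64}(undef, size(A, 1))   # one reusable output buffer
    for mm in r, x in 1:4
        # in-place matvec: no column copy, no result temporary
        mul!(y, A, view(v, :, mm))
        sum(y)
    end
    nothing
end
----------------------------

If the degradation with four concurrent processes really comes from all of them generating garbage at once (as the old thread Amit links below suggests), removing the allocations should close most of the gap.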
On Thursday, November 27, 2014 6:47:34 PM UTC+1, Amit Murthy wrote:
>
> I tried after factoring out the cost of sending A and v over, but no dice.
> See https://gist.github.com/amitmurthy/3206a4f61cf6cd6000ee
>
> Even with a loop of 4 within do_stuff, same behavior.
>
> I think I found the reason in an old thread -
> https://groups.google.com/d/msg/julia-users/jlKoEtErRL4/UTN7FSlZDgoJ
>
> To confirm the same, save
>
> ----------------------------
> const A = randn(1000, 1000);
> const numx = 10_000;
> const v = randn(1000, numx);
>
> do_stuff(r::UnitRange) = ([do_stuff(mm) for mm in r]; nothing)
> function do_stuff(mm::Int64)
>     for x in 1:4
>         sum(A * v[:, mm])
>     end
> end
>
> chunks = Base.splitrange(numx, 4)
>
> do_stuff(chunks[1]);
> ----------------------------
>
> in a new file p.jl, and from the command line launch four of them in quick
> succession so they run in parallel:
>
> julia p.jl &
>
> The performance is heavily degraded compared to the serial version.
>
> On Thu, Nov 27, 2014 at 10:35 PM, Erik Schnetter <[email protected]> wrote:
>
>> You need to use addprocs before the first @everywhere. I assume you
>> actually did this, since otherwise you'd have received an error.
>>
>> It seems that your variables A and v are stored on the master, not on
>> the workers. Since they are inputs to do_stuff, they need to be copied
>> there every time. Note that the whole array v is copied although only
>> part of it is accessed. Maybe sending data to 4 processes
>> simultaneously has an overhead and is somehow much slower than
>> sending the data one process at a time.
>>
>> To check whether this is true, you can add a loop within do_stuff to
>> execute the routine multiple times. This would increase the workload
>> but keep the communication overhead the same.
>>
>> If this is the case, then to remedy it you would store A and v
>> (better: only the necessary part of v) on all processes instead of
>> copying them.
>>
>> -erik
>>
>> On Thu, Nov 27, 2014 at 10:20 AM, <[email protected]> wrote:
>>> Hey everyone,
>>>
>>> I've been looking at parallel programming in Julia and was getting
>>> some very unexpected results, and rather bad performance because of it.
>>> Sadly I ran out of ideas about what could be going on, having disproved
>>> every idea I had. Hence this post :)
>>>
>>> I was able to construct a similar (simpler) example that exhibits the
>>> same behavior (attached file).
>>> The example is a very naive and suboptimal implementation in many ways
>>> (the actual code is much more optimized), but that's not the issue.
>>>
>>> The issue I'm trying to investigate is the big difference in worker
>>> time when a single worker is active versus when multiple are active.
>>>
>>> Ideas I disproved:
>>> - Julia processes pinned to a single core
>>> - the Julia processes using multiple threads to do the work and
>>>   fighting over the cores
>>> - not enough cores on the machine (there are plenty; htop nicely
>>>   shows 4 Julia processes working on different cores)
>>> - communication at the application level stalling anyone
>>>
>>> All I'm left with now is that Julia is doing some hidden
>>> synchronization somewhere.
>>> Any input is appreciated. Thanks in advance.
>>>
>>> Kind regards,
>>> Tom
>>
>> --
>> Erik Schnetter <[email protected]>
>> http://www.perimeterinstitute.ca/personal/eschnetter/
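For completeness, Erik's suggestion of keeping the data resident on the workers instead of shipping it with every call would look roughly like this (a sketch, not tested; on current Julia the parallel primitives need "using Distributed" first, and I compute the chunks inline since Base.splitrange is an internal helper whose signature has changed across versions):

----------------------------
using Distributed
addprocs(4)

@everywhere begin
    const A = randn(1000, 1000)   # defined once on every process
    const numx = 10_000
    const v = randn(1000, numx)   # ideally each worker would hold only its own slice

    function do_stuff(r::UnitRange)
        for mm in r, x in 1:4
            sum(A * v[:, mm])
        end
    end
end

# stand-in for Base.splitrange(numx, 4): four equal chunks of columns
chunks = [(1:2500) .+ 2500k for k in 0:3]

refs = [@spawnat w do_stuff(chunks[i]) for (i, w) in enumerate(workers())]
foreach(wait, refs)
----------------------------

With this layout each remote call only ships a UnitRange over the wire; the arrays never move. (Each process builds its own random copy of A and v here, which is fine for timing but obviously not for real data.)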
