Hey Erik,

The addprocs() call is indeed in the wrong place; I must have made a mistake copy-pasting the example.
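For reference, a minimal sketch of the corrected ordering, with the timing taken on the worker itself. It assumes a current Julia 1.x, where these functions live in the Distributed standard library (in the 2014 releases they were part of Base and remotecall took the process id first). The function timed_work and the problem size are hypothetical stand-ins, not the attached example.

```julia
using Distributed  # assumption: Julia 1.x

# Start the workers first, so that the @everywhere below reaches them.
addprocs(2)

# Hypothetical stand-in for do_stuff. The clock starts and stops on the
# worker, so argument transfer and serialization are not included.
@everywhere function timed_work(n)
    x = rand(n)                  # data is created on the worker, nothing is shipped
    t0 = time_ns()
    s = sum(abs2, x)             # the kernel being timed
    t = (time_ns() - t0) / 1e9   # elapsed seconds on the worker
    return (t, s)
end

# Launch on all workers at once, then collect one (time, result) pair each.
futures = [remotecall(timed_work, p, 10^6) for p in workers()]
results = map(fetch, futures)
```

Comparing the first element of each pair between a run with one worker and a run with several is one way to check whether the kernel itself slows down, independent of communication.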
The overhead of copying A and v is obviously naive and suboptimal, as I pointed out. The real implementation I'm focused on is much more efficient in this regard.

That overhead isn't the problem I wanted to point out, however. The @time on the workers runs after all communication and serialization work is finished, so it should time just the actual execution. At least this is the behavior I expect/hope it has :)

The issue is that this runtime of the actual execution (without overhead) is somehow 2x larger when multiple workers are active.

Tom

On Thursday, November 27, 2014 6:06:06 PM UTC+1, Erik Schnetter wrote:
>
> You need to use addprocs before the first @everywhere. I assume you
> actually did this, since otherwise you'd have received an error.
>
> It seems that your variable A and v are stored on the master, not on
> the workers. Since they are inputs to do_stuff, they need to be copied
> there every time. Note that the whole array v is copied although only
> part of it is accessed. Maybe sending data to 4 processes
> simultaneously has an overhead, and is somehow much slower than
> sending the data one at a time.
>
> To check whether this is true, you can add a loop within do_stuff to
> execute the routine multiple times. This would increase the workload,
> but keep the communication overhead the same.
>
> If this is the case, then to remedy this, you would store A and v
> (better: only the necessary part of v) on all processes instead of
> copying them.
>
> -erik
>
>
> On Thu, Nov 27, 2014 at 10:20 AM, <[email protected]> wrote:
> > Hey everyone,
> >
> > I've been looking at parallel programming in julia and was getting some
> > very unexpected results and rather bad performance because of this.
> > Sadly I ran out of ideas of what could be going on, disproving all
> > ideas I had. Hence this post :)
> >
> > I was able to construct a similar (simpler) example which exhibits the
> > same behavior (attached file).
> > The example is a very naive and suboptimal implementation in many ways
> > (the actual code is much more optimal), but that's not the issue.
> >
> > The issue I'm trying to investigate is the big difference in worker time
> > when a single worker is active and when multiple are active.
> >
> > Ideas I disproved:
> > - julia processes pinned to a single core
> > - julia process uses multiple threads to do the work, and processes
> >   are fighting for the cores
> > - not enough cores on the machine (there are plenty)
> > - htop nicely shows 4 julia processes working on different cores
> > - there is no communication at the application level stalling anyone
> >
> > All I'm left with now is that julia is doing some hidden synchronization
> > somewhere.
> > Any input is appreciated. Thanks in advance.
> >
> > Kind regards,
> > Tom
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
