Mikael, any type of parallelism has implicit overhead. You are not doing
nearly enough work per remote call to amortize that overhead.
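
For example, instead of scheduling one remote call per parfunction call, batch
the iterations so that each worker gets one big chunk and the messaging
overhead is paid only once per worker. A rough, untested sketch (runbatch and
mainfunction_batched are made-up names; parfunction must be defined on all
processes, as in your file):

@everywhere function runbatch(A::SharedArray, s::SharedArray, k::Int)
    # run k iterations locally; one remote call amortizes over all of them
    for i = 1:k
        parfunction(A, s)
    end
end

function mainfunction_batched(r, A, s)
    np = nworkers()
    counts = fill(div(r, np), np)     # iterations per worker
    for i = 1:rem(r, np)
        counts[i] += 1                # spread the remainder
    end
    @sync begin
        for (i, p) in enumerate(workers())
            @async remotecall_wait(p, runbatch, A, s, counts[i])
        end
    end
end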


 On Thursday, March 27, 2014 2:31:14 PM UTC-4, Mikael Simberg wrote:
>
> All right, thanks! That gets rid of the error. 
>  
> If you don't mind, though, I have some more questions, because I'm still
> either doing something wrong or my expectations are too high. I again have
> code similar to before:
>  
> function mainfunction(r; single = false)
>  
>     const n = 1000
>     A = SharedArray(Uint, (n, n))
>     for i = 1:n, j = 1:n
>         A[i, j] = rand(Uint)
>     end
>     s = SharedArray(Uint, (nprocs()))
>  
>     i = 1
>     nextidx() = (idx = i; i += 1; idx)
>  
>     if single
>         for i = 1:r
>             parfunction(A, s)
>         end
>     else
>         @sync begin
>             for p in workers()
>                 @async begin
>                     while true
>                         idx = nextidx()
>                         if idx > r
>                             break
>                         end
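>                         # note: @spawn runs on an arbitrary worker, so p is unused here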
>                         @spawn parfunction(A, s)
>                     end
>                 end
>             end
>         end
>     end
> end
>  
> function parfunction(A::SharedArray{Uint, 2}, s::SharedArray{Uint, 1})
>     d = sum(A)
>     if rand(0:1000) == 0
>         s[1] = d
>     end
> end
>  
> I start julia with -p 4, run the function once first (to compile it), and
> then:
>  
> julia> @time mainfunction(10000, single = false)
> elapsed time: 8.762498191 seconds (403413200 bytes allocated)
>  
> julia> @time mainfunction(10000, single = false)
> elapsed time: 7.433603658 seconds (477360360 bytes allocated)
>  
> julia> @time mainfunction(10000, single = false)
> elapsed time: 7.379368673 seconds (477509296 bytes allocated)
>  
> julia> rmprocs([2, 3, 4, 5])
> :ok
>  
> julia> @time mainfunction(10000, single = false)
> elapsed time: 7.3204546 seconds (56925308 bytes allocated)
>  
> julia> @time mainfunction(10000, single = false)
> elapsed time: 10.21160855 seconds (45163964 bytes allocated)
>  
> julia> @time mainfunction(10000, single = false)
> elapsed time: 10.133408252 seconds (44785608 bytes allocated)
>  
> julia> @time mainfunction(10000, single = true)
> elapsed time: 9.117599749 seconds (23997456 bytes allocated)
>  
> julia> @time mainfunction(10000, single = true)
> elapsed time: 6.193505021 seconds (23997488 bytes allocated)
>  
> julia> @time mainfunction(10000, single = true)
> elapsed time: 6.189335567 seconds (23997552 bytes allocated)
>  
> So besides the times varying quite a lot, is this as good as I can hope
> for with a function like mine, i.e. is there in practice no speedup? Also,
> is the separate for loop above, which just runs everything on a single
> process, the best way to handle the case where I know I have only one
> process? (I obviously know how many processes I have from nworkers(), but
> I don't know whether, for example, pmap is slower than a normal map on a
> single process.)
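>  
> (For reference, the minimal dispatch I have in mind, with runserial and
> runparallel as placeholder names, is just:
>  
> result = nworkers() > 1 ? runparallel(r) : runserial(r)
>  
> i.e. one top-level entry point either way.)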
>  
>  
>  
> On Thu, Mar 27, 2014, at 7:06, Amit Murthy wrote:
>
> Hi Mikael,
>  
> This seems to be a bug in the SharedArray constructor. For SharedArrays of
> length less than the number of participating pids, only the first few pids
> are used. Since the length of s = SharedArray(Uint, (1)) is 1, it is
> mapped onto the first process only.
>  
> For now, a workaround is to create s as, say, SharedArray(Uint, (10)) and
> just use the first element.
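>  
> (To check which processes a SharedArray is mapped onto, procs(s) should
> show it, assuming procs(::SharedArray) is available in your build:
>  
> s = SharedArray(Uint, (1))
> procs(s)    # with -p 4, does not list all participating pids
>  
> whereas a length of at least nprocs() maps it everywhere.)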
>  
>  
> On Thu, Mar 27, 2014 at 7:13 PM, Mikael Simberg
> <[email protected]> wrote:
>  
>
> Yes, you're at least half right about it not doing quite what I want. Or
> rather, I was expecting the majority of the overhead to come from having
> to send the array over to each process; what I wasn't expecting was that
> getting a boolean and an integer back would take so much time (and thus I
> was expecting that using a SharedArray would be at least comparable to
> keeping everything local). Indeed, if I just do a remotecall (i.e. without
> the fetch), it is faster with multiple processes, which is what I was
> expecting.
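>  
> (Roughly, the difference is between the fire-and-forget
>  
> remotecall(p, parfunction, A, s)
>  
> and the blocking round trip
>  
> res = remotecall_fetch(p, parfunction, A, s)
>  
> where the latter also serializes the return value back to process 1 and
> waits for it.)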
>  
> What I essentially want in the end is for parfunction() to succeed with
> some probability, in which case I return some object from the calculations
> there; in general, though, I will not want to fetch anything. What would be
> the "correct" way to do that? I have the following code:
>  
> function mainfunction(r)
>  
>     const n = 1000
>     A = SharedArray(Uint, (n, n))
>     for i = 1:n, j = 1:n
>         A[i, j] = rand(Uint)
>     end
>     s = SharedArray(Uint, (1))
>  
>     i = 1
>     nextidx() = (idx = i; i += 1; idx)
>  
>     println(s)
>     @sync begin
>         for p in workers()
>             @async begin
>                 while true
>                     idx = nextidx()
>                     if idx > r
>                         break
>                     end
>                     remotecall(p, parfunction, A, s)
>                 end
>             end
>         end
>     end
>     println(s)
> end
>  
> function parfunction(A::SharedArray{Uint, 2}, s::SharedArray{Uint, 1})
>     d = sum(A)
>     if rand(0:1000) == 0
>         println("success")
>         s[1] = d
>     end
> end
>  
> and run
> julia -p 2
> julia> reload("testpar.jl")                                        
> julia> @time mainfunction(5000)
>  
> I get "ERROR: SharedArray cannot be used on a non-participating process",
> although by my logic s should be available on all processes (I'm assuming
> it's s that's causing this, because everything is fine if I remove all
> traces of s).
>  
> On Thu, Mar 27, 2014, at 4:54, Amit Murthy wrote:
>
> I think the code does not do what you want.
>  
> In the non-shared case you are sending a 10^6-element integer array over
> the network 1000 times and summing it as many times. Most of the time is
> network traffic. Reduce n to, say, 10, and you will see what I mean.
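>  
> (For illustration, and assuming worker 2 exists, the pure transfer cost can
> be seen with a closure that discards its argument:
>  
> A = zeros(Uint, 1000, 1000)
> @time remotecall_fetch(2, x -> 0, A)    # dominated by shipping A across
>  
> which isolates the serialization/network time from the summing time.)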
>  
> In the shared case you are not sending the array over the network, but you
> are still summing the entire array 1000 times. Some of the remotecall_fetch
> calls seem to be taking an extra 40 milliseconds, which adds to the total.
>  
> The shared time of 6 seconds being less than the 15 seconds for the
> non-shared case seems to be incidental.
>  
> I don't yet have an explanation for the extra 40 milliseconds per
> remotecall_fetch (for some calls only) in the shared case.
>  
>  
> On Thu, Mar 27, 2014 at 2:50 PM, Mikael Simberg
> <[email protected]> wrote:
>
> Hi,
>  I'm having some trouble figuring out exactly how I'm supposed to use
>  SharedArrays - I might just be misunderstanding them or else something
>  odd is happening with them.
>  
>  I'm trying to do some parallel computing which looks a bit like this
>  test case:
>  
>  function createdata(shared)
>      const n = 1000
>      if shared
>          A = SharedArray(Uint, (n, n))
>      else
>          A = Array(Uint, (n, n))
>      end
>      for i = 1:n, j = 1:n
>          A[i, j] = rand(Uint)
>      end
>  
>      return n, A
>  end
>  
>  function mainfunction(r; shared = false)
>      n, A = createdata(shared)
>  
>      i = 1
>      nextidx() = (idx = i; i += 1; idx)
>  
>      @sync begin
>          for p in workers()
>              @async begin
>                  while true
>                      idx = nextidx()
>                      if idx > r
>                          break
>                      end
>                      found, s = remotecall_fetch(p, parfunction, n, A)
>                  end
>              end
>          end
>      end
>  end
>  
>  function parfunction(n::Int, A::Array{Uint, 2})
>      # possibly do some other computation here, independent of shared arrays
>      s = sum(A)
>      return false, s
>  end
>  
>  function parfunction(n::Int, A::SharedArray{Uint, 2})
>      s = sum(A)
>      return false, s
>  end
>  
>  If I then start julia with e.g. two worker processes, so julia -p 2, the
>  following happens:
>  
>  julia> require("testpar.jl")
>  
>  julia> @time mainfunction(1000, shared = false)
>  elapsed time: 15.717117365 seconds (8448701068 bytes allocated)
>  
>  julia> @time mainfunction(1000, shared = true)
>  elapsed time: 6.068758627 seconds (56713996 bytes allocated)
>  
>  julia> rmprocs([2, 3])
>  :ok
>  
>  julia> @time mainfunction(1000, shared = false)
>  elapsed time: 0.717638344 seconds (40357664 bytes allocated)
>  
>  julia> @time mainfunction(1000, shared = true)
>  elapsed time: 0.702174085 seconds (32680628 bytes allocated)
>  
>  So, with a normal array it's slow, as expected, and it's faster with the
>  shared array. But what seems to happen is that with the normal array CPU
>  usage is 100% on two cores, while with the shared array CPU usage spikes
>  for a fraction of a second and then sits at around 10% for the remaining
>  nearly 6 seconds. Can anyone reproduce this? Am I just doing something
>  wrong with shared arrays?
>  
>  Slightly related note: is there now a way to create a random shared
>  array? Neither https://github.com/JuliaLang/julia/pull/4939 nor the
>  latest docs mention one.
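>  
>  (One possibility, assuming the init keyword and localindexes work as in
>  the current SharedArray implementation, is to let each process fill its
>  own local chunk:
>  
>  S = SharedArray(Uint, (n, n); init = S -> (for i in localindexes(S); S[i] = rand(Uint); end))
>  
>  but I haven't verified this against the latest docs.)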