No explanation for the uneven distribution of the 40 milliseconds though.
On Thu, Mar 27, 2014 at 6:11 PM, Amit Murthy <[email protected]> wrote:

> There is a pattern here. For a set of pids, the cumulative sum is 40
> milliseconds. In a SharedArray, RemoteRefs are maintained on the creating
> pid (in this case 1) to the shmem mappings on each of the workers. I think
> the workers are referring back to pid 1 to fetch the local mapping when
> the shared array object is passed in the remotecall_fetch call, and hence
> all the workers are stuck waiting for pid 1 to become free to service
> these calls.
>
> On Thu, Mar 27, 2014 at 5:58 PM, Amit Murthy <[email protected]> wrote:
>
>> Some more weirdness.
>>
>> Starting with julia -p 8:
>>
>> A = Base.shmem_fill(1, (1000, 1000))
>>
>> Using 2 workers:
>>
>> for i in 1:100
>>     t1 = time(); p = 2 + (i % 2); remotecall_fetch(p, x->1, A); t2 = time()
>>     println("@ $p ", int((t2 - t1) * 1000))
>> end
>>
>> prints
>>
>> ...
>> @ 3 8
>> @ 2 32
>> @ 3 8
>> @ 2 32
>> @ 3 8
>> @ 2 32
>> @ 3 8
>> @ 2 32
>>
>> Notice that pid 2 always takes 32 milliseconds while pid 3 always takes 8.
>>
>> With 4 workers:
>>
>> for i in 1:100
>>     t1 = time(); p = 2 + (i % 4); remotecall_fetch(p, x->1, A); t2 = time()
>>     println("@ $p ", int((t2 - t1) * 1000))
>> end
>>
>> ...
>> @ 2 31
>> @ 3 4
>> @ 4 4
>> @ 5 1
>> @ 2 31
>> @ 3 4
>> @ 4 4
>> @ 5 1
>> @ 2 31
>> @ 3 4
>> @ 4 4
>> @ 5 1
>> @ 2 31
>>
>> Now pid 2 always takes 31 milliseconds, pids 3 and 4 take 4 milliseconds
>> each, and pid 5 takes 1 millisecond.
>>
>> With 8 workers:
>>
>> for i in 1:100
>>     t1 = time(); p = 2 + (i % 8); remotecall_fetch(p, x->1, A); t2 = time()
>>     println("@ $p ", int((t2 - t1) * 1000))
>> end
>>
>> ...
>> @ 2 20
>> @ 3 4
>> @ 4 1
>> @ 5 3
>> @ 6 4
>> @ 7 1
>> @ 8 2
>> @ 9 4
>> @ 2 20
>> @ 3 4
>> @ 4 1
>> @ 5 3
>> @ 6 4
>> @ 7 1
>> @ 8 2
>> @ 9 4
>> @ 2 20
>> @ 3 4
>> @ 4 1
>> @ 5 3
>> @ 6 4
>> @ 7 1
>> @ 8 3
>> @ 9 4
>> @ 2 20
>> @ 3 4
>> @ 4 1
>> @ 5 3
>> @ 6 4
>>
>> pid 2 is always 20 milliseconds, while the rest are pretty consistent too.
>>
>> Any explanations?
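If the theory above is right — every call that ships the SharedArray makes a side trip to pid 1 to resolve the local mapping — then a call that ships no arguments should not show the fixed per-pid overhead. A minimal, untested sketch against the same 0.3-era API, reusing the `A` from the experiment above:

```julia
# Compare a remotecall_fetch that serializes the SharedArray against one
# that sends no arguments at all. If the per-call overhead comes from
# resolving A's shmem mapping via pid 1, only the second timing should
# show the extra milliseconds.
for p in workers()
    t1 = time(); remotecall_fetch(p, () -> 1);   t2 = time()  # nothing shipped
    t3 = time(); remotecall_fetch(p, x -> 1, A); t4 = time()  # A shipped
    println("@ $p  no-arg: ", int((t2 - t1) * 1000),
            " ms, with A: ", int((t4 - t3) * 1000), " ms")
end
```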
>>
>> On Thu, Mar 27, 2014 at 5:24 PM, Amit Murthy <[email protected]> wrote:
>>
>>> I think the code does not do what you want.
>>>
>>> In the non-shared case you are sending a 10^6-element integer array over
>>> the network 1000 times and summing it as many times. Most of the time is
>>> network traffic. Reduce 'n' to, say, 10, and you will see what I mean.
>>>
>>> In the shared case you are not sending the array over the network, but
>>> you are still summing the entire array 1000 times. Some of the
>>> remotecall_fetch calls seem to be taking 40 milliseconds of extra time,
>>> which adds to the total.
>>>
>>> The shared time of 6 seconds being less than the 15 seconds for
>>> non-shared seems to be just incidental.
>>>
>>> I don't yet have an explanation for the extra 40 milliseconds per
>>> remotecall_fetch (for some calls only) in the shared case.
>>>
>>> On Thu, Mar 27, 2014 at 2:50 PM, Mikael Simberg <[email protected]> wrote:
>>>
>>>> Hi,
>>>> I'm having some trouble figuring out exactly how I'm supposed to use
>>>> SharedArrays - I might just be misunderstanding them, or else something
>>>> odd is happening with them.
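The "sending a 10^6 integer array over the network" point can be made concrete with a little arithmetic (assuming `Uint` is 64-bit on this machine):

```julia
# Back-of-the-envelope for the non-shared case in the code below.
n = 1000
bytes_per_call = n * n * sizeof(Uint)    # ~8 MB serialized per remotecall_fetch
total_bytes    = bytes_per_call * 1000   # ~8 GB shipped over 1000 calls
println(total_bytes)
```

That ~8 GB of traffic lines up with the ~8.4 GB "bytes allocated" reported in the non-shared timing further down.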
>>>>
>>>> I'm trying to do some parallel computing which looks a bit like this
>>>> test case:
>>>>
>>>> function createdata(shared)
>>>>     const n = 1000
>>>>     if shared
>>>>         A = SharedArray(Uint, (n, n))
>>>>     else
>>>>         A = Array(Uint, (n, n))
>>>>     end
>>>>     for i = 1:n, j = 1:n
>>>>         A[i, j] = rand(Uint)
>>>>     end
>>>>
>>>>     return n, A
>>>> end
>>>>
>>>> function mainfunction(r; shared = false)
>>>>     n, A = createdata(shared)
>>>>
>>>>     i = 1
>>>>     nextidx() = (idx = i; i += 1; idx)
>>>>
>>>>     @sync begin
>>>>         for p in workers()
>>>>             @async begin
>>>>                 while true
>>>>                     idx = nextidx()
>>>>                     if idx > r
>>>>                         break
>>>>                     end
>>>>                     found, s = remotecall_fetch(p, parfunction, n, A)
>>>>                 end
>>>>             end
>>>>         end
>>>>     end
>>>> end
>>>>
>>>> function parfunction(n::Int, A::Array{Uint, 2})
>>>>     # possibly do some other computation here independent of shared arrays
>>>>     s = sum(A)
>>>>     return false, s
>>>> end
>>>>
>>>> function parfunction(n::Int, A::SharedArray{Uint, 2})
>>>>     s = sum(A)
>>>>     return false, s
>>>> end
>>>>
>>>> If I then start julia with e.g.
two worker processes, so julia -p 2, the
>>>> following happens:
>>>>
>>>> julia> require("testpar.jl")
>>>>
>>>> julia> @time mainfunction(1000, shared = false)
>>>> elapsed time: 15.717117365 seconds (8448701068 bytes allocated)
>>>>
>>>> julia> @time mainfunction(1000, shared = true)
>>>> elapsed time: 6.068758627 seconds (56713996 bytes allocated)
>>>>
>>>> julia> rmprocs([2, 3])
>>>> :ok
>>>>
>>>> julia> @time mainfunction(1000, shared = false)
>>>> elapsed time: 0.717638344 seconds (40357664 bytes allocated)
>>>>
>>>> julia> @time mainfunction(1000, shared = true)
>>>> elapsed time: 0.702174085 seconds (32680628 bytes allocated)
>>>>
>>>> So, with a normal array it's slow, as expected, and it is faster with
>>>> the shared array, but what seems to happen is that with the normal
>>>> array CPU usage is 100% on two cores, while with the shared array CPU
>>>> usage spikes for a fraction of a second and then sits at around 10%
>>>> for the remaining nearly 6 seconds. Can anyone reproduce this? Am I
>>>> just doing something wrong with shared arrays?
>>>>
>>>> On a slightly related note: is there now a way to create a random
>>>> shared array? https://github.com/JuliaLang/julia/pull/4939 and the
>>>> latest docs don't mention this.
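On the random-shared-array question at the end: one possible approach, sketched against the 0.3-era API (the `init` keyword and `localindexes` are assumed to be available on this build), is to pass an init function at construction time:

```julia
# Each worker runs the init function over its own local index range, so
# the array is filled with random values in parallel and without any
# extra copies being shipped between processes.
S = SharedArray(Uint, (1000, 1000); init = S -> begin
        for i in localindexes(S)
            S[i] = rand(Uint)
        end
    end)
```

This would also avoid the serial fill loop in `createdata` above, where pid 1 writes every element itself.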
