This is why, with my original implementation of SharedArrays (oh so long ago), I created pmap_bw, which busy-waits on the return value of a SharedArray computation. The amusing part is that you can use a SharedArray itself to do the synchronization among processes.
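A minimal sketch of that idea (invented here for illustration, not the actual pmap_bw): give each worker one slot in a shared "flags" array, have it raise its flag after writing its result into a shared results array, and spin on the flags from the master instead of fetching RemoteRefs. This assumes the 0.3-era API used elsewhere in this thread (remotecall with the pid first, SharedArray(T, dims)).

    # Hypothetical sketch: SharedArray-based synchronization.
    # Workers write results and raise flags directly in shared memory,
    # so no return message from any worker is needed for progress.
    np      = length(workers())
    flags   = SharedArray(Int, (np,))       # shmem starts zero-filled
    results = SharedArray(Float64, (np,))
    for (i, p) in enumerate(workers())
        remotecall(p, (f, r, i) -> (r[i] = sum(rand(10^6)); f[i] = 1),
                   flags, results, i)
    end
    while sum(flags) < np                   # busy-wait on shared memory
        yield()   # stay responsive to I/O tasks while spinning
    end
    println(results)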
--Tim

On Thursday, March 27, 2014 06:11:12 PM Amit Murthy wrote:
> There is a pattern here. For each set of pids, the per-cycle latencies sum
> to about 40 milliseconds. In a SharedArray, RemoteRefs are maintained on
> the creating pid (in this case 1) to the shmem mappings on each of the
> workers. I think the workers are referring back to pid 1 to fetch the
> local mapping when the shared array object is passed in the
> remotecall_fetch call, and hence all the workers are stuck waiting for
> pid 1 to become free to service these calls.
>
> On Thu, Mar 27, 2014 at 5:58 PM, Amit Murthy <[email protected]> wrote:
> > Some more weirdness.
> >
> > Starting with julia -p 8:
> >
> >     A = Base.shmem_fill(1, (1000, 1000))
> >
> > Using 2 workers:
> >
> >     for i in 1:100
> >         t1 = time(); p = 2 + (i % 2); remotecall_fetch(p, x -> 1, A); t2 = time()
> >         println("@ $p ", int((t2 - t1) * 1000))
> >     end
> >
> > prints
> >
> >     ...
> >     @ 3 8
> >     @ 2 32
> >     @ 3 8
> >     @ 2 32
> >     @ 3 8
> >     @ 2 32
> >     @ 3 8
> >     @ 2 32
> >
> > Notice that pid 2 always takes 32 milliseconds while pid 3 always takes 8.
> >
> > With 4 workers:
> >
> >     for i in 1:100
> >         t1 = time(); p = 2 + (i % 4); remotecall_fetch(p, x -> 1, A); t2 = time()
> >         println("@ $p ", int((t2 - t1) * 1000))
> >     end
> >
> >     ...
> >     @ 2 31
> >     @ 3 4
> >     @ 4 4
> >     @ 5 1
> >     @ 2 31
> >     @ 3 4
> >     @ 4 4
> >     @ 5 1
> >     @ 2 31
> >     @ 3 4
> >     @ 4 4
> >     @ 5 1
> >     @ 2 31
> >
> > Now pid 2 always takes 31 milliseconds, pids 3 and 4 take 4, and pid 5
> > takes 1 millisecond.
> >
> > With 8 workers:
> >
> >     for i in 1:100
> >         t1 = time(); p = 2 + (i % 8); remotecall_fetch(p, x -> 1, A); t2 = time()
> >         println("@ $p ", int((t2 - t1) * 1000))
> >     end
> >
> >     ...
> >     @ 2 20
> >     @ 3 4
> >     @ 4 1
> >     @ 5 3
> >     @ 6 4
> >     @ 7 1
> >     @ 8 2
> >     @ 9 4
> >     @ 2 20
> >     @ 3 4
> >     @ 4 1
> >     @ 5 3
> >     @ 6 4
> >     @ 7 1
> >     @ 8 2
> >     @ 9 4
> >     @ 2 20
> >     @ 3 4
> >     @ 4 1
> >     @ 5 3
> >     @ 6 4
> >     @ 7 1
> >     @ 8 3
> >     @ 9 4
> >     @ 2 20
> >     @ 3 4
> >     @ 4 1
> >     @ 5 3
> >     @ 6 4
> >
> > pid 2 always takes 20 milliseconds, while the rest are pretty consistent
> > too.
> >
> > Any explanations?
> >
> > On Thu, Mar 27, 2014 at 5:24 PM, Amit Murthy <[email protected]> wrote:
> >> I think the code does not do what you want.
> >>
> >> In the non-shared case you are sending a 10^6-element integer array
> >> over the network 1000 times and summing it as many times. Most of the
> >> time is network traffic. Reduce n to, say, 10, and you will see what
> >> I mean.
> >>
> >> In the shared case you are not sending the array over the network, but
> >> you are still summing the entire array 1000 times. Some of the
> >> remotecall_fetch calls seem to be taking an extra 40 milliseconds,
> >> which adds to the total.
> >>
> >> The shared time of 6 seconds being less than the 15 seconds for the
> >> non-shared case seems to be just incidental.
> >>
> >> I don't yet have an explanation for the extra 40 milliseconds per
> >> remotecall_fetch (for some calls only) in the shared case.
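One way to probe Amit's hypothesis, that the extra latency comes from passing the SharedArray itself rather than from the remote-call machinery, is to time each worker on a call that takes no arguments and on one that takes A. This is a hedged sketch against the same 0.3-era API used above; if the hypothesis is right, the extra milliseconds should show up only in the "with A" column.

    # Compare per-call latency with and without shipping the SharedArray,
    # to isolate the cost of re-resolving the shmem mapping on the worker.
    # Two passes are made, and only the second is printed, to reduce
    # compilation noise in the measurements.
    f0 = () -> 1
    f1 = x -> 1
    A  = Base.shmem_fill(1, (1000, 1000))
    for pass in 1:2, p in workers()
        t0 = time(); remotecall_fetch(p, f0);    t1 = time()
        t2 = time(); remotecall_fetch(p, f1, A); t3 = time()
        pass == 2 && println("pid $p  bare: ", int((t1 - t0) * 1000),
                             " ms, with A: ", int((t3 - t2) * 1000), " ms")
    end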
> >> On Thu, Mar 27, 2014 at 2:50 PM, Mikael Simberg <[email protected]> wrote:
> >>> Hi,
> >>> I'm having some trouble figuring out exactly how I'm supposed to use
> >>> SharedArrays - I might just be misunderstanding them, or else
> >>> something odd is happening with them.
> >>>
> >>> I'm trying to do some parallel computing which looks a bit like this
> >>> test case:
> >>>
> >>>     function createdata(shared)
> >>>         const n = 1000
> >>>         if shared
> >>>             A = SharedArray(Uint, (n, n))
> >>>         else
> >>>             A = Array(Uint, (n, n))
> >>>         end
> >>>         for i = 1:n, j = 1:n
> >>>             A[i, j] = rand(Uint)
> >>>         end
> >>>         return n, A
> >>>     end
> >>>
> >>>     function mainfunction(r; shared = false)
> >>>         n, A = createdata(shared)
> >>>         i = 1
> >>>         nextidx() = (idx = i; i += 1; idx)
> >>>         @sync begin
> >>>             for p in workers()
> >>>                 @async begin
> >>>                     while true
> >>>                         idx = nextidx()
> >>>                         if idx > r
> >>>                             break
> >>>                         end
> >>>                         found, s = remotecall_fetch(p, parfunction, n, A)
> >>>                     end
> >>>                 end
> >>>             end
> >>>         end
> >>>     end
> >>>
> >>>     function parfunction(n::Int, A::Array{Uint, 2})
> >>>         # possibly do some other computation here independent of
> >>>         # shared arrays
> >>>         s = sum(A)
> >>>         return false, s
> >>>     end
> >>>
> >>>     function parfunction(n::Int, A::SharedArray{Uint, 2})
> >>>         s = sum(A)
> >>>         return false, s
> >>>     end
> >>>
> >>> If I then start julia with e.g. two worker processes, so julia -p 2,
> >>> the following happens:
> >>>
> >>>     julia> require("testpar.jl")
> >>>
> >>>     julia> @time mainfunction(1000, shared = false)
> >>>     elapsed time: 15.717117365 seconds (8448701068 bytes allocated)
> >>>
> >>>     julia> @time mainfunction(1000, shared = true)
> >>>     elapsed time: 6.068758627 seconds (56713996 bytes allocated)
> >>>
> >>>     julia> rmprocs([2, 3])
> >>>     :ok
> >>>
> >>>     julia> @time mainfunction(1000, shared = false)
> >>>     elapsed time: 0.717638344 seconds (40357664 bytes allocated)
> >>>
> >>>     julia> @time mainfunction(1000, shared = true)
> >>>     elapsed time: 0.702174085 seconds (32680628 bytes allocated)
> >>>
> >>> So, with a normal array it's slow, as expected, and it is faster with
> >>> the shared array. But what seems to happen is that with the normal
> >>> array CPU usage is 100% on two cores, while with the shared array CPU
> >>> usage spikes for a fraction of a second and then sits at around 10%
> >>> for the remaining nearly 6 seconds. Can anyone reproduce this? Am I
> >>> just doing something wrong with shared arrays?
> >>>
> >>> On a slightly related note: is there now a way to create a random
> >>> shared array? https://github.com/JuliaLang/julia/pull/4939 and the
> >>> latest docs don't mention this.
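On Mikael's last question, one pattern worth noting (a sketch using the 0.3-era SharedArray constructor's init keyword, not a dedicated "random shared array" constructor): each participating worker fills only its own chunk of the array, which also parallelizes the element-by-element loop in createdata above.

    # Sketch: parallel random initialization via the init keyword.
    # Each worker mapped to the array fills only its localindexes(S),
    # so the chunks do not overlap and no locking is needed.
    S = SharedArray(Uint, (1000, 1000); init = S -> begin
        for i in localindexes(S)
            S[i] = rand(Uint)
        end
    end)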
