Hi Tim,

The issue of the extra 40 milliseconds is related to how the RemoteRefs to the individual shmem mappings are fetched. I don't quite get how pmap_bw is related to this, though (my rough reading of the SharedArray busy-wait idea is sketched after the quoted thread below).
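To check that the extra time really comes from passing the SharedArray (and hence from fetching the mapping via pid 1) rather than from the remote call machinery itself, something like the following should separate the two. This is only a sketch, using the same 0.3-era API as the timing loops quoted below (Base.shmem_fill, remotecall_fetch(pid, f, args...)); time_remote_calls is just a made-up name for this test:

    # Run as: julia -p 2 (sketch only; uses the API from the quoted thread).
    A = Base.shmem_fill(1, (1000, 1000))

    function time_remote_calls(A, niter)
        for i in 1:niter
            p = 2 + (i % 2)

            # Trivial remote call that does not reference the SharedArray.
            t1 = time(); remotecall_fetch(p, () -> 1); t2 = time()

            # The same trivial call, but with the SharedArray passed as an
            # argument, so it has to be serialized to the worker.
            t3 = time(); remotecall_fetch(p, x -> 1, A); t4 = time()

            println("@ $p  without A: ", int((t2 - t1) * 1000), " ms",
                    ", with A: ", int((t4 - t3) * 1000), " ms")
        end
    end

    time_remote_calls(A, 20)

If the "with A" column carries the extra 30-40 ms while the "without A" column does not, that would point at the SharedArray deserialization (i.e. the RemoteRef lookup on pid 1) rather than at the remote call itself.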
On Thu, Mar 27, 2014 at 6:27 PM, Tim Holy <[email protected]> wrote:
> This is why, with my original implementation of SharedArrays (oh-so-long-ago),
> I created pmap_bw, to do busy-wait on the return value of a SharedArray
> computation. The amusing part is that you can use a SharedArray to do the
> synchronization among processes.
>
> --Tim
>
> On Thursday, March 27, 2014 06:11:12 PM Amit Murthy wrote:
> > There is a pattern here. For a set of pids, the cumulative sum is 40
> > milliseconds. In a SharedArray, RemoteRefs are maintained on the creating
> > pid (in this case 1) to the shmem mappings on each of the workers. I think
> > they refer back to pid 1 to fetch the local mapping when the shared array
> > object is passed in the remotecall_fetch call, and hence all the workers
> > are stuck waiting on pid 1 to become free to service these calls.
> >
> > On Thu, Mar 27, 2014 at 5:58 PM, Amit Murthy <[email protected]> wrote:
> > > Some more weirdness.
> > >
> > > Starting with julia -p 8:
> > >
> > >     A = Base.shmem_fill(1, (1000,1000))
> > >
> > > Using 2 workers:
> > >
> > >     for i in 1:100
> > >         t1 = time(); p = 2+(i%2); remotecall_fetch(p, x->1, A); t2 = time();
> > >         println("@ $p ", int((t2-t1) * 1000))
> > >     end
> > >
> > > prints
> > >
> > >     ...
> > >     @ 3 8
> > >     @ 2 32
> > >     @ 3 8
> > >     @ 2 32
> > >     @ 3 8
> > >     @ 2 32
> > >     @ 3 8
> > >     @ 2 32
> > >
> > > Notice that pid 2 always takes 32 milliseconds while pid 3 always takes 8.
> > >
> > > With 4 workers:
> > >
> > >     for i in 1:100
> > >         t1 = time(); p = 2+(i%4); remotecall_fetch(p, x->1, A); t2 = time();
> > >         println("@ $p ", int((t2-t1) * 1000))
> > >     end
> > >
> > >     ...
> > >     @ 2 31
> > >     @ 3 4
> > >     @ 4 4
> > >     @ 5 1
> > >     @ 2 31
> > >     @ 3 4
> > >     @ 4 4
> > >     @ 5 1
> > >     @ 2 31
> > >     @ 3 4
> > >     @ 4 4
> > >     @ 5 1
> > >     @ 2 31
> > >
> > > Now pid 2 always takes 31 milliseconds, pids 3 and 4 take 4, and pid 5
> > > takes 1 millisecond.
> > >
> > > With 8 workers:
> > >
> > >     for i in 1:100
> > >         t1 = time(); p = 2+(i%8); remotecall_fetch(p, x->1, A); t2 = time();
> > >         println("@ $p ", int((t2-t1) * 1000))
> > >     end
> > >
> > >     ...
> > >     @ 2 20
> > >     @ 3 4
> > >     @ 4 1
> > >     @ 5 3
> > >     @ 6 4
> > >     @ 7 1
> > >     @ 8 2
> > >     @ 9 4
> > >     @ 2 20
> > >     @ 3 4
> > >     @ 4 1
> > >     @ 5 3
> > >     @ 6 4
> > >     @ 7 1
> > >     @ 8 2
> > >     @ 9 4
> > >     @ 2 20
> > >     @ 3 4
> > >     @ 4 1
> > >     @ 5 3
> > >     @ 6 4
> > >     @ 7 1
> > >     @ 8 3
> > >     @ 9 4
> > >     @ 2 20
> > >     @ 3 4
> > >     @ 4 1
> > >     @ 5 3
> > >     @ 6 4
> > >
> > > pid 2 always takes 20 milliseconds, while the rest are pretty consistent too.
> > >
> > > Any explanations?
> > >
> > > On Thu, Mar 27, 2014 at 5:24 PM, Amit Murthy <[email protected]> wrote:
> > >> I think the code does not do what you want.
> > >>
> > >> In the non-shared case you are sending a 10^6-element integer array over
> > >> the network 1000 times and summing it as many times. Most of the time is
> > >> network traffic time. Reduce 'n' to, say, 10 and you will see what I mean.
> > >>
> > >> In the shared case you are not sending the array over the network, but you
> > >> are still summing the entire array 1000 times. Some of the remotecall_fetch
> > >> calls seem to be taking 40 milliseconds of extra time, which adds to the
> > >> total.
> > >>
> > >> The shared time of 6 seconds being less than the 15 seconds for non-shared
> > >> seems to be just incidental.
> > >>
> > >> I don't yet have an explanation for the extra 40 milliseconds per
> > >> remotecall_fetch (for some calls only) in the shared case.
> > >>
> > >> On Thu, Mar 27, 2014 at 2:50 PM, Mikael Simberg <[email protected]> wrote:
> > >>> Hi,
> > >>> I'm having some trouble figuring out exactly how I'm supposed to use
> > >>> SharedArrays - I might just be misunderstanding them, or else something
> > >>> odd is happening with them.
> > >>>
> > >>> I'm trying to do some parallel computing which looks a bit like this
> > >>> test case:
> > >>>
> > >>>     function createdata(shared)
> > >>>         const n = 1000
> > >>>         if shared
> > >>>             A = SharedArray(Uint, (n, n))
> > >>>         else
> > >>>             A = Array(Uint, (n, n))
> > >>>         end
> > >>>         for i = 1:n, j = 1:n
> > >>>             A[i, j] = rand(Uint)
> > >>>         end
> > >>>         return n, A
> > >>>     end
> > >>>
> > >>>     function mainfunction(r; shared = false)
> > >>>         n, A = createdata(shared)
> > >>>         i = 1
> > >>>         nextidx() = (idx = i; i += 1; idx)
> > >>>         @sync begin
> > >>>             for p in workers()
> > >>>                 @async begin
> > >>>                     while true
> > >>>                         idx = nextidx()
> > >>>                         if idx > r
> > >>>                             break
> > >>>                         end
> > >>>                         found, s = remotecall_fetch(p, parfunction, n, A)
> > >>>                     end
> > >>>                 end
> > >>>             end
> > >>>         end
> > >>>     end
> > >>>
> > >>>     function parfunction(n::Int, A::Array{Uint, 2})
> > >>>         # possibly do some other computation here independent of shared arrays
> > >>>         s = sum(A)
> > >>>         return false, s
> > >>>     end
> > >>>
> > >>>     function parfunction(n::Int, A::SharedArray{Uint, 2})
> > >>>         s = sum(A)
> > >>>         return false, s
> > >>>     end
> > >>>
> > >>> If I then start julia with e.g. two worker processes, so julia -p 2, the
> > >>> following happens:
> > >>>
> > >>>     julia> require("testpar.jl")
> > >>>
> > >>>     julia> @time mainfunction(1000, shared = false)
> > >>>     elapsed time: 15.717117365 seconds (8448701068 bytes allocated)
> > >>>
> > >>>     julia> @time mainfunction(1000, shared = true)
> > >>>     elapsed time: 6.068758627 seconds (56713996 bytes allocated)
> > >>>
> > >>>     julia> rmprocs([2, 3])
> > >>>     :ok
> > >>>
> > >>>     julia> @time mainfunction(1000, shared = false)
> > >>>     elapsed time: 0.717638344 seconds (40357664 bytes allocated)
> > >>>
> > >>>     julia> @time mainfunction(1000, shared = true)
> > >>>     elapsed time: 0.702174085 seconds (32680628 bytes allocated)
> > >>>
> > >>> So, with a normal array it's slow as expected, and it is faster with the
> > >>> shared array. But what seems to happen is that with the normal array, CPU
> > >>> usage is 100% on two cores, while with the shared array, CPU usage spikes
> > >>> for a fraction of a second and then sits at around 10% for the remaining
> > >>> nearly 6 seconds. Can anyone reproduce this? Am I just doing something
> > >>> wrong with shared arrays?
> > >>>
> > >>> Slightly related note: is there now a way to create a random shared
> > >>> array? https://github.com/JuliaLang/julia/pull/4939 and the latest docs
> > >>> don't mention this.
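Regarding Tim's point above about using a SharedArray for the synchronization itself, here is the sketch referred to in my reply. It is only my reading of the idea, not pmap_bw itself, and work_chunk, results, and flags are made-up names for illustration: each worker writes its partial result and a done flag into small shared arrays, and pid 1 busy-waits on the flags instead of fetching a result per remote call.

    # Sketch only (my reading of the busy-wait idea, not pmap_bw itself).
    # Run with julia -p 2 or more.
    A = Base.shmem_fill(1, (1000, 1000))

    results = Base.shmem_fill(0, (nworkers(),))   # one result slot per worker
    flags   = Base.shmem_fill(0, (nworkers(),))   # one done flag per worker

    # Each worker sums its own share of the columns and raises its flag.
    @everywhere function work_chunk(A, results, flags, w, nw)
        s = 0
        for j in w:nw:size(A, 2), i in 1:size(A, 1)
            s += A[i, j]
        end
        results[w] = s
        flags[w] = 1          # signal completion through shared memory
    end

    for (w, p) in enumerate(workers())
        remotecall(p, work_chunk, A, results, flags, w, nworkers())
    end

    # Busy-wait on the shared flags instead of fetching the RemoteRefs.
    while sum(flags) < nworkers()
        yield()
    end

    println("total = ", sum(results))

The total is just 1000*1000 here since A is filled with ones; the point is only the synchronization pattern, not the computation.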
