There is a pattern here. Across the set of pids, the per-call times sum to
roughly 40 milliseconds. In a SharedArray, RemoteRefs to the shmem mappings
on each of the workers are maintained on the creating pid (in this case 1).
I think the workers are referring back to pid 1 to fetch the local mapping
when the shared array object is passed in the remotecall_fetch call, and
hence they are all blocked waiting for pid 1 to become free to service
these calls.


On Thu, Mar 27, 2014 at 5:58 PM, Amit Murthy <[email protected]> wrote:

> Some more weirdness
>
> Starting with julia -p 8
>
> A=Base.shmem_fill(1, (1000,1000))
>
> Using 2 workers:
> for i in 1:100
>          t1 = time(); p=2+(i%2); remotecall_fetch(p, x->1, A); t2=time();
> println("@ $p ", int((t2-t1) * 1000))
> end
>
> prints
>
> ...
> @ 3 8
> @ 2 32
> @ 3 8
> @ 2 32
> @ 3 8
> @ 2 32
> @ 3 8
> @ 2 32
>
>
> Notice that pid 2 always takes 32 milliseconds while pid 3 always takes 8.
>
>
>
> With 4 workers:
>
> for i in 1:100
>          t1 = time(); p=2+(i%4); remotecall_fetch(p, x->1, A); t2=time();
> println("@ $p ", int((t2-t1) * 1000))
> end
>
> ...
> @ 2 31
> @ 3 4
> @ 4 4
> @ 5 1
> @ 2 31
> @ 3 4
> @ 4 4
> @ 5 1
> @ 2 31
> @ 3 4
> @ 4 4
> @ 5 1
> @ 2 31
>
>
> Now pid 2 always takes 31 milliseconds, pids 3 and 4 take 4 milliseconds
> each, and pid 5 takes 1 millisecond.
>
> With 8 workers:
>
> for i in 1:100
>          t1 = time(); p=2+(i%8); remotecall_fetch(p, x->1, A); t2=time();
> println("@ $p ", int((t2-t1) * 1000))
> end
>
> ....
> @ 2 20
> @ 3 4
> @ 4 1
> @ 5 3
> @ 6 4
> @ 7 1
> @ 8 2
> @ 9 4
> @ 2 20
> @ 3 4
> @ 4 1
> @ 5 3
> @ 6 4
> @ 7 1
> @ 8 2
> @ 9 4
> @ 2 20
> @ 3 4
> @ 4 1
> @ 5 3
> @ 6 4
> @ 7 1
> @ 8 3
> @ 9 4
> @ 2 20
> @ 3 4
> @ 4 1
> @ 5 3
> @ 6 4
>
>
> pid 2 is always 20 milliseconds while the rest are pretty consistent too.
>
> Any explanations?
>
>
>
>
>
>
>
> On Thu, Mar 27, 2014 at 5:24 PM, Amit Murthy <[email protected]> wrote:
>
>> I think the code does not do what you want.
>>
>> In the non-shared case you are sending a 10^6-integer array over the
>> network 1000 times and summing it as many times. Most of the time is
>> network traffic. Reduce 'n' to, say, 10, and you will see what I mean.
>>
>> In the shared case you are not sending the array over the network but are
>> still summing the entire array 1000 times. Some of the remotecall_fetch
>> calls seem to be taking 40 milliseconds of extra time, which adds to the
>> total.
>>
>> The shared time of 6 seconds being less than the 15 seconds for the
>> non-shared case seems to be incidental.
>>
>> I don't yet have an explanation for the extra 40 milliseconds per
>> remotecall_fetch (for some calls only) in the shared case.
>>
>>
>>
>>
>>
>>
>> On Thu, Mar 27, 2014 at 2:50 PM, Mikael Simberg <[email protected]> wrote:
>>
>>> Hi,
>>> I'm having some trouble figuring out exactly how I'm supposed to use
>>> SharedArrays - I might just be misunderstanding them or else something
>>> odd is happening with them.
>>>
>>> I'm trying to do some parallel computing which looks a bit like this
>>> test case:
>>>
>>> function createdata(shared)
>>>     const n = 1000
>>>     if shared
>>>         A = SharedArray(Uint, (n, n))
>>>     else
>>>         A = Array(Uint, (n, n))
>>>     end
>>>     for i = 1:n, j = 1:n
>>>         A[i, j] = rand(Uint)
>>>     end
>>>
>>>     return n, A
>>> end
>>>
>>> function mainfunction(r; shared = false)
>>>     n, A = createdata(shared)
>>>
>>>     i = 1
>>>     nextidx() = (idx = i; i += 1; idx)
>>>
>>>     @sync begin
>>>         for p in workers()
>>>             @async begin
>>>                 while true
>>>                     idx = nextidx()
>>>                     if idx > r
>>>                         break
>>>                     end
>>>                     found, s = remotecall_fetch(p, parfunction, n, A)
>>>                 end
>>>             end
>>>         end
>>>     end
>>> end
>>>
>>> function parfunction(n::Int, A::Array{Uint, 2})
>>>     # possibly do some other computation here independent of shared arrays
>>>     s = sum(A)
>>>     return false, s
>>> end
>>>
>>> function parfunction(n::Int, A::SharedArray{Uint, 2})
>>>     s = sum(A)
>>>     return false, s
>>> end
>>>
>>> If I then start julia with e.g. two worker processes, so julia -p 2, the
>>> following happens:
>>>
>>> julia> require("testpar.jl")
>>>
>>> julia> @time mainfunction(1000, shared = false)
>>> elapsed time: 15.717117365 seconds (8448701068 bytes allocated)
>>>
>>> julia> @time mainfunction(1000, shared = true)
>>> elapsed time: 6.068758627 seconds (56713996 bytes allocated)
>>>
>>> julia> rmprocs([2, 3])
>>> :ok
>>>
>>> julia> @time mainfunction(1000, shared = false)
>>> elapsed time: 0.717638344 seconds (40357664 bytes allocated)
>>>
>>> julia> @time mainfunction(1000, shared = true)
>>> elapsed time: 0.702174085 seconds (32680628 bytes allocated)
>>>
>>> So, with a normal array it's slow as expected, and it is faster with the
>>> shared array, but what seems to happen is that with the normal array CPU
>>> usage is 100% on two cores, while with the shared array CPU usage spikes
>>> for a fraction of a second and then sits at around 10% for the remaining
>>> nearly 6 seconds. Can anyone reproduce this? Am I just doing something
>>> wrong with shared arrays?
>>>
>>> Slightly related note: is there now a way to create a random shared
>>> array? https://github.com/JuliaLang/julia/pull/4939 and the latest docs
>>> don't mention this.
>>>
>>
>>
>
