Good catch! I thought I had checked that the operation was okay before I
wrote the line of code. Guess not.
On Tuesday, February 4, 2014 5:16:47 PM UTC-5, David Salamon wrote:
>
> Woah, also:
> A = B = zeros(Float64,Ly,Lx);
>
> is almost surely not what you intended.
>
> julia> A = B = [1 2]
> 1x2 Array{Int64,2}:
> 1 2
> julia> A[1] = 10
> 10
> julia> B
> 1x2 Array{Int64,2}:
> 10 2
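>
> To get two independent arrays, allocate each one separately, e.g. (a
> sketch reusing the Ly/Lx from your code):
>
> A = zeros(Float64, Ly, Lx)
> B = zeros(Float64, Ly, Lx)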
>
>
> On Tue, Feb 4, 2014 at 2:15 PM, David Salamon <[email protected]> wrote:
>
>> 1. You should change:
>> C = complex(zeros(Float64, Ly, Lx))
>> to:
>> C = zeros(Complex{Float64}, Ly, Lx)
>> [the way you are doing it there creates a float array, then a complex
>> copy, then throws the float array away]
>>
>> 2. The algorithm after the above change allocates 3 * limit * (limit/2)
>> * samples * 16 bytes in the @parallel loop
>> [3 being one allocation each for A, B, and C ; 16 being
>> sizeof(Complex{Float64})]
>> [without change #1 it's 4 * limit * (limit/2) * samples * 16]
>> [with limit = 4000 and samples = 64 that's roughly 25 GB, so yeah, it
>> makes sense that would grow fast as limit increases]
>> You could reduce this by 50% by running the parallel computation over D
>> and E directly, then scaling down afterwards (see the sketch after point
>> 5). That would also be faster, because you skip the element-wise multiply
>> and the element-wise scaling; you're walking the data 3 times more often
>> than you need to.
>>
>> 3. You could consider going Float32 (factor of 2 savings of course), or
>> Float16 if those would meet your needs.
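>> (e.g. the allocation from point 1 would become
>> C = zeros(Complex{Float32}, Ly, Lx))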
>>
>> 4. Style-wise, you should do a destructuring bind instead of going
>> through tuphat:
>> (A, B, C) = @parallel (add_three_tuple) for ...
>>
>> or really:
>> (D, E) = @parallel (add_two_tuples) for ...
>>
>> 5. You'll probably get better numerical stability if you scale down prior
>> to the addition step (so doing D[i,j] += Aij * Bij / samples / samples
>> instead of D[i,j] += Aij * Bij and then dividing before returning)
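>>
>> Putting 1, 2, 4, and 5 together, the loop would look roughly like this
>> (an untested sketch reusing your S/samples/Ly/Lx; note that
>> abs2(sqrt(conj(z)*z)) is just abs2(z), so the sqrt can go):
>>
>> @everywhere add_two_tuples(x, y) = (x[1] + y[1], x[2] + y[2])
>>
>> (D, E) = @parallel (add_two_tuples) for k = 1:samples
>>     D = zeros(Float64, Ly, Lx)
>>     E = zeros(Complex{Float64}, Ly, Lx)
>>     for j = 1:Lx, i = 1:Ly
>>         Sx = S[i,k]
>>         Sy = S[j,k]
>>         Sxy = S[i+j,k]
>>         # scale before accumulating, per point 5
>>         D[i,j] += abs2(Sy*Sx) * abs2(Sxy) / samples / samples
>>         E[i,j] += conj(Sxy) * Sy * Sx / samples
>>     end
>>     (D, E)
>> end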
>>
>>
>>
>> If you make the above changes (specifically 1 and 2) you'll have the
>> allocation down to the minimum size possible for this approach to
>> threading... i.e. (threads) * (size of returned data). If the thread
>> factor remains a problem, it's not *so* bad compiling julia from source
>> to get SharedArrays.
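>>
>> For reference, the SharedArray route for the D part would look roughly
>> like this (a 0.3-era sketch, untested):
>>
>> D = SharedArray(Float64, (Ly, Lx))   # one array, visible to all workers
>> @sync @parallel for j = 1:Lx
>>     # each worker owns a chunk of columns, so writes don't collide
>>     for i = 1:Ly
>>         acc = 0.0
>>         for k = 1:samples
>>             acc += abs2(S[j,k] * S[i,k]) * abs2(S[i+j,k])
>>         end
>>         D[i,j] = acc / samples / samples
>>     end
>> end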
>>
>> The only other thing to consider would be how you are using the return
>> value. Perhaps you could get away with computing a smaller sized return
>> value?
>>
>>
>> On Tue, Feb 4, 2014 at 1:49 PM, Alex C <[email protected]> wrote:
>>
>>> Sorry for the confusion. I was trying to get a simple example to work so
>>> I wouldn't get distracted by details. The @everywhere did the trick.
>>> This is the fastest parallel version of the code that I was able to get
>>> working. However, I easily run into memory limitations (8GB RAM) as I
>>> increase the variable 'limit'.
>>>
>>> function expensive_hat(S::Array{Complex{Float64},2}, limit::Int64)
>>>
>>>     samples = size(S,2);
>>>     Ly = int64(limit/4);
>>>     Lx = int64(limit/2);
>>>
>>>     @everywhere add_three_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>>>
>>>     tuphat = @parallel (add_three_tuple) for k = 1:samples
>>>         A = B = zeros(Float64,Ly,Lx);
>>>         C = complex(zeros(Float64,Ly,Lx));
>>>         for j = 1:Lx, i = 1:Ly
>>>             Sx = S[i,k];
>>>             Sy = S[j,k];
>>>             Sxy = S[i+j,k];
>>>             Sxyc = conj(Sxy);
>>>             A[i,j] += abs2(Sy*Sx);
>>>             B[i,j] += abs2(sqrt(Sxyc*Sxy));
>>>             C[i,j] += Sxyc*Sy*Sx;
>>>         end
>>>         (A, B, C)
>>>     end
>>>
>>>     D = tuphat[1].*tuphat[2]/samples/samples;
>>>     E = tuphat[3]/samples;
>>>     return (D, E);
>>> end
>>>
>>> srand(77);
>>> data = rand(24000,64);
>>> limit = 4000;
>>> S = rfft(data,1)./24000;
>>>
>>> @time (D, E) = expensive_hat(S,limit);
>>>
>>>
>>>
>>>
>>> On Tuesday, February 4, 2014 3:01:17 PM UTC-5, David Salamon wrote:
>>>
>>>> huh.
>>>>
>>>> maybe @everywhere in front of the function definition? I'm not sure
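>>>>
>>>> i.e. something like:
>>>>
>>>> @everywhere add_two_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>>>>
>>>> so the definition exists on all the workers before @parallel ships the
>>>> loop body out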
>>>>
>>>>
>>>> On Tue, Feb 4, 2014 at 10:53 AM, Alex C <[email protected]> wrote:
>>>>
>>>>> Thanks for the hint. Getting rid of 'mx' and 'my' definitely helps.
>>>>>
>>>>> I couldn't figure out how to implement the parallel version of tuple
>>>>> adding. This is what I've got. It crashes my Julia Studio console when I
>>>>> try to run it. What am I missing?
>>>>>
>>>>> add_two_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>>>>>
>>>>> @parallel (add_two_tuple) for i = 1:100
>>>>>     x = (1,2,3)
>>>>> end
>>>>>
>>>>>
>>>>> I also don't have the SharedArray function (running Julia 0.2.0) so I
>>>>> couldn't implement the other alternative.
>>>>>
>>>>> Alex
>>>>>
>>>>>
>>>>> On Monday, February 3, 2014 11:45:57 AM UTC-5, David Salamon wrote:
>>>>>
>>>>>> You're not out of the no-slicing woods yet. Looks like you can get
>>>>>> rid of `mx` and `my`
>>>>>>
>>>>>> for i = 1:limit, j = 1:int64(limit/2)
>>>>>>     # index Si directly here instead of materializing
>>>>>>     # Sx = Si[mx], Sy = Si[my], Sxy = Si[mx+my]
>>>>>> end
>>>>>>
>>>>>>
>>>>>>
>>>>>> As far as parallelizing, you could define:
>>>>>> three_tup_add(a, b) = (a[1] + b[1], a[2] + b[2], a[3] + b[3])
>>>>>>
>>>>>> and then do a @parallel (three_tup_add) over your sample index?
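>>>>>>
>>>>>> i.e. roughly (a sketch; the loop body would build one sample's A/B/C
>>>>>> contribution, with dummy values here just to show the shape):
>>>>>>
>>>>>> (A, B, C) = @parallel (three_tup_add) for k = 1:samples
>>>>>>     (1.0, 2.0, 3.0)
>>>>>> end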
>>>>>>
>>>>>> for that matter, why not compute the two parts of the answer directly
>>>>>> rather than going via A, B, and C?
>>>>>>
>>>>>> On Mon, Feb 3, 2014 at 8:11 AM, Alex C <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks. I've re-written the function to minimize the amount of
>>>>>>> copying (i.e. slicing) that is required. But now, I'm befuddled as
>>>>>>> to how to parallelize this function using Julia. Any suggestions?
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> function expensive_hat(S::Array{Complex{Float64},2},
>>>>>>>                        mx::Array{Int64,2}, my::Array{Int64,2})
>>>>>>>
>>>>>>>     samples = 64
>>>>>>>     A = zeros(size(mx));
>>>>>>>     B = zeros(size(mx));
>>>>>>>     C = zeros(size(mx));
>>>>>>>
>>>>>>>     for i = 1:samples
>>>>>>>         Si = S[:,i];
>>>>>>>         Sx = Si[mx];
>>>>>>>         Sy = Si[my];
>>>>>>>         Sxy = Si[mx+my];
>>>>>>>         Sxyc = conj(Sxy);
>>>>>>>
>>>>>>>         A += abs2(Sy .* Sx);
>>>>>>>         B += abs2(sqrt(Sxyc .* Sxy));
>>>>>>>         C += Sxyc .* Sy .* Sx;
>>>>>>>     end
>>>>>>>
>>>>>>>     ans = (A .* B ./ samples ./ samples, C ./ samples)
>>>>>>>     return ans
>>>>>>> end
>>>>>>>
>>>>>>> data = rand(24000,64);
>>>>>>> limit = 2000;
>>>>>>>
>>>>>>> ix = int64([1:limit/2]);
>>>>>>> iy = ix[1:end/2];
>>>>>>> mg = zeros(Int64,length(iy),length(ix));
>>>>>>> mx = broadcast(+,ix',mg);
>>>>>>> my = broadcast(+,iy,mg);
>>>>>>> S = rfft(data,1)./24000;
>>>>>>>
>>>>>>> @time (AB, C) = expensive_hat(S,mx,my);