1. Done
2. I am not sure what you mean by "running the parallel computation over D 
and E directly, then scaling it down afterwards," but I now do the scaling 
element-wise instead of at the end
3. I will consider it in the future
4. Done, although it won't let me name the outputs (A,B,C); that gives an 
ErrorException("A not defined"), so I bound to (a,b,e) instead
5. I don't think I can do that operation element-wise, since it is not 
mathematically equivalent: sum(A)*sum(B) != sum(A.*B)  [omitting the 
scaling part].  The multiplicative step has to happen after the summation. 
It could be done element-wise if samples became the inner loop.
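A quick toy check of that inequality (the values are made up, just to illustrate):

```julia
# Summing first and then multiplying is not the same as multiplying
# element-wise and then summing.
A = [1.0, 2.0]
B = [3.0, 4.0]
sum(A) * sum(B)   # 3.0 * 7.0  = 21.0
sum(A .* B)       # 1*3 + 2*4  = 11.0
```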

The latest version of the parallel code does run slightly faster than 
the previous one. 
The elements of the return values are independent of one another, so I 
could break the matrices into pieces and compute them piecewise. I didn't 
want to go down that route quite yet.
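If I do go piecewise, the splitting itself would be simple since the (i,j) 
entries are independent; something like this (block_ranges is a hypothetical 
helper, just to sketch the idea) would carve the j range into contiguous 
blocks that separate tasks could compute and concatenate:

```julia
# Split 1:n into nblocks contiguous, near-equal ranges; each block of
# columns could then be computed independently and stitched back together.
block_ranges(n, nblocks) =
    [div(n*(b-1), nblocks)+1 : div(n*b, nblocks) for b = 1:nblocks]

block_ranges(10, 3)   # 1:3, 4:6, 7:10
```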

Alex

function expensive_hat(S::Array{Complex{Float64},2}, limit::Int64)

    samples = size(S,2);
    Ly = int64(limit/4);
    Lx = int64(limit/2);

    # Reducer for @parallel: element-wise sum of two 3-tuples of arrays.
    @everywhere add_three_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])

    (a,b,e) = @parallel (add_three_tuple) for k = 1:samples
        A = zeros(Float64,Ly,Lx);
        B = zeros(Float64,Ly,Lx);
        E = zeros(Complex{Float64},Ly,Lx);
        for j = 1:Lx, i = 1:Ly
            Sx = S[i,k];
            Sy = S[j,k];
            Sxy = S[i+j,k];
            Sxyc = conj(Sxy);
            # Scale each term by 1/samples here rather than once at the end.
            A[i,j] += abs2(Sy*Sx)/samples;
            B[i,j] += abs2(sqrt(Sxyc*Sxy))/samples;
            E[i,j] += Sxyc*Sy*Sx/samples;
        end
        (A,B,E)
    end

    d = a.*b;
    return (d, e)

end

srand(77);
data = rand(24000,64);
limit = 4000;
S = rfft(data,1)./24000;

@time (D, E) = expensive_hat(S,limit);
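In case it helps, here is the reducer on its own with toy inputs (the arrays 
are made up, just to show how @parallel combines the per-sample tuples):

```julia
add_three_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])

# Two per-sample results get folded into one by element-wise array addition.
t = add_three_tuple(([1 2], [3 4], [5 6]), ([10 20], [30 40], [50 60]))
# t == ([11 22], [33 44], [55 66])
```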



On Tuesday, February 4, 2014 5:15:08 PM UTC-5, David Salamon wrote:
>
> 1. You should change:
> C = complex(zeros(Float64, Ly, Lx))
> to:
> C = zeros(Complex{Float64}, Ly, Lx)
> [the way you are doing it there creates a float version, then a complex 
> version, then trashes it]
>
> 2. The algorithm after the above change allocates   3 * limit * (limit/2) 
> * samples * 16   bytes in the @parallel loop
> [3 being one allocation for A, B, and C ; 16 being 
> sizeof(Complex{Float64})]
> [without change #1 it's 4 * limit * (limit/2) * samples * 16]
> [so yeah, makes sense that would grow fast as limit increases]
> You could reduce this by 50% by running the parallel computation over D and 
> E directly, then scaling it down afterwards. Which would also be faster, 
> because you don't have to do that element-wise multiply or the element-wise 
> scaling. You're walking D 3 times more often than you need to.
>
> 3. You could consider going Float32 (factor of 2 savings of course), or 
> Float16 if those would meet your needs.
>
> 4. Style-wise, you should do a deconstructing bind instead of using tuphat:
> (A, B, C) = @parallel (add_three_tuples) for ...
>
> or really:
> (D, E) = @parallel (add_two_tuples) for ...
>
> 5. You'll probably get better numerical stability if you scale down prior 
> to the addition step (so doing D[i,j] += Aij * Bij / samples / samples 
> instead of D[i,j] += Aij * Bij and then dividing before returning)
>
>
>
> If you make the above changes (specifically 1 and 2) you'll have the 
> allocation down to the minimum size possible for this approach to 
> threading... ie (threads) * (size of returned data). If the thread factor 
> remains a problem, it's not *so* bad compiling julia from source to get 
> SharedArrays.
>
> The only other thing to consider would be how you are using the return 
> value. Perhaps you could get away with computing a smaller sized return 
> value? 
>
>
> On Tue, Feb 4, 2014 at 1:49 PM, Alex C <[email protected]> wrote:
>
>> Sorry for the confusion. I was trying to get a simple example to work so 
>> I wouldn't get distracted by details. The @everywhere did the trick.
>> This is the fastest parallel version of the code that I was able to get 
>> working. However, I easily run into memory limitations (8GB RAM) as I 
>> increase the variable 'limit'.
>>
>> function expensive_hat(S::Array{Complex{Float64},2},limit::Int64)
>>
>>     samples = size(S,2);
>>     Ly = int64(limit/4);
>>     Lx = int64(limit/2);
>>
>>     @everywhere add_three_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>>
>>     tuphat = @parallel (add_three_tuple) for k = 1:samples
>>         A = B = zeros(Float64,Ly,Lx);
>>         C = complex(zeros(Float64,Ly,Lx));
>>         for j = 1:Lx, i = 1:Ly
>>             Sx = S[i,k];
>>             Sy = S[j,k];
>>             Sxy = S[i+j,k];
>>             Sxyc = conj(Sxy);
>>             A[i,j] += abs2(Sy*Sx);
>>             B[i,j] += abs2(sqrt(Sxyc*Sxy));
>>             C[i,j] += Sxyc*Sy*Sx;
>>         end
>>         (A, B, C)
>>     end
>>
>>     D = tuphat[1].*tuphat[2]/samples/samples;
>>     E = tuphat[3]/samples;
>>      return (D, E);
>>
>> end
>>
>> srand(77);
>> data = rand(24000,64);
>> limit = 4000;
>> S = rfft(data,1)./24000;
>>
>> @time (D, E) = expensive_hat(S,limit);
>>
>>
>>
>>
>> On Tuesday, February 4, 2014 3:01:17 PM UTC-5, David Salamon wrote:
>>
>>> huh.
>>>
>>> maybe @everywhere in front of the function definition? I'm not sure
>>>
>>>
>>> On Tue, Feb 4, 2014 at 10:53 AM, Alex C <[email protected]> wrote:
>>>
>>>> Thanks for the hint. Getting rid of 'mx' and 'my' definitely helps.
>>>>
>>>> I couldn't figure out how to implement the parallel version of tuple 
>>>> adding. This is what I've got. It crashes my Julia Studio console when I 
>>>> try to run it. What am I missing?
>>>>
>>>> add_two_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>>>>
>>>>
>>>> @parallel (add_two_tuple) for i = 1:100
>>>>
>>>>
>>>>
>>>>     x = (1,2,3)
>>>>
>>>> end
>>>>
>>>>
>>>> I also don't have the SharedArray function (running Julia 0.2.0) so I 
>>>> couldn't implement the other alternative.
>>>>
>>>> Alex
>>>>
>>>>
>>>> On Monday, February 3, 2014 11:45:57 AM UTC-5, David Salamon wrote:
>>>>
>>>>> You're not out of the no-slicing woods yet. Looks like you can get rid 
>>>>> of `mx` and `my`
>>>>>
>>>>> for i=1:limit, j=1:int64(limit/2)
>>>>> end
>>>>>
>>>>>
>>>>>
>>>>> As far as parallelizing, you could define:
>>>>> three_tup_add(a, b, c) = (a[1] + b[1] + c[1], a[2] + b[2] + c[2], a[3] 
>>>>> + b[3] + c[3])
>>>>>
>>>>> and then do a @parallel (three_tup_add) over your sample index?
>>>>>
>>>>> for that matter, why not compute the two parts of the answer directly 
>>>>> rather than going via A, B, and C?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  
>>>>>
>>>>> On Mon, Feb 3, 2014 at 8:11 AM, Alex C <[email protected]> wrote:
>>>>>
>>>>>> Thanks. I've re-written the function to minimize the amount of 
>>>>>> copying (i.e. slicing) that is required. But now, I'm befuddled as to 
>>>>>> how 
>>>>>> to parallelize this function using Julia. Any suggestions?
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> function expensive_hat(S::Array{Complex{Float64},2}, 
>>>>>> mx::Array{Int64,2}, my::Array{Int64,2})
>>>>>>
>>>>>>     samples = 64
>>>>>>         A = zeros(size(mx));
>>>>>>     B = zeros(size(mx));
>>>>>>     C = zeros(size(mx));
>>>>>>
>>>>>>     for i = 1:samples
>>>>>>         Si = S[:,i];
>>>>>>         Sx = Si[mx];
>>>>>>         Sy = Si[my];
>>>>>>         Sxy = Si[mx+my];
>>>>>>         Sxyc = conj(Sxy);
>>>>>>
>>>>>>                 A +=  abs2(Sy .* Sx);
>>>>>>         B += abs2(sqrt(Sxyc .* Sxy));
>>>>>>         C += Sxyc .* Sy .* Sx;
>>>>>>     end
>>>>>>
>>>>>>         ans = (A .* B ./ samples ./ samples, C./samples)
>>>>>>     return ans
>>>>>>
>>>>>> end
>>>>>>
>>>>>> data = rand(24000,64);
>>>>>> limit = 2000;
>>>>>>
>>>>>> ix = int64([1:limit/2]);
>>>>>> iy = ix[1:end/2];
>>>>>> mg = zeros(Int64,length(iy),length(ix));
>>>>>> mx = broadcast(+,ix',mg);
>>>>>> my = broadcast(+,iy,mg);
>>>>>> S = rfft(data,1)./24000;
>>>>>>
>>>>>> @time (AB, C) = expensive_hat(S,mx,my);
>>>>>>  
>>>>>>
>>>>>
>>>>>
>>>
>
