Woah, also:
A = B = zeros(Float64,Ly,Lx);
is almost surely not what you intended. It binds A and B to the same array, so a write through one shows up in the other:
julia> A = B = [1 2]
1x2 Array{Int64,2}:
1 2
julia> A[1] = 10
10
julia> B
1x2 Array{Int64,2}:
10 2
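If you want independent accumulators, allocate each one separately, e.g. (a minimal sketch):

A = zeros(Float64, Ly, Lx)
B = zeros(Float64, Ly, Lx)
C = zeros(Complex{Float64}, Ly, Lx)

or use B = copy(A) if B should just start out equal to A rather than share memory with it.
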
On Tue, Feb 4, 2014 at 2:15 PM, David Salamon <[email protected]> wrote:
> 1. You should change:
> C = complex(zeros(Float64, Ly, Lx))
> to:
> C = zeros(Complex{Float64}, Ly, Lx)
> [the way you are doing it there allocates a Float64 array, then a complex
> copy of it, and throws the Float64 one away]
>
> 2. The algorithm after the above change allocates 3 * limit * (limit/2)
> * samples * 16 bytes in the @parallel loop
> [3 being one allocation for A, B, and C ; 16 being
> sizeof(Complex{Float64})]
> [without change #1 it's 4 * limit * (limit/2) * samples * 16]
> [so yeah, makes sense that would grow fast as limit increases]
> You could reduce this by 50% by running the parallel computation over D
> and E directly, then scaling down afterwards (see the sketch after point
> 5). That would also be faster, because you skip the element-wise multiply
> and the element-wise scaling; you're walking D three times more often than
> you need to.
>
> 3. You could consider going Float32 (factor of 2 savings of course), or
> Float16 if those would meet your needs.
>
> 4. Style-wise, you should do a destructuring bind instead of using tuphat:
> (A, B, C) = @parallel (add_three_tuples) for ...
>
> or really:
> (D, E) = @parallel (add_two_tuples) for ...
>
> 5. You'll probably get better numerical stability if you scale down prior
> to the addition step (so doing D[i,j] += Aij * Bij / samples / samples
> instead of D[i,j] += Aij * Bij and then dividing before returning)
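> 
> Pulling 1, 2, 4, and 5 together, something like this (a rough, untested
> sketch; note it accumulates D as a per-sample sum of A.*B as in point 5,
> which is not quite the same thing as summing A and B over all samples first
> and multiplying afterwards, so check which statistic you actually want):
> 
> @everywhere add_two_tuples(x,y) = (x[1]+y[1], x[2]+y[2])
> 
> (D, E) = @parallel (add_two_tuples) for k = 1:samples
>     Dk = zeros(Float64, Ly, Lx)
>     Ek = zeros(Complex{Float64}, Ly, Lx)   # point 1: allocate complex directly
>     for j = 1:Lx, i = 1:Ly
>         Sx = S[i,k]; Sy = S[j,k]; Sxy = S[i+j,k]
>         p = Sy*Sx
>         Dk[i,j] = abs2(p)*abs2(Sxy)/samples/samples  # abs2(Sxy) == abs2(sqrt(conj(Sxy)*Sxy))
>         Ek[i,j] = conj(Sxy)*p/samples
>     end
>     (Dk, Ek)
> end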
> 
> If you make the above changes (specifically 1 and 2) you'll have the
> allocation down to the minimum size possible for this approach to
> parallelism, i.e. (workers) * (size of returned data). If that per-worker
> factor remains a problem, it's not *so* bad compiling Julia from source to
> get SharedArrays.
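> 
> With SharedArrays the workers write into one preallocated result instead of
> shipping per-worker copies back, something like (rough sketch against the
> 0.3-dev API, untested, with the same caveat as above about how D is
> accumulated):
> 
> D = SharedArray(Float64, (Ly, Lx));          fill!(D, 0.0)
> E = SharedArray(Complex{Float64}, (Ly, Lx)); fill!(E, 0.0+0.0im)
> @sync @parallel for j = 1:Lx   # each iteration owns a whole column, so no write races
>     for k = 1:samples, i = 1:Ly
>         p = S[j,k]*S[i,k]; Sxy = S[i+j,k]
>         D[i,j] += abs2(p)*abs2(Sxy)/samples/samples
>         E[i,j] += conj(Sxy)*p/samples
>     end
> end
> # D and E now live in shared memory; sdata(D) gives a plain Array if needed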
>
> The only other thing to consider would be how you are using the return
> value. Perhaps you could get away with computing a smaller sized return
> value?
>
>
> On Tue, Feb 4, 2014 at 1:49 PM, Alex C <[email protected]> wrote:
>
>> Sorry for the confusion. I was trying to get a simple example to work so
>> I wouldn't get distracted by details. The @everywhere did the trick.
>> This is the fastest parallel version of the code that I was able to get
>> working. However, I easily run into memory limitations (8GB RAM) as I
>> increase the variable 'limit'.
>>
>> function expensive_hat(S::Array{Complex{Float64},2}, limit::Int64)
>>
>>     samples = size(S,2);
>>     Ly = int64(limit/4);
>>     Lx = int64(limit/2);
>>
>>     @everywhere add_three_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>>
>>     tuphat = @parallel (add_three_tuple) for k = 1:samples
>>         A = B = zeros(Float64,Ly,Lx);
>>         C = complex(zeros(Float64,Ly,Lx));
>>         for j = 1:Lx, i = 1:Ly
>>             Sx = S[i,k];
>>             Sy = S[j,k];
>>             Sxy = S[i+j,k];
>>             Sxyc = conj(Sxy);
>>             A[i,j] += abs2(Sy*Sx);
>>             B[i,j] += abs2(sqrt(Sxyc*Sxy));
>>             C[i,j] += Sxyc*Sy*Sx;
>>         end
>>         (A, B, C)
>>     end
>>
>>     D = tuphat[1].*tuphat[2]/samples/samples;
>>     E = tuphat[3]/samples;
>>     return (D, E);
>>
>> end
>>
>> srand(77);
>> data = rand(24000,64);
>> limit = 4000;
>> S = rfft(data,1)./24000;
>>
>> @time (D, E) = expensive_hat(S,limit);
>>
>> On Tuesday, February 4, 2014 3:01:17 PM UTC-5, David Salamon wrote:
>>
>>> huh.
>>>
>>> maybe @everywhere in front of the function definition? I'm not sure
>>>
>>>
>>> On Tue, Feb 4, 2014 at 10:53 AM, Alex C <[email protected]> wrote:
>>>
>>>> Thanks for the hint. Getting rid of 'mx' and 'my' definitely helps.
>>>>
>>>> I couldn't figure out how to implement the parallel version of tuple
>>>> adding. This is what I've got. It crashes my Julia Studio console when I
>>>> try to run it. What am I missing?
>>>>
>>>> add_two_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>>>>
>>>> @parallel (add_two_tuple) for i = 1:100
>>>>     x = (1,2,3)
>>>> end
>>>>
>>>>
>>>> I also don't have the SharedArray function (running Julia 0.2.0) so I
>>>> couldn't implement the other alternative.
>>>>
>>>> Alex
>>>>
>>>>
>>>> On Monday, February 3, 2014 11:45:57 AM UTC-5, David Salamon wrote:
>>>>
>>>>> You're not out of the no-slicing woods yet. Looks like you can get rid
>>>>> of `mx` and `my`
>>>>>
>>>>> for i=1:limit, j=1:int64(limit/2)
>>>>> end
>>>>>
>>>>> As far as parallelizing, you could define:
>>>>> three_tup_add(a, b, c) = (a[1] + b[1] + c[1], a[2] + b[2] + c[2], a[3] + b[3] + c[3])
>>>>>
>>>>> and then do a @parallel (three_tup_add) over your sample index?
>>>>>
>>>>> for that matter, why not compute the two parts of the answer directly
>>>>> rather than going via A, B, and C?
>>>>>
>>>>> On Mon, Feb 3, 2014 at 8:11 AM, Alex C <[email protected]> wrote:
>>>>>
>>>>>> Thanks. I've re-written the function to minimize the amount of
>>>>>> copying (i.e. slicing) that is required. But now, I'm befuddled as to how
>>>>>> to parallelize this function using Julia. Any suggestions?
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> function expensive_hat(S::Array{Complex{Float64},2}, mx::Array{Int64,2}, my::Array{Int64,2})
>>>>>>
>>>>>>     samples = 64
>>>>>>     A = zeros(size(mx));
>>>>>>     B = zeros(size(mx));
>>>>>>     C = zeros(size(mx));
>>>>>>
>>>>>>     for i = 1:samples
>>>>>>         Si = S[:,i];
>>>>>>         Sx = Si[mx];
>>>>>>         Sy = Si[my];
>>>>>>         Sxy = Si[mx+my];
>>>>>>         Sxyc = conj(Sxy);
>>>>>>
>>>>>>         A += abs2(Sy .* Sx);
>>>>>>         B += abs2(sqrt(Sxyc .* Sxy));
>>>>>>         C += Sxyc .* Sy .* Sx;
>>>>>>     end
>>>>>>
>>>>>>     ans = (A .* B ./ samples ./ samples, C./samples)
>>>>>>     return ans
>>>>>>
>>>>>> end
>>>>>>
>>>>>> data = rand(24000,64);
>>>>>> limit = 2000;
>>>>>>
>>>>>> ix = int64([1:limit/2]);
>>>>>> iy = ix[1:end/2];
>>>>>> mg = zeros(Int64,length(iy),length(ix));
>>>>>> mx = broadcast(+,ix',mg);
>>>>>> my = broadcast(+,iy,mg);
>>>>>> S = rfft(data,1)./24000;
>>>>>>
>>>>>> @time (AB, C) = expensive_hat(S,mx,my);