Hey Alex,

Great catch on #5 -- that was dumb on my part :)
Re #4: turns out it's a known bug involving parallel code specifically. I've updated https://github.com/JuliaLang/julia/issues/2669 to let them know it's no longer a theoretical discussion.

Re: breaking the matrix into pieces -- splitting the thread-work here would allow the #5 optimization as well (since you could assign different threads to different blocks rather than to different samples). I still think SharedArrays are the way to go :p

Anyway, glad to see how things worked out.

Cheers,
David

On Wed, Feb 5, 2014 at 7:27 AM, Alex C <[email protected]> wrote:
> 1. Done
> 2. I am not sure what you mean by "running the parallel computation over D
> and E directly, then scaling it down afterwards", but I do the scaling
> element-wise now instead of at the end
> 3. I will consider it in the future
> 4. Done, although it won't allow me to name the output (A,B,C). It gives an
> ErrorException("A not defined")
> 5. I don't think I can do that operation element-wise since it is not
> mathematically equivalent. That is, sum(A)*sum(B) != sum(A*B) [omitting the
> scaling part]. The multiplicative step has to be done after the summation.
> This could be done element-wise if samples becomes the inner loop.
>
> The latest version of the parallel code does run slightly faster than
> previously.
> The elements in the return values are independent of one another, so I
> could break the matrix into pieces and compute it piecewise. I didn't want
> to go down that route quite yet.
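As a quick numerical check of Alex's point 5 above: summing first and then multiplying is not the same operation as multiplying element-wise and then summing, so the multiplicative step really does have to happen after the reduction. A minimal sketch (the arrays are arbitrary, not from the thread's data):

```julia
A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]

# sum-then-multiply: (1+2+3+4) * (5+6+7+8) = 10 * 26 = 260
lhs = sum(A) * sum(B)

# multiply element-wise, then sum: 5 + 12 + 21 + 32 = 70
rhs = sum(A .* B)

lhs == rhs  # false: the two orderings are not interchangeable
```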
>
> Alex
>
> function expensive_hat(S::Array{Complex{Float64},2},limit::Int64)
>
> samples = size(S,2);
> Ly = int64(limit/4);
> Lx = int64(limit/2);
>
> @everywhere add_three_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>
> (a,b,e) = @parallel (add_three_tuple) for k = 1:samples
> A = zeros(Float64,Ly,Lx);
> B = zeros(Float64,Ly,Lx);
> E = zeros(Complex{Float64},Ly,Lx);
> for j = 1:Lx, i = 1:Ly
> Sx = S[i,k];
> Sy = S[j,k];
> Sxy = S[i+j,k];
> Sxyc = conj(Sxy);
> A[i,j] += abs2(Sy*Sx)/samples;
> B[i,j] += abs2(sqrt(Sxyc*Sxy))/samples;
> E[i,j] += Sxyc*Sy*Sx/samples;
> end
> (A,B,E)
> end
>
> d = a.*b;
> return (d, e)
>
> end
>
> srand(77);
> data = rand(24000,64);
> limit = 4000;
> S = rfft(data,1)./24000;
>
> @time (D, E) = expensive_hat(S,limit);
>
>
> On Tuesday, February 4, 2014 5:15:08 PM UTC-5, David Salamon wrote:
>
>> 1. You should change:
>> C = complex(zeros(Float64, Ly, Lx))
>> to:
>> C = zeros(Complex{Float64}, Ly, Lx)
>> [the way you are doing it there creates a float version, then a complex
>> version, then trashes it]
>>
>> 2. The algorithm after the above change allocates 3 * limit * (limit/2)
>> * samples * 16 bytes in the @parallel loop
>> [3 being one allocation each for A, B, and C; 16 being
>> sizeof(Complex{Float64})]
>> [without change #1 it's 4 * limit * (limit/2) * samples * 16]
>> [so yeah, makes sense that would grow fast as limit increases]
>> You could reduce this by 50% by running the parallel computation over D
>> and E directly, then scaling it down afterwards. Which would also be
>> faster, because you don't have to do that element-wise multiply or the
>> element-wise scaling. You're walking D 3 times more often than you need to.
>>
>> 3. You could consider going Float32 (factor of 2 savings of course), or
>> Float16 if those would meet your needs.
>>
>> 4. Style-wise, you should do a destructuring bind instead of using
>> tuphat:
>> (A, B, C) = @parallel (add_three_tuples) for ...
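On point 1 in the quoted advice: both spellings produce the same complex zero matrix, but `complex(zeros(Float64, ...))` first allocates a Float64 array and then builds a complex copy of it, while `zeros(Complex{Float64}, ...)` allocates the complex array directly. A small sketch of the equivalence:

```julia
# two ways to get a 2x2 complex zero matrix
C1 = complex(zeros(Float64, 2, 2))   # allocates a Float64 array, then a complex copy
C2 = zeros(Complex{Float64}, 2, 2)   # allocates the complex array directly

C1 == C2                  # same values
eltype(C1) == eltype(C2)  # same element type, Complex{Float64}
```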
>>
>> or really:
>> (D, E) = @parallel (add_two_tuples) for ...
>>
>> 5. You'll probably get better numerical stability if you scale down prior
>> to the addition step (so doing D[i,j] += Aij * Bij / samples / samples
>> instead of D[i,j] += Aij * Bij and then dividing before returning)
>>
>> If you make the above changes (specifically 1 and 2), you'll have the
>> allocation down to the minimum size possible for this approach to
>> threading... i.e. (threads) * (size of returned data). If the thread factor
>> remains a problem, it's not *so* bad compiling Julia from source to get
>> SharedArrays.
>>
>> The only other thing to consider would be how you are using the return
>> value. Perhaps you could get away with computing a smaller-sized return
>> value?
>>
>> On Tue, Feb 4, 2014 at 1:49 PM, Alex C <[email protected]> wrote:
>>
>>> Sorry for the confusion. I was trying to get a simple example to work so
>>> I wouldn't get distracted by details. The @everywhere did the trick.
>>> This is the fastest parallel version of the code that I was able to get
>>> working. However, I easily run into memory limitations (8GB RAM) as I
>>> increase the variable 'limit'.
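Point 5 in the quoted advice is about accumulating already-scaled terms. Dividing each term by `samples` before adding keeps the running sum in range; with extreme magnitudes, summing first can overflow where the pre-scaled sum does not. A sketch with deliberately extreme values (not from the thread's data), just to make the effect visible:

```julia
vals = fill(1e308, 4)   # deliberately near the Float64 ceiling
n = length(vals)

# sum first, scale after: the intermediate sum overflows to Inf
late = sum(vals) / n

# scale each term first, then sum: stays finite (≈ 1e308)
early = sum(vals ./ n)
```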
>>>
>>> function expensive_hat(S::Array{Complex{Float64},2},limit::Int64)
>>>
>>> samples = size(S,2);
>>> Ly = int64(limit/4);
>>> Lx = int64(limit/2);
>>>
>>> @everywhere add_three_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>>>
>>> tuphat = @parallel (add_three_tuple) for k = 1:samples
>>> A = B = zeros(Float64,Ly,Lx);
>>> C = complex(zeros(Float64,Ly,Lx));
>>> for j = 1:Lx, i = 1:Ly
>>> Sx = S[i,k];
>>> Sy = S[j,k];
>>> Sxy = S[i+j,k];
>>> Sxyc = conj(Sxy);
>>> A[i,j] += abs2(Sy*Sx);
>>> B[i,j] += abs2(sqrt(Sxyc*Sxy));
>>> C[i,j] += Sxyc*Sy*Sx;
>>> end
>>> (A, B, C)
>>> end
>>>
>>> D = tuphat[1].*tuphat[2]/samples/samples;
>>> E = tuphat[3]/samples;
>>> return (D, E);
>>>
>>> end
>>>
>>> srand(77);
>>> data = rand(24000,64);
>>> limit = 4000;
>>> S = rfft(data,1)./24000;
>>>
>>> @time (D, E) = expensive_hat(S,limit);
>>>
>>>
>>> On Tuesday, February 4, 2014 3:01:17 PM UTC-5, David Salamon wrote:
>>>
>>>> huh.
>>>>
>>>> maybe @everywhere in front of the function definition? I'm not sure
>>>>
>>>> On Tue, Feb 4, 2014 at 10:53 AM, Alex C <[email protected]> wrote:
>>>>
>>>>> Thanks for the hint. Getting rid of 'mx' and 'my' definitely helps.
>>>>>
>>>>> I couldn't figure out how to implement the parallel version of tuple
>>>>> adding. This is what I've got. It crashes my Julia Studio console when I
>>>>> try to run it. What am I missing?
>>>>>
>>>>> add_two_tuple(x,y) = (x[1]+y[1], x[2]+y[2], x[3]+y[3])
>>>>>
>>>>> @parallel (add_two_tuple) for i = 1:100
>>>>> x = (1,2,3)
>>>>> end
>>>>>
>>>>> I also don't have the SharedArray function (running Julia 0.2.0) so I
>>>>> couldn't implement the other alternative.
>>>>>
>>>>> Alex
>>>>>
>>>>> On Monday, February 3, 2014 11:45:57 AM UTC-5, David Salamon wrote:
>>>>>
>>>>>> You're not out of the no-slicing woods yet.
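The reducer in the crashing example can be sanity-checked without `@parallel` at all: `@parallel (f) for ...` is a reduction with `f` over the loop body's last expression, so a serial `reduce` with the same function should produce the shape of the expected result. A minimal sketch (the function name matches the thread's `add_three_tuple`; the sample tuples are arbitrary):

```julia
# element-wise addition of 3-tuples, as used for the @parallel reduction
add_three_tuple(x, y) = (x[1] + y[1], x[2] + y[2], x[3] + y[3])

# serial stand-in for: @parallel (add_three_tuple) for k = 1:3 ... end
partials = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
total = reduce(add_three_tuple, partials)  # (12, 15, 18)
```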
>>>>>> Looks like you can get rid of `mx` and `my`
>>>>>>
>>>>>> for i=1:limit, j=1:int64(limit/2)
>>>>>> end
>>>>>>
>>>>>> As far as parallelizing, you could define:
>>>>>> three_tup_add(a, b, c) = (a[1] + b[1] + c[1], a[2] + b[2] + c[2],
>>>>>> a[3] + b[3] + c[3])
>>>>>>
>>>>>> and then do a @parallel (three_tup_add) over your sample index?
>>>>>>
>>>>>> for that matter, why not compute the two parts of the answer directly
>>>>>> rather than going via A, B, and C?
>>>>>>
>>>>>> On Mon, Feb 3, 2014 at 8:11 AM, Alex C <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks. I've re-written the function to minimize the amount of
>>>>>>> copying (i.e. slicing) that is required. But now, I'm befuddled as to how
>>>>>>> to parallelize this function using Julia. Any suggestions?
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> function expensive_hat(S::Array{Complex{Float64},2},
>>>>>>> mx::Array{Int64,2}, my::Array{Int64,2})
>>>>>>>
>>>>>>> samples = 64
>>>>>>> A = zeros(size(mx));
>>>>>>> B = zeros(size(mx));
>>>>>>> C = zeros(size(mx));
>>>>>>>
>>>>>>> for i = 1:samples
>>>>>>> Si = S[:,i];
>>>>>>> Sx = Si[mx];
>>>>>>> Sy = Si[my];
>>>>>>> Sxy = Si[mx+my];
>>>>>>> Sxyc = conj(Sxy);
>>>>>>>
>>>>>>> A += abs2(Sy .* Sx);
>>>>>>> B += abs2(sqrt(Sxyc .* Sxy));
>>>>>>> C += Sxyc .* Sy .* Sx;
>>>>>>> end
>>>>>>>
>>>>>>> ans = (A .* B ./ samples ./ samples, C./samples)
>>>>>>> return ans
>>>>>>>
>>>>>>> end
>>>>>>>
>>>>>>> data = rand(24000,64);
>>>>>>> limit = 2000;
>>>>>>>
>>>>>>> ix = int64([1:limit/2]);
>>>>>>> iy = ix[1:end/2];
>>>>>>> mg = zeros(Int64,length(iy),length(ix));
>>>>>>> mx = broadcast(+,ix',mg);
>>>>>>> my = broadcast(+,iy,mg);
>>>>>>> S = rfft(data,1)./24000;
>>>>>>>
>>>>>>> @time (AB, C) = expensive_hat(S,mx,my);
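For reference, the `mx`/`my` construction in the oldest message builds full index matrices by broadcasting a row of x-indices and a column of y-indices against a zero matrix: every entry of `mx` carries its column index and every entry of `my` its row index, which is exactly what made the later rewrite in terms of plain loop indices possible. A small sketch of the same pattern with tiny index vectors (values arbitrary):

```julia
ix = [1, 2, 3]                 # x indices (one per column)
iy = [1, 2]                    # y indices (one per row)
mg = zeros(Int, length(iy), length(ix))

mx = broadcast(+, ix', mg)     # each row is [1 2 3]
my = broadcast(+, iy, mg)      # each column is [1, 2]
```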
