If you are are frequently calling rotne_prager with the same number of 
"positions" you could preallocate rpy and pass it to your function. This won't 
help you much in this benchmark, but is probably better for a propagation 
scheme. Also it seems that your tensor is symmetric (i<->j) since it depends 
only on the "distance vector squared"? You could use that symmetry and have 
your loop over j starting at j=i (maybe I am missing something though).

Best,

Alex.


On Friday, 24 October 2014 16:26:20 UTC+2, [email protected]  wrote:
> So, I've been gradually porting some of my more performance sensitive MATLAB 
> research code to Julia, with great results so far. One of my current projects 
> spends about 80% of the computation time calculating the 
> Rotne-Prager-Yamakawa tensor at each time step, so I'm quite motivated to 
> optimize that particular step of the computation as much as possible. 
> Rewriting the code in Julia alone resulted in an order of magnitude speedup. 
> Writing my own implementation of the dyadic product of two vectors gave me 
> another factor of 2 (which I found rather surprising, but still), as did 
> heavy use of @simd and @inbounds annotations where appropriate. Despite these 
> successes, I still feel like there is more performance to be gained here. For 
> one, the code still spends a lot of the time on garbage collection. Also, 
> running julia with --track-allocation=user gave me some perplexing results.
> 
> First things first, the relevant code:
> 
> 
> function outer_product(a, b, n)
>     c = zeros(n, n)
>     @simd for i = 1 : n
>         @simd for j = 1 : n
>             @inbounds c[i, j] = a[i] * b[j]
>         end
>     end
>     c
> end
> 
> function rotne_prager(x, y, z, r)
>     const p = length(x) - 1
>     rpy = zeros(3*(p+1), 3*(p+1))
>     rvec = zeros(3)
>     @simd for i = 1 : p+1
>         @simd for j = 1 : p+1
>             @inbounds rvec[1] = x[j] - x[i]
>             @inbounds rvec[2] = y[j] - y[i]
>             @inbounds rvec[3] = z[j] - z[i]
>             Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
>             distance_ratio = r / Rij
>             if (Rij > 2*r)
>                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * distance_ratio *
>                     ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye(3) 
> +
>                     (1.0 - 2.0*distance_ratio*distance_ratio) * 
> outer_product(rvec, rvec, 3)))
>             else
>                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
>                     (1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
>                     3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3))
>             end
>         end
>     end
>     rpy
> end
> 
> function rotne_bench(n)
>     x = rand(n)
>     y = rand(n)
>     z = rand(n)
>     r = 0.01
>     rotne_prager(x, y, z, r);
> end
> 
> rotne_bench(1000);
> @time (for i=1:10; rotne_bench(1000); end)
> 
> Running this yields the following output on one of my machines:
> 
> 
> elapsed time: 13.440131479 seconds (8718927056 bytes allocated, 36.99% gc 
> time)
> I then tried to add proper type annotations everywhere, but this did not 
> change anything in the results, so I'm assuming Julia's type inference system 
> is doing a great job here. Next, I tried to look at the allocations, using 
> the --track-allocation=user switch (by the way, this did not seem to produce 
> a *jl.mem file anywhere on my Windows machines, but worked fine on a Linux 
> box, is this a known behavior?). This yielded the following output:
> 
> 
>         - function outer_product(a, b, n)
> -70967296     c = zeros(n, n)
>         0     @simd for i = 1 : n
>         0         @simd for j = 1 : n
>         0             @inbounds c[i, j] = a[i] * b[j]
>         -         end
>         -     end
>         0     c
>         - end
>         - 
>         - function rotne_prager(x, y, z, r)
>         0     const p = length(x) - 1
> 792000528     rpy = zeros(3*(p+1), 3*(p+1))
>       880     rvec = zeros(3)
>         0     @simd for i = 1 : p+1
>         0         @simd for j = 1 : p+1
>         0             @inbounds rvec[1] = x[j] - x[i]
>         0             @inbounds rvec[2] = y[j] - y[i]
>         0             @inbounds rvec[3] = z[j] - z[i]
>         0             Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + 
> rvec[3]*rvec[3])
>         0             distance_ratio = r / Rij
>         0             if (Rij > 2*r)
>         0                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * 
> distance_ratio *
>         -                     ((1.0 + 2.0/3.0 * 
> distance_ratio*distance_ratio) * eye(3) +
>         -                     (1.0 - 2.0*distance_ratio*distance_ratio) * 
> outer_product(rvec, rvec, 3)))
>         -             else
> 279582208                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
>         -                     (1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
>         -                     3.0/32.0 / distance_ratio * outer_product(rvec, 
> rvec, 3))
>         -             end
>         -         end
>         -     end
>         0     rpy
>         0 end
>         - 
>         0 function rotne_bench(n)
>         0     x = rand(n)
>         0     y = rand(n)
>         0     z = rand(n)
>         0     r = 0.01
>         0     rotne_prager(x, y, z, r);
>         - end
>         - 
>         - rotne_bench(1000);
>         0 @time (for i=1:10; rotne_bench(1000); end)
> 
> I find myself somewhat confused by these results, specifically line 27. Why 
> would the second expression constructing the rpy matrix cause a huge number 
> of allocations when the first did not? They seem virtually identical from a 
> programming point of view, with minor differences in numerical coefficients. 
> I feel like, in my limited understanding of a language I'm still learning, 
> I'm missing or misunderstanding something fairly obvious. Any comments, 
> suggestions and criticisms are welcome, both with the regards to the 
> performance issues as well as my coding style etc.

Reply via email to