If you are are frequently calling rotne_prager with the same number of "positions" you could preallocate rpy and pass it to your function. This won't help you much in this benchmark, but is probably better for a propagation scheme. Also it seems that your tensor is symmetric (i<->j) since it depends only on the "distance vector squared"? You could use that symmetry and have your loop over j starting at j=i (maybe I am missing something though).
Best, Alex. On Friday, 24 October 2014 16:26:20 UTC+2, [email protected] wrote: > So, I've been gradually porting some of my more performance sensitive MATLAB > research code to Julia, with great results so far. One of my current projects > spends about 80% of the computation time calculating the > Rotne-Prager-Yamakawa tensor at each time step, so I'm quite motivated to > optimize that particular step of the computation as much as possible. > Rewriting the code in Julia alone resulted in an order of magnitude speedup. > Writing my own implementation of the dyadic product of two vectors gave me > another factor of 2 (which I found rather surprising, but still), as did > heavy use of @simd and @inbounds annotations where appropriate. Despite these > successes, I still feel like there is more performance to be gained here. For > one, the code still spends a lot of the time on garbage collection. Also, > running julia with --track-allocation=user gave me some perplexing results. > > First things first, the relevant code: > > > function outer_product(a, b, n) > c = zeros(n, n) > @simd for i = 1 : n > @simd for j = 1 : n > @inbounds c[i, j] = a[i] * b[j] > end > end > c > end > > function rotne_prager(x, y, z, r) > const p = length(x) - 1 > rpy = zeros(3*(p+1), 3*(p+1)) > rvec = zeros(3) > @simd for i = 1 : p+1 > @simd for j = 1 : p+1 > @inbounds rvec[1] = x[j] - x[i] > @inbounds rvec[2] = y[j] - y[i] > @inbounds rvec[3] = z[j] - z[i] > Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3]) > distance_ratio = r / Rij > if (Rij > 2*r) > @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * distance_ratio * > ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye(3) > + > (1.0 - 2.0*distance_ratio*distance_ratio) * > outer_product(rvec, rvec, 3))) > else > @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = ( > (1.0 - 9.0/32.0 / distance_ratio) * eye(3) + > 3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3)) > end > end > end > rpy > end > > function rotne_bench(n) > x = rand(n) > y = rand(n) > z = rand(n) > r = 0.01 > rotne_prager(x, y, z, r); > end > > rotne_bench(1000); > @time (for i=1:10; rotne_bench(1000); end) > > Running this yields the following output on one of my machines: > > > elapsed time: 13.440131479 seconds (8718927056 bytes allocated, 36.99% gc > time) > I then tried to add proper type annotations everywhere, but this did not > change anything in the results, so I'm assuming Julia's type inference system > is doing a great job here. Next, I tried to look at the allocations, using > the --track-allocation=user switch (by the way, this did not seem to produce > a *jl.mem file anywhere on my Windows machines, but worked fine on a Linux > box, is this a known behavior?). This yielded the following output: > > > - function outer_product(a, b, n) > -70967296 c = zeros(n, n) > 0 @simd for i = 1 : n > 0 @simd for j = 1 : n > 0 @inbounds c[i, j] = a[i] * b[j] > - end > - end > 0 c > - end > - > - function rotne_prager(x, y, z, r) > 0 const p = length(x) - 1 > 792000528 rpy = zeros(3*(p+1), 3*(p+1)) > 880 rvec = zeros(3) > 0 @simd for i = 1 : p+1 > 0 @simd for j = 1 : p+1 > 0 @inbounds rvec[1] = x[j] - x[i] > 0 @inbounds rvec[2] = y[j] - y[i] > 0 @inbounds rvec[3] = z[j] - z[i] > 0 Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + > rvec[3]*rvec[3]) > 0 distance_ratio = r / Rij > 0 if (Rij > 2*r) > 0 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * > distance_ratio * > - ((1.0 + 2.0/3.0 * > distance_ratio*distance_ratio) * eye(3) + > - (1.0 - 2.0*distance_ratio*distance_ratio) * > outer_product(rvec, rvec, 3))) > - else > 279582208 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = ( > - (1.0 - 9.0/32.0 / distance_ratio) * eye(3) + > - 3.0/32.0 / distance_ratio * outer_product(rvec, > rvec, 3)) > - end > - end > - end > 0 rpy > 0 end > - > 0 function rotne_bench(n) > 0 x = rand(n) > 0 y = rand(n) > 0 z = rand(n) > 0 r = 0.01 > 0 rotne_prager(x, y, z, r); > - end > - > - rotne_bench(1000); > 0 @time (for i=1:10; rotne_bench(1000); end) > > I find myself somewhat confused by these results, specifically line 27. Why > would the second expression constructing the rpy matrix cause a huge number > of allocations when the first did not? They seem virtually identical from a > programming point of view, with minor differences in numerical coefficients. > I feel like, in my limited understanding of a language I'm still learning, > I'm missing or misunderstanding something fairly obvious. Any comments, > suggestions and criticisms are welcome, both with the regards to the > performance issues as well as my coding style etc.
