Are you sure that Rij > 2*r in your calculations? You should use I instead of eye(3), but probably the best solution is to avoid the explicit calculation of the outer product. You can probably write your problem as a rank one update of a view of pry
Med venlig hilsen Andreas Noack 2014-10-24 10:26 GMT-04:00 <[email protected]>: > So, I've been gradually porting some of my more performance sensitive > MATLAB research code to Julia, with great results so far. One of my current > projects spends about 80% of the computation time calculating the > Rotne-Prager-Yamakawa tensor at each time step, so I'm quite motivated to > optimize that particular step of the computation as much as possible. > Rewriting the code in Julia alone resulted in an order of magnitude > speedup. Writing my own implementation of the dyadic product of two vectors > gave me another factor of 2 (which I found rather surprising, but still), > as did heavy use of @simd and @inbounds annotations where appropriate. > Despite these successes, I still feel like there is more performance to be > gained here. For one, the code still spends a lot of the time on garbage > collection. Also, running julia with --track-allocation=user gave me some > perplexing results. > > First things first, the relevant code: > function outer_product(a, b, n) > c = zeros(n, n) > @simd for i = 1 : n > @simd for j = 1 : n > @inbounds c[i, j] = a[i] * b[j] > end > end > c > end > > function rotne_prager(x, y, z, r) > const p = length(x) - 1 > rpy = zeros(3*(p+1), 3*(p+1)) > rvec = zeros(3) > @simd for i = 1 : p+1 > @simd for j = 1 : p+1 > @inbounds rvec[1] = x[j] - x[i] > @inbounds rvec[2] = y[j] - y[i] > @inbounds rvec[3] = z[j] - z[i] > Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3 > ]) > distance_ratio = r / Rij > if (Rij > 2*r) > @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * > distance_ratio * > ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye > (3) + > (1.0 - 2.0*distance_ratio*distance_ratio) * > outer_product(rvec, rvec, 3))) > else > @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = ( > (1.0 - 9.0/32.0 / distance_ratio) * eye(3) + > 3.0/32.0 / distance_ratio * outer_product(rvec, rvec, > 3)) > end > end > end > rpy > end > > function rotne_bench(n) > x = rand(n) > y = rand(n) > z = rand(n) > r = 0.01 > rotne_prager(x, y, z, r); > end > > rotne_bench(1000); > @time (for i=1:10; rotne_bench(1000); end) > > Running this yields the following output on one of my machines: > elapsed time: 13.440131479 seconds (8718927056 bytes allocated, 36.99% gc > time) > > I then tried to add proper type annotations everywhere, but this did not > change anything in the results, so I'm assuming Julia's type inference > system is doing a great job here. Next, I tried to look at the allocations, > using the --track-allocation=user switch (by the way, this did not seem to > produce a *jl.mem file anywhere on my Windows machines, but worked fine on > a Linux box, is this a known behavior?). This yielded the following output: > - function outer_product(a, b, n) > -70967296 c = zeros(n, n) > 0 @simd for i = 1 : n > 0 @simd for j = 1 : n > 0 @inbounds c[i, j] = a[i] * b[j] > - end > - end > 0 c > - end > - > - function rotne_prager(x, y, z, r) > 0 const p = length(x) - 1 > 792000528 rpy = zeros(3*(p+1), 3*(p+1)) > 880 rvec = zeros(3) > 0 @simd for i = 1 : p+1 > 0 @simd for j = 1 : p+1 > 0 @inbounds rvec[1] = x[j] - x[i] > 0 @inbounds rvec[2] = y[j] - y[i] > 0 @inbounds rvec[3] = z[j] - z[i] > 0 Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[ > 3]*rvec[3]) > 0 distance_ratio = r / Rij > 0 if (Rij > 2*r) > 0 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * > distance_ratio * > - ((1.0 + 2.0/3.0 * distance_ratio* > distance_ratio) * eye(3) + > - (1.0 - 2.0*distance_ratio*distance_ratio) * > outer_product(rvec, rvec, 3))) > - else > 279582208 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = ( > - (1.0 - 9.0/32.0 / distance_ratio) * eye(3) + > - 3.0/32.0 / distance_ratio * outer_product( > rvec, rvec, 3)) > - end > - end > - end > 0 rpy > 0 end > - > 0 function rotne_bench(n) > 0 x = rand(n) > 0 y = rand(n) > 0 z = rand(n) > 0 r = 0.01 > 0 rotne_prager(x, y, z, r); > - end > - > - rotne_bench(1000); > 0 @time (for i=1:10; rotne_bench(1000); end) > > I find myself somewhat confused by these results, specifically line 27. > Why would the second expression constructing the rpy matrix cause a huge > number of allocations when the first did not? They seem virtually identical > from a programming point of view, with minor differences in numerical > coefficients. I feel like, in my limited understanding of a language I'm > still learning, I'm missing or misunderstanding something fairly obvious. > Any comments, suggestions and criticisms are welcome, both with the regards > to the performance issues as well as my coding style etc. > >
