So, I've been gradually porting some of my more performance sensitive
MATLAB research code to Julia, with great results so far. One of my current
projects spends about 80% of the computation time calculating the
Rotne-Prager-Yamakawa tensor at each time step, so I'm quite motivated to
optimize that particular step of the computation as much as possible.
Rewriting the code in Julia alone resulted in an order of magnitude
speedup. Writing my own implementation of the dyadic product of two vectors
gave me another factor of 2 (which I found rather surprising, but still),
as did heavy use of @simd and @inbounds annotations where appropriate.
Despite these successes, I still feel like there is more performance to be
gained here. For one, the code still spends a lot of the time on garbage
collection. Also, running julia with --track-allocation=user gave me some
perplexing results.
First things first, the relevant code:
function outer_product(a, b, n)
c = zeros(n, n)
@simd for i = 1 : n
@simd for j = 1 : n
@inbounds c[i, j] = a[i] * b[j]
end
end
c
end
function rotne_prager(x, y, z, r)
const p = length(x) - 1
rpy = zeros(3*(p+1), 3*(p+1))
rvec = zeros(3)
@simd for i = 1 : p+1
@simd for j = 1 : p+1
@inbounds rvec[1] = x[j] - x[i]
@inbounds rvec[2] = y[j] - y[i]
@inbounds rvec[3] = z[j] - z[i]
Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
distance_ratio = r / Rij
if (Rij > 2*r)
@inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 *
distance_ratio *
((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye(3
) +
(1.0 - 2.0*distance_ratio*distance_ratio) *
outer_product(rvec, rvec, 3)))
else
@inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
(1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3
))
end
end
end
rpy
end
function rotne_bench(n)
x = rand(n)
y = rand(n)
z = rand(n)
r = 0.01
rotne_prager(x, y, z, r);
end
rotne_bench(1000);
@time (for i=1:10; rotne_bench(1000); end)
Running this yields the following output on one of my machines:
elapsed time: 13.440131479 seconds (8718927056 bytes allocated, 36.99% gc
time)
I then tried to add proper type annotations everywhere, but this did not
change anything in the results, so I'm assuming Julia's type inference
system is doing a great job here. Next, I tried to look at the allocations,
using the --track-allocation=user switch (by the way, this did not seem to
produce a *jl.mem file anywhere on my Windows machines, but worked fine on
a Linux box, is this a known behavior?). This yielded the following output:
- function outer_product(a, b, n)
-70967296 c = zeros(n, n)
0 @simd for i = 1 : n
0 @simd for j = 1 : n
0 @inbounds c[i, j] = a[i] * b[j]
- end
- end
0 c
- end
-
- function rotne_prager(x, y, z, r)
0 const p = length(x) - 1
792000528 rpy = zeros(3*(p+1), 3*(p+1))
880 rvec = zeros(3)
0 @simd for i = 1 : p+1
0 @simd for j = 1 : p+1
0 @inbounds rvec[1] = x[j] - x[i]
0 @inbounds rvec[2] = y[j] - y[i]
0 @inbounds rvec[3] = z[j] - z[i]
0 Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3
]*rvec[3])
0 distance_ratio = r / Rij
0 if (Rij > 2*r)
0 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 *
distance_ratio *
- ((1.0 + 2.0/3.0 * distance_ratio*
distance_ratio) * eye(3) +
- (1.0 - 2.0*distance_ratio*distance_ratio) *
outer_product(rvec, rvec, 3)))
- else
279582208 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
- (1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
- 3.0/32.0 / distance_ratio * outer_product(rvec
, rvec, 3))
- end
- end
- end
0 rpy
0 end
-
0 function rotne_bench(n)
0 x = rand(n)
0 y = rand(n)
0 z = rand(n)
0 r = 0.01
0 rotne_prager(x, y, z, r);
- end
-
- rotne_bench(1000);
0 @time (for i=1:10; rotne_bench(1000); end)
I find myself somewhat confused by these results, specifically line 27. Why
would the second expression constructing the rpy matrix cause a huge number
of allocations when the first did not? They seem virtually identical from a
programming point of view, with minor differences in numerical
coefficients. I feel like, in my limited understanding of a language I'm
still learning, I'm missing or misunderstanding something fairly obvious.
Any comments, suggestions and criticisms are welcome, both with the regards
to the performance issues as well as my coding style etc.