So, I've been gradually porting some of my more performance-sensitive 
MATLAB research code to Julia, with great results so far. One of my current 
projects spends about 80% of its computation time calculating the 
Rotne-Prager-Yamakawa tensor at each time step, so I'm quite motivated to 
optimize that particular step as much as possible. Rewriting the code in 
Julia alone resulted in an order-of-magnitude speedup. Writing my own 
implementation of the dyadic product of two vectors gave me another factor 
of 2 (which I found rather surprising, but still), as did liberal use of 
@simd and @inbounds annotations where appropriate. Despite these successes, 
I still feel there is more performance to be gained here. For one, the code 
still spends a lot of its time on garbage collection. Also, running julia 
with --track-allocation=user gave me some perplexing results.

First things first, the relevant code:
function outer_product(a, b, n)
    c = zeros(n, n)
    @simd for i = 1 : n
        @simd for j = 1 : n
            @inbounds c[i, j] = a[i] * b[j]
        end
    end
    c
end

function rotne_prager(x, y, z, r)
    const p = length(x) - 1
    rpy = zeros(3*(p+1), 3*(p+1))
    rvec = zeros(3)
    @simd for i = 1 : p+1
        @simd for j = 1 : p+1
            @inbounds rvec[1] = x[j] - x[i]
            @inbounds rvec[2] = y[j] - y[i]
            @inbounds rvec[3] = z[j] - z[i]
            Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
            distance_ratio = r / Rij
            if (Rij > 2*r)
                @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * distance_ratio *
                    ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye(3) +
                    (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, rvec, 3)))
            else
                @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
                    (1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
                    3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3))
            end
        end
    end
    rpy
end

function rotne_bench(n)
    x = rand(n)
    y = rand(n)
    z = rand(n)
    r = 0.01
    rotne_prager(x, y, z, r);
end

rotne_bench(1000);
@time (for i=1:10; rotne_bench(1000); end)

Running this yields the following output on one of my machines:
elapsed time: 13.440131479 seconds (8718927056 bytes allocated, 36.99% gc time)
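
That is roughly 870 MB of allocations per call, which matches the GC 
pressure I'm seeing. I suspect a large share of it comes from outer_product 
building a fresh 3x3 matrix on every inner iteration (a million times per 
call for n = 1000). One thing I have not benchmarked yet is an in-place 
variant that reuses a caller-provided buffer; the name outer_product! and 
the buffer argument are just my own untested sketch:

function outer_product!(c, a, b, n)
    # Fill a preallocated buffer instead of calling zeros(n, n) each time;
    # j is the outer loop so writes go down columns (Julia is column-major).
    @inbounds for j = 1 : n
        @simd for i = 1 : n
            c[i, j] = a[i] * b[j]
        end
    end
    c
end

A 3x3 buffer could then be allocated once next to rvec and reused for every 
pair of particles, which should take the zeros(n, n) allocation out of the 
hot loop entirely.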

I then tried to add proper type annotations everywhere, but this did not 
change the results at all, so I'm assuming Julia's type inference is doing 
a great job here. Next, I looked at the allocations, using the 
--track-allocation=user switch. (By the way, this did not seem to produce 
any *.jl.mem files on my Windows machines, but worked fine on a Linux box; 
is this known behavior?)
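Concretely, the steps I followed were roughly these (the script name is 
just a placeholder):

$ julia --track-allocation=user
julia> include("rotne.jl")           # warm-up run, so compilation is not counted
julia> Profile.clear_malloc_data()   # reset the counters after warm-up
julia> include("rotne.jl")           # the measured run
julia> exit()                        # *.jl.mem files are written on exit

This yielded the following output: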
        - function outer_product(a, b, n)
-70967296     c = zeros(n, n)
        0     @simd for i = 1 : n
        0         @simd for j = 1 : n
        0             @inbounds c[i, j] = a[i] * b[j]
        -         end
        -     end
        0     c
        - end
        - 
        - function rotne_prager(x, y, z, r)
        0     const p = length(x) - 1
792000528     rpy = zeros(3*(p+1), 3*(p+1))
      880     rvec = zeros(3)
        0     @simd for i = 1 : p+1
        0         @simd for j = 1 : p+1
        0             @inbounds rvec[1] = x[j] - x[i]
        0             @inbounds rvec[2] = y[j] - y[i]
        0             @inbounds rvec[3] = z[j] - z[i]
        0             Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
        0             distance_ratio = r / Rij
        0             if (Rij > 2*r)
        0                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * distance_ratio *
        -                     ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye(3) +
        -                     (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, rvec, 3)))
        -             else
279582208                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
        -                     (1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
        -                     3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3))
        -             end
        -         end
        -     end
        0     rpy
        0 end
        - 
        0 function rotne_bench(n)
        0     x = rand(n)
        0     y = rand(n)
        0     z = rand(n)
        0     r = 0.01
        0     rotne_prager(x, y, z, r);
        - end
        - 
        - rotne_bench(1000);
        0 @time (for i=1:10; rotne_bench(1000); end)

I find myself somewhat confused by these results, specifically by line 27 
(the else-branch assignment into rpy, charged with 279582208 bytes). Why 
would the second expression filling a block of the rpy matrix cause a huge 
number of allocations when the first did not? The two branches look 
virtually identical from a programming point of view, differing only in 
their numerical coefficients. I feel like, in my limited understanding of a 
language I'm still learning, I'm missing or misunderstanding something 
fairly obvious. Any comments, suggestions and criticisms are welcome, both 
with regard to the performance issues and to my coding style etc.
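
In case it helps frame the discussion: the next thing I'm considering is 
devectorizing the block assignment itself, so that the eye(3), 
outer_product and slicing temporaries disappear completely. Here is an 
untested sketch for the Rij > 2*r branch, with coefficient names of my own 
choosing; the else branch would get the same treatment with its own two 
coefficients:

# Untested sketch: write the 3x3 block entry by entry, no temporaries.
c1 = 0.75 * distance_ratio * (1.0 + 2.0/3.0 * distance_ratio*distance_ratio)
c2 = 0.75 * distance_ratio * (1.0 - 2.0*distance_ratio*distance_ratio)
for l = 1 : 3, k = 1 : 3
    @inbounds rpy[3*j-3+k, 3*i-3+l] = c2 * rvec[k] * rvec[l] + (k == l ? c1 : 0.0)
end

If the temporaries really are the problem, this should shrink both the byte 
counts in the .mem dump and the GC share of the runtime.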
