First of all, thank you for a very comprehensive and informative reply. I 
re-checked the performance impact of @simd/@inbounds and it seems you're 
right. The annotations do provide a measurable performance boost on longer 
calculations, but the gain is slightly under 2% on average and therefore 
almost negligible. I suspect I made other optimizations at the same time I 
last changed the annotations, which led me to an incorrect conclusion.
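
(For anyone following along, the annotation pattern in question looks roughly 
like this when applied to the outer product kernel; this is only a sketch with 
an illustrative name, and @simd is placed on the inner loop since it expects a 
simple one-dimensional range:)

    function outer_product_annotated!(a, b, n, c)
        for i = 1:n
            @simd for j = 1:n
                # skip bounds checks and allow vectorization of the inner loop
                @inbounds c[j,i] = a[j] * b[i]
            end
        end
        c
    end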

I tried using the BLAS function for the outer product and it does get 
fairly close to my custom kernel, though not quite there. It's great to see 
how easy Julia makes it to follow method calls that would be quite 
difficult to track down in other languages. I did, however, find a way to 
make my outer product function even faster: passing in a preallocated 3x3 
matrix and mutating its elements in place avoids repeating the zeros(3,3) 
call every time. This also reduced the allocation numbers accordingly.
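
(As a side note, the BLAS call can also be pointed at a preallocated buffer, 
since ger! updates its last argument in place; a rough sketch, with the 
outer_product_blas! name purely illustrative:)

    # ger! computes A := alpha*x*y' + A in place, so the reused buffer c has
    # to be reset to zero before each rank-1 update.
    outer_product_blas!(a, b, c) = Base.LinAlg.BLAS.ger!(1.0, a, b, fill!(c, 0.0))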

Using I instead of eye(3) as you suggested gave me a very considerable 
performance boost, and I hadn't seen it mentioned anywhere in the Julia docs. 
I'll have to go through the source to see how it does its magic; the 
difference is quite remarkable to me. It's also interesting that putting the 
conditional inside the assignment makes the code run faster than the other 
way around, but I can't argue with the results. On average, these two changes 
gave me a steady 40-50% performance boost, which is great.
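
(Following your note that I is an abstract identity that avoids allocating an 
explicit matrix, the change boils down to something like this:)

    A = rand(3, 3)

    # old: eye(3) materializes a dense 3x3 identity matrix every time the
    # expression is evaluated
    B1 = 2.0 * eye(3) + A

    # new: I is never built into an explicit matrix, so the scaling and
    # addition go straight into the result
    B2 = 2.0 * I + A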

The one thing I was still dissatisfied with was that almost a third of the 
total runtime was spent on garbage collection, so I re-ran the allocation 
tracking and got the following results:
        - function outer_product!(a, b, n, c)
        0     for i=1:n, j=1:n
        0         c[j,i] = a[j] * b[i]
        -     end
        0     c
        - end
        - 
        - #outer_product(a, b, n) = Base.LinAlg.BLAS.ger!(1.0, a, b, zeros(n, n))
        - 
        - function rotne_prager(x, y, z, r)
        0     const p = length(x) - 1
792000528     rpy = zeros(3*(p+1), 3*(p+1))
      880     rvec = zeros(3)
     1408     c = zeros(3,3)
1406543616     for i=1:p+1, j=1:p+1
        0         rvec[1] = x[j] - x[i]
        0         rvec[2] = y[j] - y[i]
        0         rvec[3] = z[j] - z[i]
        0         Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
        0         distance_ratio = r / Rij
        0         rpy[3*j-2:3*j, 3*i-2:3*i] = if (Rij > 2*r)
        0             (0.75 * distance_ratio *
        -                 ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * I +
        -                 (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product!(rvec, rvec, 3, c)))
        0             else ((1.0 - 9.0/32.0 / distance_ratio) * I +
        -                 3.0/32.0 / distance_ratio * outer_product!(rvec, rvec, 3, c))
        -             end
        -     end
        0     rpy
        - end
        - 
        - function rotne_bench(n)
        0     x = rand(n)
        0     y = rand(n)
        0     z = rand(n)
        0     r = 0.01
        0     rotne_prager(x, y, z, r);
        - end
        - 
        - rotne_bench(1000);
        - @time (for i=1:10; rotne_bench(1000); end)


The one thing that seems odd to me is that huge number next to the for 
loop. Surely any allocation from the loop's iterator variables is tiny, and 
every other variable except for the two Float64s (Rij and distance_ratio) 
is preallocated. Does this mean that a large number of temporaries are 
generated while evaluating the rpy expressions? Intuitively, it seems to me 
that fairly simple code like this shouldn't have to spend 30% of its time on 
gc(), but perhaps I'm wrong.
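
(If temporaries from those expressions really are the culprit, one thing I 
could try is writing each 3x3 block into rpy element by element instead of 
building it from scalar * I + scalar * matrix arithmetic; a rough, untested 
sketch, with fill_block! as a placeholder name:)

    # Fills the 3x3 block rpy[3*j-2:3*j, 3*i-2:3*i] directly, so the arithmetic
    # creates no intermediate 3x3 matrices. c1 is the coefficient of the
    # identity part, c2 the coefficient of the outer product rvec*rvec'.
    function fill_block!(rpy, i, j, rvec, Rij, r, distance_ratio)
        if Rij > 2*r
            c1 = 0.75 * distance_ratio * (1.0 + 2.0/3.0 * distance_ratio*distance_ratio)
            c2 = 0.75 * distance_ratio * (1.0 - 2.0*distance_ratio*distance_ratio)
        else
            c1 = 1.0 - 9.0/32.0 / distance_ratio
            c2 = 3.0/32.0 / distance_ratio
        end
        for b = 1:3, a = 1:3
            rpy[3*j-3+a, 3*i-3+b] = c2 * rvec[a] * rvec[b] + (a == b ? c1 : 0.0)
        end
        rpy
    end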

On Friday, October 24, 2014 7:38:36 PM UTC+3, Jiahao Chen wrote:
>
> Are you sure that @simd and @inbounds give you any performance gains? On 
> my laptop removing all those annotations resulted in essentially no change 
> in performance.
>
> Computing the outer product as a*b' is currently slow because it does not 
> use the specialized method for computing outer products. Instead, b' turns 
> the n-vector into a 1xn-row matrix, then a*b' turns a from an n-vector to an 
> nx1-column matrix, then does generic matrix-matrix multiplication (dgemm). 
> You can verify this yourself by running
>
> @which a*b'
>
> and looking at the method of * that is being dispatched on.
>
> As Andreas said, you can try the rank-1 update; one way to use BLAS's 
> function for this is
>
> outer_product(a, b, n) = Base.LinAlg.BLAS.ger!(1.0, a, b, zeros(n, n))
>
> but for 3x3 outer products writing your own kernel may very well be faster 
> because you avoid the cost of making a library call.
>
> For me the biggest speedup came from rewriting your inner kernel as
>
>             rpy[3*j-2:3*j, 3*i-2:3*i] = if (Rij > 2*r)
>                 (0.75 * distance_ratio *
>                     ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * I +
>                     (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, rvec, 3)))
>             else
>                     (1.0 - 9.0/32.0 / distance_ratio) * I +
>                     3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3)
>             end
>
> - The I symbol is an abstract identity matrix that avoids allocating an 
> explicit matrix.
> - The compiler seems happier when I move the assignment outside the if 
> statement.
>
> Thanks,
>
> Jiahao Chen
> Staff Research Scientist
> MIT Computer Science and Artificial Intelligence Laboratory
>  
