Re: [julia-users] Trying to optimize numerical code - peculiar allocation data

Andreas Noack Fri, 24 Oct 2014 18:58:16 -0700

Allocating arrays in the inner loop in code like this will unavoidable make
the gc kick in and slow down the execution. Sometimes you can speed things
up a lot with ArrayViews, but in this case the allocation of the views will
also be expensive.


Hence, you can either devectorise your code completely or use pointers. The
following devectorised version is 10x faster on my machine because it
avoids the allocation in the loop.

function rotne_prager1(x, y, z, r)
    p = length(x) - 1
    rpy = zeros(3*(p+1), 3*(p+1))
    rvec = zeros(3)
    c = zeros(3,3)
    @inbounds begin
    for i = 1:p+1, j = 1:p+1
        rvec[1] = x[j] - x[i]
        rvec[2] = y[j] - y[i]
        rvec[3] = z[j] - z[i]
        Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
        distance_ratio = r / Rij
        for l = 1:3, k = 1:3
         if Rij > 2*r
         tmp = 0.75 * distance_ratio *(1.0 -
2.0*distance_ratio*distance_ratio) * rvec[k] * rvec[l]
         if k == l
         tmp += 0.75 * distance_ratio * (1.0 + 2.0/3.0 *
distance_ratio*distance_ratio)
         end
         else
         tmp = 3.0/32.0 / distance_ratio * rvec[k] * rvec[l]
         if k == l
         tmp += 1.0 - 9.0/32.0 / distance_ratio
         end
         end
         rpy[3*(j - 1) + k, 3*(i - 1) + l] = tmp
        end
    end
    end
    rpy
end

but unfortunately it is less clear what is going on in this version.

Med venlig hilsen

Andreas Noack

2014-10-24 20:12 GMT-04:00 <[email protected]>:

> I'm currently avoiding the use of FMMs because the systems I run are
> fairly small (colloidal soft matter, typically involving a few hundreds of
> spheres in a small number of aggregates, so each time step is quite fast,
> but the dynamics take a long time to resolve) and I'd rather not sacrifice
> any precision if I don't have to. Neighbor lists might be an interesting
> option to consider, however.
>
> I'm not entirely sure what would be gained by providing a function that
> computes the tensor's action on a vector instead of computing it directly.
> In a Stokes flow, the velocity of an arbitrary assortment of spheres can be
> approximately expressed as v = rpy * force_density, which is basically what
> I'm doing in this case. Assuming the force_density vector isn't sparse,
> wouldn't you have to compute every element of the tensor anyway?
>
> On Friday, October 24, 2014 7:43:56 PM UTC+3, Jiahao Chen wrote:
>>
>> Without knowing much more about your specific problem and use cases, you
>> may also want to consider:
>>
>> - never computing the tensor directly, but provide an indirect
>> representation of it by providing a function that  computes its action on a
>> vector
>> - using a fast multipole algorithm to determine the distance cutoff
>> - reducing the number of explicit distance computations using neighbor
>> lists
>>
>> Thanks,
>>
>> Jiahao Chen
>> Staff Research Scientist
>> MIT Computer Science and Artificial Intelligence Laboratory
>>
>> On Fri, Oct 24, 2014 at 12:38 PM, Jiahao Chen <[email protected]> wrote:
>>
>>> Are you sure that @simd and @inbounds give you any performance gains? On
>>> my laptop removing all those annotations resulted in essentially no change
>>> in performance.
>>>
>>> Computing the outer product as a*b' is currently slow because it does
>>> not use the specialized method for computing outer products. Instead, b'
>>> turns the n-vector into a 1xn-row matrix, then a*b' turns a from a n-vector
>>> to a nx1-column matrix, then does generic matrix-matrix multiplication
>>> (dgemm). You can verify this yourself by running
>>>
>>> @which a*b'
>>>
>>> and looking at the method of * that is being dispatched on.
>>>
>>> As Andreas said, you can try the rank-1 update; one way to use BLAS's
>>> function for this is
>>>
>>> outer_product(a, b, n) = Base.LinAlg.BLAS.ger!(1.0, a, b, zeros(n, n))
>>>
>>> but for 3x3 outer products writing your own kernel may very well be
>>> faster because you avoid the cost of making a library call.
>>>
>>> For me the biggest speedup came from rewriting your inner kernel as
>>>
>>>             rpy[3*j-2:3*j, 3*i-2:3*i] = if (Rij > 2*r)
>>>                 (0.75 * distance_ratio *
>>>                     ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * I
>>> +
>>>                     (1.0 - 2.0*distance_ratio*distance_ratio) *
>>> outer_product(rvec, rvec, 3)))
>>>             else
>>>                     (1.0 - 9.0/32.0 / distance_ratio) * I +
>>>                     3.0/32.0 / distance_ratio * outer_product(rvec,
>>> rvec, 3)
>>>             end
>>>
>>> - The I symbol is an abstract identity matrix that avoids allocating an
>>> explicit matrix.
>>> - The compiler seems happier when I move the assignment outside the if
>>> statement.
>>>
>>> Thanks,
>>>
>>> Jiahao Chen
>>> Staff Research Scientist
>>> MIT Computer Science and Artificial Intelligence Laboratory
>>>
>>> On Fri, Oct 24, 2014 at 11:24 AM, <[email protected]> wrote:
>>>
>>>> I'm absolutely sure that Rij > 2*r most of the time, yes. This is
>>>> obvious if you consider that rand(n) returns values in the range [0, 1] and
>>>> r = 0.01. I have also verified this empirically. I'll try to explore the
>>>> view update idea further, I've never used that approach before.
>>>>
>>>>
>>>> On Friday, October 24, 2014 5:46:37 PM UTC+3, Andreas Noack wrote:
>>>>>
>>>>> Are you sure that Rij > 2*r in your calculations?
>>>>>
>>>>> You should use I instead of eye(3), but probably the best solution is
>>>>> to avoid the explicit calculation of the outer product. You can probably
>>>>> write your problem as a rank one update of a view of pry
>>>>>
>>>>> Med venlig hilsen
>>>>>
>>>>> Andreas Noack
>>>>>
>>>>
>>>
>>

Re: [julia-users] Trying to optimize numerical code - peculiar allocation data

Reply via email to