Thanks Ivar and Tom, that did the trick, and the performance is within reach
of Fortran:

*9.5e-5s*
I will implement the missing functionality and do some cleanup so this can
go into ODE.jl. Thanks for the help everybody!


2013/12/18 Ivar Nesje <[email protected]>

> I did not intend to suggest that it would be good for the interface to
> hard code the function. I just noted that some overhead vanishes. There
> might be some metaprogramming tricks you can do to get your desired
> interface back.
>
> The usual approach in Julia for returning an array without allocation is
> the same as in C: you allocate the array in the calling function and
> modify it inside the function. In Julia the convention is that such
> functions end in ! and modify their first argument.
>
>
> function gravity!(ret::Vector{Float64}, t::Float64, y::Vector{Float64}, mu::Float64)
>     r = norm(y[1:3])
>     ret[1:3] = y[4:6]
>     ret[4:6] = -mu*y[1:3]/r/r/r
>     return ret
> end
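A minimal sketch of the calling pattern Ivar describes, repeating the gravity! definition for completeness (the state vector, mu value, and loop bounds are made-up example values, not from the thread):

```julia
using LinearAlgebra  # for norm

# In-place right-hand side: fills ret instead of allocating a return array.
function gravity!(ret::Vector{Float64}, t::Float64, y::Vector{Float64}, mu::Float64)
    r = norm(y[1:3])
    ret[1:3] = y[4:6]
    ret[4:6] = -mu*y[1:3]/r/r/r
    return ret
end

y = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]  # position + velocity (example values)
ret = similar(y)                     # allocated once, outside the hot loop
for t in 0.0:0.1:1.0
    gravity!(ret, t, y, 1.0)         # mutates ret in place each call
end
```

The caller owns the storage, so repeated calls inside an integrator loop reuse the same buffer.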
>
>
> Unfortunately Julia does not support efficient slicing, so I was unsure if
> you could gain anything from this approach, because you use the returned
> array on multiple occasions later.
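One way around the slice allocations Ivar mentions (a sketch, not code from the thread) is to devectorize the in-place version with explicit indexing, so no temporary arrays are created at all:

```julia
# Fully devectorized in-place right-hand side: avoids the temporaries that
# y[1:3]-style slices create (slices copy in Julia).
function gravity_devec!(ret::Vector{Float64}, t::Float64, y::Vector{Float64}, mu::Float64)
    r = sqrt(y[1]*y[1] + y[2]*y[2] + y[3]*y[3])
    r3 = r*r*r
    ret[1] = y[4]; ret[2] = y[5]; ret[3] = y[6]
    ret[4] = -mu*y[1]/r3
    ret[5] = -mu*y[2]/r3
    ret[6] = -mu*y[3]/r3
    return ret
end

ret = zeros(6)
gravity_devec!(ret, 0.0, [1.0, 0.0, 0.0, 0.0, 1.0, 0.0], 1.0)
```

This combines the in-place convention with the devectorization already tried earlier in the thread.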
>
> On Wednesday, December 18, 2013 at 15:09:18 UTC+1, Helge Eichhorn wrote:
>>
>> I have updated the Gist <https://gist.github.com/helgee/8019521>.
>>
>>
>> 2013/12/18 Helge Eichhorn <[email protected]>
>>
>>> @Ivar: Using a regular function is only possible in this special case,
>>> because generally the user must supply the callback function.
>>>
>>> @Tim: I think that I can also rule out type problems; at least the
>>> output of code_typed is free of Unions. I should also mention that
>>> I have already done a lot of profiling.
>>>
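The code_typed check mentioned above can be sketched like this (the gravity function here is a simplified stand-in for the one in the Gist):

```julia
# Stand-in for the callback: type-stable, returns a Vector{Float64}.
function gravity(t::Float64, y::Vector{Float64}, mu::Float64)
    r = sqrt(y[1]^2 + y[2]^2 + y[3]^2)
    r3 = r^3
    [y[4], y[5], y[6], -mu*y[1]/r3, -mu*y[2]/r3, -mu*y[3]/r3]
end

# The second element of each code_typed entry is the inferred return type;
# a concrete type (no Union) indicates type-stable code.
rettype = code_typed(gravity, (Float64, Vector{Float64}, Float64))[1][2]
@show rettype
```

Interactively, @code_warntype gravity(0.0, zeros(6), 1.0) gives the same information with non-concrete types highlighted.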
>>> What did help was re-writing the callback function like so:
>>>
>>> function gravity1(t::Float64, y::Vector{Float64}, mu::Float64)
>>>     x, y, z, vx, vy, vz = y
>>>     r = sqrt(x*x+y*y+z*z)
>>>     r3 = r*r*r
>>>     [vx, vy, vz, -mu*x/r3, -mu*y/r3, -mu*z/r3]
>>> end
>>>
>>> and especially reducing memory allocations by writing the intermediary
>>> solutions in the core integrator to a single large work array. This feels
>>> worse style-wise since the function now has side effects, but the runtime
>>> is down to *1.7e-4s*.
>>>
>>> According to the sampling profiler, the remaining bottleneck is the
>>> allocation of the array that is returned by the callback function. I was
>>> thinking about returning a tuple instead, but that seems rather impractical
>>> because the data will be written to the work array afterwards. Other ideas?
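A hypothetical sketch of one answer to this question: let the callback write directly into a slice (a view) of the integrator's work array, so the per-call return allocation disappears entirely. The 6x13 layout and the stage index are made-up illustration values, not the actual DOP853 layout:

```julia
# In-place callback that accepts any vector-like target, including a view.
function gravity!(ret::AbstractVector{Float64}, t::Float64,
                  y::AbstractVector{Float64}, mu::Float64)
    x, y1, z, vx, vy, vz = y
    r3 = sqrt(x*x + y1*y1 + z*z)^3
    ret[1] = vx; ret[2] = vy; ret[3] = vz
    ret[4] = -mu*x/r3; ret[5] = -mu*y1/r3; ret[6] = -mu*z/r3
    return ret
end

work = zeros(6, 13)                       # one column per RK stage (illustrative)
y0 = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
gravity!(view(work, :, 1), 0.0, y0, 1.0)  # fills column 1 in place, no allocation of a result array
```

The intermediary results then live where the integrator needs them, without the copy step after each callback.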
>>>
>>>
>>> 2013/12/18 Tim Holy <[email protected]>
>>>
>>>  Also, I just added a new section
>>>> http://docs.julialang.org/en/latest/manual/performance-tips/#tools
>>>> that advertises the available tools for helping you diagnose performance
>>>> problems.
>>>>
>>>> Without taking the time to look at your code, I'll just add that whenever
>>>> I see an orders-of-magnitude discrepancy between C/Fortran and Julia, my
>>>> first instinct is to suspect a type problem. The fact that a vectorized
>>>> version is a bit faster than one written with loops might also support
>>>> this diagnosis.
>>>>
>>>> Best,
>>>> --Tim
>>>>
>>>> On Wednesday, December 18, 2013 03:26:11 AM Ivar Nesje wrote:
>>>> > My first suggestion to anyone trying to write fast Julia programs is to
>>>> > read http://docs.julialang.org/en/latest/manual/performance-tips/; those
>>>> > are all good tips that I do not think will become obsolete as Julia
>>>> > improves. It seems to me like you know those points already.
>>>> >
>>>> > I think you get an important hint from the fact that devectorization
>>>> > does not matter. To me it seems like the current bottleneck is that you
>>>> > use an anonymous function instead of a regular function. When I replace
>>>> > "f(" by "gravity(" I get some improvement, and then your devectorization
>>>> > attempts make a significant difference. Further, you might want to try
>>>> > to reduce the amount of memory allocated, but that seems to complicate
>>>> > your code quite a bit.
>>>> >
>>>> > My improvements reduce the timings as follows for 1000 iterations.
>>>> > ivar@Ivar-ubuntu:~/tmp$ julia doptest.jl
>>>> > elapsed time: 0.878398771 seconds (513399840 bytes allocated)
>>>> > ivar@Ivar-ubuntu:~/tmp$ julia dopitest.jl
>>>> > elapsed time: 0.16916126 seconds (122423840 bytes allocated)
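The experiment Ivar describes can be sketched as below: driving the inner loop through a function-valued argument (as with an anonymous function) versus a direct call. Note that in 2013-era Julia the indirect call was much slower; modern Julia specializes on function arguments, so the gap has largely disappeared. The gravity stand-in and loop count are illustrative, not from the Gist:

```julia
# Simplified right-hand side used only to have something to call.
gravity(t, y, mu) = [y[4], y[5], y[6], -mu*y[1], -mu*y[2], -mu*y[3]]

# Driver that calls the RHS through a function argument, as an ODE core would.
function drive(f, y, mu, n)
    s = 0.0
    for _ in 1:n
        s += f(0.0, y, mu)[4]   # accumulate one component of the RHS
    end
    return s
end

y0 = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
@time drive(gravity, y0, 1.0, 1000)                          # named function
@time drive((t, y, mu) -> gravity(t, y, mu), y0, 1.0, 1000)  # anonymous wrapper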
>>>> >
>>>> > On Wednesday, December 18, 2013 at 11:07:30 UTC+1, Helge Eichhorn wrote:
>>>> > > Hi,
>>>> > >
>>>> > > I spent the last few days porting the well-known DOP853
>>>> > > <http://www.unige.ch/~hairer/software.html> integrator to Julia. The
>>>> > > process was quite smooth and I have implemented the core
>>>> > > functionality. However, when I run my reference case, a numerical
>>>> > > solution of the two-body problem, I get the following timings:
>>>> > >
>>>> > >    - *Fortran* (gfortran 4.8.2, no optimizations): *~1.7e-5s*
>>>> > >    - *Julia* (master, looped): *~1.3e-3s*
>>>> > >    - *Julia* (master, vectorized): *~1e-3s (!)*
>>>> > >
>>>> > > I have posted the Julia code and the Fortran reference in this Gist
>>>> > > <https://gist.github.com/helgee/8019521>. The computationally
>>>> > > expensive part is contained in the *dopcore* or *dopcorevec*
>>>> > > function, respectively. What I really do not understand is why the
>>>> > > vectorized expressions seem to run faster, or rather, what I am
>>>> > > doing wrong here.
>>>> > >
>>>> > > Any ideas or suggestions? Many thanks in advance!
>>>> > > Helge
>>>>
>>>
>>>
>>
