Yep, this is a common one, too. Worthy of yet another new section in performance tips:
http://docs.julialang.org/en/latest/manual/performance-tips/#pre-allocating-outputs

--Tim
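For reference, the pattern that section describes boils down to the following minimal sketch (the names are illustrative only, not Helge's code): allocate the output once in the caller and let the callback fill it in place.

    # The callee fills a caller-owned buffer instead of allocating a new array;
    # by convention such functions end in ! and mutate their first argument.
    function fill_derivative!(dy::Vector{Float64}, y::Vector{Float64})
        for i in eachindex(y)
            dy[i] = 2 * y[i]          # toy right-hand side
        end
        return dy
    end

    y  = rand(6)
    dy = similar(y)                   # allocated once, reused on every call
    for _ in 1:1000
        fill_derivative!(dy, y)       # no per-call allocation in the loop
    end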
On Wednesday, December 18, 2013 03:57:29 PM Helge Eichhorn wrote:
> Thanks Ivar and Tim, that did the trick, and the performance is within
> reach of Fortran: *9.5e-5s*
>
> I will implement the missing functionality and do some cleanup so this
> can go into ODE.jl. Thanks for the help, everybody!
>
> 2013/12/18 Ivar Nesje <[email protected]>
>
> > I did not intend to suggest that it would be good for the interface to
> > hard-code the function. I just noted that some overhead vanishes. There
> > might be some metaprogramming tricks you can do to get your desired
> > interface back.
> >
> > The usual idea in Julia for returning an array without allocation is
> > the same as in C: you must allocate the array in the calling function
> > and modify it inside the function. In Julia the convention is that such
> > functions end in ! and modify their first argument.
> >
> > function gravity!(ret::Vector{Float64}, t::Float64,
> >                   y::Vector{Float64}, mu::Float64)
> >     r = norm(y[1:3])
> >     ret[1:3] = y[4:6]
> >     ret[4:6] = -mu*y[1:3]/r/r/r
> > end
> >
> > Unfortunately Julia does not support efficient slicing, so I was unsure
> > whether you could gain anything from this approach, because you use the
> > returned array on multiple occasions later.
> >
> > On Wednesday, 18 December 2013 15:09:18 UTC+1, Helge Eichhorn wrote:
> >> I have updated the Gist <https://gist.github.com/helgee/8019521>.
> >>
> >> 2013/12/18 Helge Eichhorn <[email protected]>
> >>
> >>> @Ivar: Using a regular function is only possible in this special
> >>> case, because generally the user must supply the callback function.
> >>>
> >>> @Tim: I think that I can also rule out type problems; at least the
> >>> output of code_typed is clean of Unions. Also, I forgot to mention
> >>> that I already did a lot of profiling.
> >>>
> >>> What did help was rewriting the callback function like so:
> >>>
> >>> function gravity1(t::Float64, y::Vector{Float64}, mu::Float64)
> >>>     x, y, z, vx, vy, vz = y
> >>>     r = sqrt(x*x + y*y + z*z)
> >>>     r3 = r*r*r
> >>>     [vx, vy, vz, -mu*x/r3, -mu*y/r3, -mu*z/r3]
> >>> end
> >>>
> >>> and especially reducing memory allocations by writing the
> >>> intermediary solutions in the core integrator to a single large work
> >>> array. This feels worse style-wise, since the function now has side
> >>> effects, but the runtime is down to *1.7e-4s*.
> >>>
> >>> According to the sampling profiler, the remaining bottleneck is the
> >>> allocation of the array that is returned by the callback function. I
> >>> was thinking about returning a tuple instead, but that seems rather
> >>> impractical because the data will be written to the work array
> >>> afterwards. Other ideas?
> >>>
> >>> 2013/12/18 Tim Holy <[email protected]>
> >>>
> >>>> Also, I just added a new section,
> >>>> http://docs.julialang.org/en/latest/manual/performance-tips/#tools,
> >>>> that advertises the available tools for helping you diagnose
> >>>> performance problems.
> >>>>
> >>>> Without taking the time to look at your code, I'll just add that
> >>>> whenever I see an orders-of-magnitude discrepancy between C/Fortran
> >>>> and Julia, my first instinct is to suspect a type problem. The fact
> >>>> that a vectorized version is a bit faster than one written with
> >>>> loops might also support this diagnosis.
> >>>>
> >>>> Best,
> >>>> --Tim
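The tools that section covers can be exercised roughly like this in current Julia (a generic sketch on a toy function, not the Gist code; the code_typed check mentioned in the thread corresponds to @code_warntype here):

    using Profile, InteractiveUtils

    toy(t, y) = [y[4], y[5], y[6], -y[1], -y[2], -y[3]]   # stand-in for the real callback

    y = rand(6)
    @time for _ in 1:10^4; toy(0.0, y); end   # reports bytes allocated as well as time
    @code_warntype toy(0.0, y)                # look for Union/Any in the inferred types
    Profile.clear()
    @profile for _ in 1:10^5; toy(0.0, y); end
    Profile.print()                           # shows where the samples landed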
> >>>> On Wednesday, December 18, 2013 03:26:11 AM Ivar Nesje wrote:
> >>>> > My first suggestion to anyone trying to write fast Julia programs
> >>>> > is to read
> >>>> > http://docs.julialang.org/en/latest/manual/performance-tips/;
> >>>> > those are all good tips that I do not think will become obsolete
> >>>> > when Julia improves. It seems to me that you know those points.
> >>>> >
> >>>> > I think you get an important hint from the fact that
> >>>> > devectorization does not matter. To me it seems like the current
> >>>> > bottleneck is that you use an anonymous function instead of a
> >>>> > regular function. When I replace "f(" by "gravity(" I get some
> >>>> > improvement, and then your devectorization attempts make a
> >>>> > significant difference. Further, you might want to try to reduce
> >>>> > the amount of memory allocated, but that seems to complicate your
> >>>> > code quite a lot.
> >>>> >
> >>>> > My improvements reduce the timing as follows for 1000 iterations:
> >>>> > ivar@Ivar-ubuntu:~/tmp$ julia doptest.jl
> >>>> > elapsed time: 0.878398771 seconds (513399840 bytes allocated)
> >>>> > ivar@Ivar-ubuntu:~/tmp$ julia dopitest.jl
> >>>> > elapsed time: 0.16916126 seconds (122423840 bytes allocated)
> >>>> >
> >>>> > On Wednesday, 18 December 2013 11:07:30 UTC+1, Helge Eichhorn wrote:
> >>>> > > Hi,
> >>>> > >
> >>>> > > I spent the last few days porting the well-known DOP853
> >>>> > > integrator <http://www.unige.ch/~hairer/software.html> to Julia.
> >>>> > > The process was quite smooth and I have implemented the core
> >>>> > > functionality. However, when I run my reference case, a
> >>>> > > numerical solution of the two-body problem, I get the following
> >>>> > > timings:
> >>>> > >
> >>>> > > - *Fortran* (gfortran 4.8.2, no optimizations): *~1.7e-5s*
> >>>> > > - *Julia* (master, looped): *~1.3e-3s*
> >>>> > > - *Julia* (master, vectorized): *~1e-3s (!)*
> >>>> > >
> >>>> > > I have posted the Julia code and the Fortran reference in this
> >>>> > > Gist <https://gist.github.com/helgee/8019521>. The
> >>>> > > computationally expensive part seems to be contained in the
> >>>> > > *dopcore* or *dopcorevec* function, respectively. What I really
> >>>> > > do not understand is why the vectorized expressions seem to run
> >>>> > > faster, or rather what I am doing wrong here.
> >>>> > >
> >>>> > > Any ideas or suggestions? Many thanks in advance!
> >>>> > > Helge
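On Ivar's point above that replacing "f(" by "gravity(" removes overhead, and his hint that there are tricks to get the generic interface back: one way is to pass the user-supplied callback as an argument of the integrator, so the compiler specializes the hot loop on the concrete function. A minimal sketch with made-up names and a deliberately trivial fixed-step update (in current Julia every function has its own type, so this specialization happens automatically):

    # The integrator takes the right-hand side as a parameter; Julia compiles a
    # specialized version of `integrate` for each concrete callback it is given.
    function integrate(f, y0::Vector{Float64}, t0::Float64, tf::Float64, h::Float64)
        y = copy(y0)
        t = t0
        while t < tf
            y .+= h .* f(t, y)     # fixed-step Euler, only to keep the sketch short
            t += h
        end
        return y
    end

    gravity(t, y) = [y[4], y[5], y[6], -y[1], -y[2], -y[3]]   # toy right-hand side

    integrate(gravity, [1.0, 0.0, 0.0, 0.0, 1.0, 0.0], 0.0, 1.0, 1e-3)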

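On Helge's remaining bottleneck, the array allocated by the callback on every call: combining Ivar's gravity! signature with the work-array idea removes it, because the integrator owns the stage buffers and the callback fills them in place. A sketch under those assumptions, with made-up names and a trivial step formula rather than the DOP853 update; explicit indexing also sidesteps the slice copies Ivar mentions:

    # In-place callback in the style of Ivar's gravity!, written with explicit
    # indices so neither y[1:3] nor the assignment into ret allocates temporaries.
    function gravity!(ret::Vector{Float64}, t::Float64, y::Vector{Float64}, mu::Float64)
        r  = sqrt(y[1]^2 + y[2]^2 + y[3]^2)
        r3 = r * r * r
        ret[1] = y[4]; ret[2] = y[5]; ret[3] = y[6]
        ret[4] = -mu * y[1] / r3
        ret[5] = -mu * y[2] / r3
        ret[6] = -mu * y[3] / r3
        return ret
    end

    # The integrator owns the stage buffer and reuses it on every step,
    # so the hot loop allocates nothing.
    function step!(ynew::Vector{Float64}, k1::Vector{Float64},
                   t::Float64, y::Vector{Float64}, h::Float64, mu::Float64)
        gravity!(k1, t, y, mu)            # k1 is filled in place
        for i in eachindex(y)
            ynew[i] = y[i] + h * k1[i]    # Euler step, just to keep the sketch short
        end
        return ynew
    end

    y0 = [7000.0, 0.0, 0.0, 0.0, 7.5, 0.0]
    k1, ynew = similar(y0), similar(y0)
    step!(ynew, k1, 0.0, y0, 1.0, 398600.0)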