Yep, this is a common one, too. Worthy of yet another new section in 
performance tips:

http://docs.julialang.org/en/latest/manual/performance-tips/#pre-allocating-outputs
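A minimal sketch of that tip, with a made-up derivative function (not from the Gist): the caller allocates the output once and the callee fills it in place, so the hot loop allocates nothing.

```julia
# Hypothetical illustration of the pre-allocation tip: the caller owns the
# output buffer and the callee mutates it, following the `!` convention.
function double!(ret::Vector{Float64}, y::Vector{Float64})
    for i in 1:length(y)
        ret[i] = 2.0 * y[i]    # placeholder computation
    end
    return ret
end

y   = [1.0, 2.0, 3.0]
ret = similar(y)               # allocated once, outside the hot loop
for _ in 1:1000                # repeated calls reuse the same buffer
    double!(ret, y)
end
```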

--Tim

On Wednesday, December 18, 2013 03:57:29 PM Helge Eichhorn wrote:
> Thanks Ivar and Tim, that did the trick, and the performance is within
> reach of Fortran:
> 
> *9.5e-5s*
> 
> I will implement the missing functionality and do some cleanup so this can
> go into ODE.jl. Thanks for the help, everybody!
> 
> 
> 2013/12/18 Ivar Nesje <[email protected]>
> 
> > I did not intend to suggest that it would be good for the interface to
> > hard-code the function. I just noted that some of the overhead vanishes.
> > There might be some metaprogramming tricks you can do to get your desired
> > interface back.
> > 
> > The usual idea in Julia for returning an array without allocating is the
> > same as in C: you must allocate the array in the calling function and
> > modify it inside the function. In Julia the convention is that such
> > functions end in ! and modify their first argument.
> > 
> > 
> > function gravity!(ret::Vector{Float64}, t::Float64, y::Vector{Float64},
> >                   mu::Float64)
> >     r = norm(y[1:3])
> >     ret[1:3] = y[4:6]
> >     ret[4:6] = -mu*y[1:3]/r/r/r
> > end
> > 
> > 
> > Unfortunately Julia does not support efficient slicing, so I was unsure
> > whether you could gain anything from this approach, because you use the
> > returned array on multiple occasions later.
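> > The slices could also be replaced by explicit loops. A sketch, assuming
> > the same state layout as the gravity! example above (position in y[1:3],
> > velocity in y[4:6]); gravity_devec! is a made-up name:

```julia
# Devectorized, in-place variant of the gravity! example; the explicit loop
# avoids the temporary arrays that slices such as y[1:3] create.
function gravity_devec!(ret::Vector{Float64}, t::Float64,
                        y::Vector{Float64}, mu::Float64)
    r  = sqrt(y[1]*y[1] + y[2]*y[2] + y[3]*y[3])
    r3 = r * r * r
    for i in 1:3
        ret[i]   = y[i+3]           # position derivative = velocity
        ret[i+3] = -mu * y[i] / r3  # velocity derivative = acceleration
    end
    return ret
end

ret = zeros(6)
gravity_devec!(ret, 0.0, [1.0, 0.0, 0.0, 0.0, 1.0, 0.0], 1.0)
```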
> > 
> > On Wednesday, 18 December 2013 at 15:09:18 UTC+1, Helge Eichhorn wrote:
> >> I have updated the Gist <https://gist.github.com/helgee/8019521>.
> >> 
> >> 
> >> 2013/12/18 Helge Eichhorn <[email protected]>
> >> 
> >>> @Ivar: Using a regular function is only possible in this special case,
> >>> because generally the user must supply the callback function.
> >>> 
> >>> @Tim: I think that I can also rule out type problems; at least the
> >>> output of code_typed is clean of Unions. I also forgot to mention that
> >>> I already did a lot of profiling.
> >>> 
> >>> What did help was re-writing the callback function like so:
> >>> 
> >>> function gravity1(t::Float64, y::Vector{Float64}, mu::Float64)
> >>>     x, y, z, vx, vy, vz = y
> >>>     r = sqrt(x*x+y*y+z*z)
> >>>     r3 = r*r*r
> >>>     [vx, vy, vz, -mu*x/r3, -mu*y/r3, -mu*z/r3]
> >>> end
> >>> 
> >>> and especially reducing memory allocations by writing the intermediary
> >>> solutions in the core integrator to a single large work array. This
> >>> feels worse style-wise since the function now has side effects, but the
> >>> runtime is down to *1.7e-4s*.
> >>> 
> >>> According to the sampling profiler the remaining bottleneck is the
> >>> allocation of the array that is returned by the callback function. I was
> >>> thinking about returning a tuple instead, but that seems rather
> >>> impractical because the data will be written to the work array
> >>> afterwards. Other ideas?
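> >>> One possibility, sketched under the assumption that the work array is a
> >>> matrix with one column per stage (gravity_col! and the 6x12 shape are
> >>> made up for illustration): pass a view of the target column to an
> >>> in-place callback, so the per-call allocation disappears entirely.

```julia
# Hypothetical sketch: the callback writes straight into one column of the
# integrator's work array instead of returning a freshly allocated vector.
function gravity_col!(out::AbstractVector{Float64}, t::Float64,
                      y::AbstractVector{Float64}, mu::Float64)
    r3 = sqrt(y[1]^2 + y[2]^2 + y[3]^2)^3
    for i in 1:3
        out[i]   = y[i+3]           # dx/dt = v
        out[i+3] = -mu * y[i] / r3  # dv/dt = -mu*x/r^3
    end
    return out
end

work = zeros(6, 12)                          # one column per RK stage (made up)
y0   = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
gravity_col!(view(work, :, 1), 0.0, y0, 1.0) # stage 1, written in place
```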
> >>> 
> >>> 
> >>> 2013/12/18 Tim Holy <[email protected]>
> >>> 
> >>>> Also, I just added a new section
> >>>> http://docs.julialang.org/en/latest/manual/performance-tips/#tools
> >>>> that advertises the available tools for helping you diagnose
> >>>> performance
> >>>> problems.
> >>>> 
> >>>> Without taking the time to look at your code, I'll just add that
> >>>> whenever I
> >>>> see an orders-of-magnitude discrepancy between C/Fortran and Julia, my
> >>>> first
> >>>> instinct is to suspect a type problem. The fact that a vectorized
> >>>> version is a
> >>>> bit faster than one written with loops might also support this
> >>>> diagnosis.
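> >>>> A quick way to check for such a type problem (with toy functions, not
> >>>> from the Gist): inspect the inferred return types and look for Union
> >>>> or Any.

```julia
# Toy illustration of diagnosing type instability: `unstable` returns either
# an Int or a Float64 depending on a runtime value, so inference produces a
# Union type; `stable` always returns Float64.
unstable(flag::Bool) = flag ? 1 : 1.0
stable(flag::Bool)   = flag ? 1.0 : 2.0

rt_unstable = Base.return_types(unstable, (Bool,))[1]
rt_stable   = Base.return_types(stable, (Bool,))[1]

println(rt_unstable)   # a Union of Int and Float64 -> potential slowdown
println(rt_stable)     # a single concrete type -> inference is happy
```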
> >>>> 
> >>>> Best,
> >>>> --Tim
> >>>> 
> >>>> On Wednesday, December 18, 2013 03:26:11 AM Ivar Nesje wrote:
> >>>> > My first suggestion to anyone trying to write fast Julia programs is
> >>>> > to read http://docs.julialang.org/en/latest/manual/performance-tips/;
> >>>> > those are all good tips that I do not think will become obsolete when
> >>>> > Julia improves. It seems to me that you already know those points.
> >>>> > 
> >>>> > I think you get an important hint from the fact that devectorization
> >>>> > does not matter. To me it seems like the current bottleneck is that
> >>>> > you use an anonymous function instead of a regular function. When I
> >>>> > replace "f(" by "gravity(" I get some improvement, and then your
> >>>> > devectorization attempts make a significant difference. Further, you
> >>>> > might want to try to reduce the amount of memory allocated, but that
> >>>> > seems to complicate your code quite a lot.
> >>>> > 
> >>>> > My improvements reduce the timings as follows for 1000 iterations:
> >>>> > ivar@Ivar-ubuntu:~/tmp$ julia doptest.jl
> >>>> > elapsed time: 0.878398771 seconds (513399840 bytes allocated)
> >>>> > ivar@Ivar-ubuntu:~/tmp$ julia dopitest.jl
> >>>> > elapsed time: 0.16916126 seconds (122423840 bytes allocated)
> >>>> > 
> >>>> > On Wednesday, 18 December 2013 at 11:07:30 UTC+1, Helge Eichhorn
> >>>> > wrote:
> >>>> > > Hi,
> >>>> > > 
> >>>> > > I spent the last few days porting the well-known DOP853
> >>>> > > <http://www.unige.ch/~hairer/software.html> integrator to Julia.
> >>>> > > The process was quite smooth and I have implemented the core
> >>>> > > functionality. However, when I run my reference case, a numerical
> >>>> > > solution of the two-body problem, I get the following timings:
> >>>> > >    - *Fortran* (gfortran 4.8.2, no optimizations): *~1.7e-5s*
> >>>> > >    - *Julia* (master, looped): *~1.3e-3s*
> >>>> > >    - *Julia* (master, vectorized): *~1e-3s (!)*
> >>>> > > 
> >>>> > > I have posted the Julia code and the Fortran reference in this
> >>>> > > Gist <https://gist.github.com/helgee/8019521>. The computationally
> >>>> > > expensive part is contained in the *dopcore* or *dopcorevec*
> >>>> > > function, respectively. What I really do not understand is why the
> >>>> > > vectorized expressions seem to run faster, or rather what I am
> >>>> > > doing wrong here.
> >>>> > > 
> >>>> > > Any ideas or suggestions? Many thanks in advance!
> >>>> > > Helge
