Thanks so much for the tips. The culprit is the keyword argument
(xRat=0.). Declaring its type made the wrapped code twice as fast, but still
far slower than the inline code. Making it positional instead left the
wrapped code just a little slower than the inline code, a big improvement.
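
To make the three variants concrete, here is a minimal sketch (not the original demo code). The names update!, step_kw!, step_typed_kw!, step_pos! and run! are made up for illustration, and plain vectors stand in for the Tech type. The comments describe what was reported in this thread for Julia 0.5.0; how large the gap is, or whether there is one at all, will likely depend on the Julia version.

```julia
# Shared loop body, equivalent to the body of M_CPS in the demo code.
function update!(a::Vector{Float64}, c::Vector{Int}, i::Int, csign::Int, xRat::Float64)
    if csign == c[i]
        a[i] += 2.0 * xRat
    else
        a[i] = 0.0
    end
    nothing
end

# Variant 1: untyped keyword argument (the slow case reported in this thread).
step_kw!(a, c, i, csign; xRat=0.0) = update!(a, c, i, csign, xRat)

# Variant 2: keyword argument with a declared type (reported about twice as fast).
step_typed_kw!(a, c, i, csign; xRat::Float64=0.0) = update!(a, c, i, csign, xRat)

# Variant 3: positional argument (reported close to the inline version).
step_pos!(a, c, i, csign, xRat::Float64) = update!(a, c, i, csign, xRat)

# Hypothetical driver: applies step! for xRat = 1.0, 2.0, ..., 20.0 over all indices.
function run!(step!, a::Vector{Float64}, c::Vector{Int}, csign::Int)
    for xRat in 1.0:20.0
        for i in eachindex(a)
            step!(a, c, i, csign, xRat)
        end
    end
    a
end

c = [0, 1, 0, 1]
a_kw    = run!((a, c, i, s, x) -> step_kw!(a, c, i, s; xRat=x), zeros(4), c, 1)
a_typed = run!((a, c, i, s, x) -> step_typed_kw!(a, c, i, s; xRat=x), zeros(4), c, 1)
a_pos   = run!(step_pos!, zeros(4), c, 1)

# All three give the same values: entries with c[i] == 1 end at
# 2 * (1 + 2 + ... + 20) = 420.0, the others at 0.0.
println(a_pos)
```

Wrapping each run! call in @time (after one warm-up call) should show whether the allocation counts differ between the variants on a given Julia version.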
On Wednesday, September 28, 2016 at 2:50:40 PM UTC+8, Gunnar Farnebäck
wrote:
>
> It's normal that manually inlined code of this kind is faster than wrapped
> code unless the compiler manages to see the full inlining potential. In
> this case the huge memory allocations for the wrapped solutions indicate
> that it's nowhere near doing that at all. I doubt it will take you all the
> way, but start by modifying your inner M_CPS function to take only
> positional arguments, or by declaring the type of the keyword argument as
> suggested in the performance tips section of the manual.
>
> On Wednesday, 28 September 2016 at 06:29:37 UTC+2, K leo wrote:
>>
>> I tested a few different ways of wrapping functions. It looks like different
>> ways of wrapping have slightly different costs. But the most confusing part
>> for me is that putting everything inline is much faster than wrapping things
>> up. I would understand this in other languages, but I thought Julia
>> advocates simple wrapping. Can anyone help explain what is happening
>> below, and how I can do the most efficient wrapping in the demo code?
>>
>> Demo code is included below.
>>
>> julia> versioninfo()
>> Julia Version 0.5.0
>> Commit 3c9d753 (2016-09-19 18:14 UTC)
>> Platform Info:
>> System: Linux (x86_64-pc-linux-gnu)
>> CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
>> WORD_SIZE: 64
>> BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>> LAPACK: libopenblas64_
>> LIBM: libopenlibm
>> LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>>
>> julia> testFunc()
>> calling LoopCP (everything inline)
>> 0.097556 seconds (2.10 k allocations: 290.625 KB)
>> elapsed time (ns): 97555896
>> bytes allocated: 297600
>> pool allocs: 2100
>> [0.0,4200.0,0.0,0.0,4200.0,4200.0,4200.0,4200.0,0.0,4200.0,4200.0]
>>
>> calling LoopCP0 (slightly wrapped)
>> 4.173830 seconds (49.78 M allocations: 2.232 GB, 5.83% gc time)
>> elapsed time (ns): 4173830495
>> gc time (ns): 243516584
>> bytes allocated: 2396838538
>> pool allocs: 49783357
>> GC pauses: 104
>> full collections: 1
>> [4200.0,0.0,4200.0,4200.0,0.0,0.0,0.0,0.0,4200.0,0.0,0.0]
>>
>> calling LoopCP1 (wrapped one way)
>> 5.274723 seconds (59.59 M allocations: 2.378 GB, 3.62% gc time)
>> elapsed time (ns): 5274722983
>> gc time (ns): 191036337
>> bytes allocated: 2553752638
>> pool allocs: 59585834
>> GC pauses: 112
>> [8400.0,0.0,8400.0,8400.0,0.0,0.0,0.0,0.0,8400.0,0.0,0.0]
>>
>> calling LoopCP2 (wrapped another way)
>> 5.212895 seconds (59.58 M allocations: 2.378 GB, 3.60% gc time)
>> elapsed time (ns): 5212894550
>> gc time (ns): 187696529
>> bytes allocated: 2553577600
>> pool allocs: 59582100
>> GC pauses: 111
>> [0.0,8400.0,0.0,0.0,8400.0,8400.0,8400.0,8400.0,0.0,8400.0,8400.0]
>>
>> const dim=1000
>>
>> type Tech
>>     a::Array{Float64,1}
>>     c::Array{Int,1}
>>
>>     function Tech()
>>         this = new()
>>         this.a = zeros(Float64, dim)
>>         this.c = rand([0,1;], dim)
>>         this
>>     end
>> end
>>
>> function LoopCP(csign::Int, tech::Tech)
>>     for j=1:10
>>         for xRat in [1.:20.;]
>>             @inbounds for i = 1:dim
>>                 if csign == tech.c[i]
>>                     tech.a[i] += 2.*xRat
>>                 else
>>                     tech.a[i] = 0.
>>                 end
>>             end
>>         end #
>>     end
>>     nothing
>> end
>>
>> function M_CPS(i::Int, csign::Int, tech::Tech; xRat=0.)
>>     if csign == tech.c[i]
>>         tech.a[i] += 2.*xRat
>>     else
>>         tech.a[i] = 0.
>>     end
>>     nothing
>> end
>>
>> function LoopCP0(csign::Int, tech::Tech)
>>     for j=1:10
>>         for xRat in [1.:20.;]
>>             @inbounds for i = 1:dim
>>                 M_CPS(i, csign, tech, xRat=xRat)
>>             end
>>         end #
>>     end
>>     nothing
>> end
>>
>> function MoleculeWrapS(csign::Int, tech::Tech, molecule::Function, xRat=0.)
>>     @inbounds for i = 1:dim
>>         molecule(i, csign, tech; xRat=xRat)
>>     end
>>     nothing
>> end
>>
>> function LoopRunnerM1(csign::Int, tech::Tech, molecule::Function)
>>     for j=1:10
>>         for xRat in [1.:20.;]
>>             MoleculeWrapS(csign, tech, molecule, xRat)
>>         end #
>>     end
>>     nothing
>> end
>>
>> LoopCP1(csign::Int, tech::Tech) = LoopRunnerM1(csign, tech, M_CPS)
>>
>> WrapCPS(csign::Int, tech::Tech, xRat=0.) =
>>     MoleculeWrapS(csign, tech, M_CPS, xRat)
>>
>> function LoopRunnerM2(csign::Int, tech::Tech, loop::Function)
>>     for j=1:10
>>         for xRat in [1.:20.;]
>>             loop(csign, tech, xRat)
>>         end #
>>     end
>>     nothing
>> end
>>
>> LoopCP2(csign::Int, tech::Tech) = LoopRunnerM2(csign, tech, WrapCPS)
>>
>> function testFunc()
>>     tech = Tech()
>>     nloops = 100
>>
>>     println("calling LoopCP (everything inline)")
>>     tech.a = zeros(tech.a)
>>     @timev for i=1:nloops
>>         LoopCP(rand([0,1]), tech)
>>     end
>>     println(tech.a[10:20], "\n")
>>
>>     println("calling LoopCP0 (slightly wrapped)")
>>     tech.a = zeros(tech.a)
>>     @timev for i=1:nloops
>>         LoopCP0(rand([0,1]), tech)
>>     end
>>     println(tech.a[10:20], "\n")
>>
>>     println("calling LoopCP1 (wrapped one way)")
>>     tech.a = zeros(tech.a)
>>     @timev for i=1:nloops
>>         LoopCP1(rand([0,1]), tech)
>>     end
>>     println(tech.a[10:20], "\n")
>>
>>     println("calling LoopCP2 (wrapped another way)")
>>     tech.a = zeros(tech.a)
>>     @timev for i=1:nloops
>>         LoopCP2(rand([0,1]), tech)
>>     end
>>     println(tech.a[10:20], "\n")
>>
>>     nothing
>> end