Kristoffer, your example still uses a *non-const* global. Using this
benchmark <https://gist.github.com/tlycken/f0dc12b7c211f73f6cc1>, I get the
following results:
Non-const global
0.017629 seconds (999.84 k allocations: 15.256 MB)
Const global
0.000002 seconds (6 allocations: 192 bytes)
Closure
0.000008 seconds (19 allocations: 928 bytes)
So yes, a closure is much better than using a non-const global, but using a
const global still beats it. However, both are fast enough that it probably
won’t be the bottleneck in your application anymore :)
Although I can by no means fully read and understand them, it is also
instructive to look at the dumps from @code_llvm to get a picture of the
difference in code complexity:
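For reference, the const-global variant in the gist is presumably along these lines (a sketch; the actual gist may differ in details):

```
const a_const = 3

function f_const_glob(N)
    s = 0
    for i = 1:N
        s += a_const   # a_const has a known, fixed type, so the loop is type stable
    end
    return s
end
```

Since the compiler knows `a_const` is always an `Int`, LLVM can reduce the entire loop to a single multiplication, which is exactly what shows up in the dump as `%.op = mul i64 %0, 3`.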
Const global:
julia> @code_llvm f_const_glob(10^5)
define i64 @julia_f_const_glob_2617(i64) {
top:
%1 = icmp slt i64 %0, 1
br i1 %1, label %L3, label %L.preheader
L.preheader: ; preds = %top
%2 = icmp sgt i64 %0, 0
%.op = mul i64 %0, 3
%3 = select i1 %2, i64 %.op, i64 0
br label %L3
L3: ; preds = %L.preheader, %top
%s.1 = phi i64 [ 0, %top ], [ %3, %L.preheader ]
ret i64 %s.1
}
Closure:
julia> @code_llvm f_clos(10^5, 3)
define %jl_value_t* @julia_f_clos_2616(i64, i64) {
top:
%2 = alloca [5 x %jl_value_t*], align 8
%.sub = getelementptr inbounds [5 x %jl_value_t*]* %2, i64 0, i64 0
%3 = getelementptr [5 x %jl_value_t*]* %2, i64 0, i64 2
%4 = getelementptr [5 x %jl_value_t*]* %2, i64 0, i64 3
store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8
%5 = getelementptr [5 x %jl_value_t*]* %2, i64 0, i64 1
%6 = load %jl_value_t*** @jl_pgcstack, align 8
%.c = bitcast %jl_value_t** %6 to %jl_value_t*
store %jl_value_t* %.c, %jl_value_t** %5, align 8
store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
store %jl_value_t* null, %jl_value_t** %3, align 8
store %jl_value_t* null, %jl_value_t** %4, align 8
%7 = getelementptr [5 x %jl_value_t*]* %2, i64 0, i64 4
store %jl_value_t* null, %jl_value_t** %7, align 8
%8 = load %jl_value_t** inttoptr (i64 2147534600 to %jl_value_t**), align 8
store %jl_value_t* %8, %jl_value_t** %4, align 8
%9 = load %jl_value_t** inttoptr (i64 2164502072 to %jl_value_t**), align 8
store %jl_value_t* %9, %jl_value_t** %7, align 8
%10 = call %jl_value_t* @jl_f_instantiate_type(%jl_value_t* null, %jl_value_t** %4, i32 2)
store %jl_value_t* %10, %jl_value_t** %4, align 8
%11 = call %jl_value_t* @jl_f_svec(%jl_value_t* null, %jl_value_t** null, i32 0)
store %jl_value_t* %11, %jl_value_t** %7, align 8
%12 = call %jl_value_t* @jl_f_svec(%jl_value_t* null, %jl_value_t** %4, i32 2)
store %jl_value_t* %12, %jl_value_t** %4, align 8
%13 = call %jl_value_t* @jl_box_int64(i64 signext %1)
store %jl_value_t* %13, %jl_value_t** %7, align 8
%14 = call %jl_value_t* (i64, ...)* @jl_svec(i64 1, %jl_value_t* %13)
store %jl_value_t* %14, %jl_value_t** %7, align 8
%15 = call %jl_value_t* @jl_new_closure(i8* null, %jl_value_t* %14, %jl_value_t* inttoptr (i64 2221717744 to %jl_value_t*))
store %jl_value_t* %15, %jl_value_t** %7, align 8
%16 = load %jl_value_t** @jl_false, align 8
%17 = call %jl_value_t* @jl_method_def(%jl_value_t* inttoptr (i64 507171808 to %jl_value_t*), %jl_value_t** %3, %jl_value_t* null, %jl_value_t* null, %jl_value_t* %12, %jl_value_t* %15, %jl_value_t* %16, %jl_value_t* inttoptr (i64 2179532144 to %jl_value_t*), i32 0)
%18 = load %jl_value_t** %3, align 8
%19 = bitcast %jl_value_t* %18 to %jl_value_t* (%jl_value_t*, %jl_value_t**, i32)**
%20 = load %jl_value_t* (%jl_value_t*, %jl_value_t**, i32)** %19, align 8
%21 = call %jl_value_t* @jl_box_int64(i64 signext %0)
store %jl_value_t* %21, %jl_value_t** %4, align 8
%22 = call %jl_value_t* %20(%jl_value_t* %18, %jl_value_t** %4, i32 1)
%23 = load %jl_value_t** %5, align 8
%24 = getelementptr inbounds %jl_value_t* %23, i64 0, i32 0
store %jl_value_t** %24, %jl_value_t*** @jl_pgcstack, align 8
ret %jl_value_t* %22
}
On Tuesday, September 29, 2015 at 9:18:29 AM UTC+2, Kristoffer Carlsson
wrote:
In my experience a closure is much faster than a function that accesses
> global variables even if it is passed to another function as an argument.
>
> Two different examples. Here we pass a function that accesses (non const)
> globals to another function:
>
> a = 3
>
> g(f::Function, b, N) = f(b, N)
>
> function f_glob(b, N)
> s = 0
> for i = 1:N
> s += a
> end
> return s
> end
>
> @time g(f_glob, 3, 10^6)
>
> 3000000
>
> 0.020655 seconds (999.84 k allocations: 15.256 MB, 18.40% gc time)
>
>
> Here we instead use an outer function that generates the desired closure
> and pass the closure as the function argument:
>
> function f_clos(b, N, a)
> function f_inner(b, N)
> s = 0
> for i = 1:N
> s += a
> end
> return s
> end
>
> g(f_inner, b, N)
> end
>
> @time f_clos(3, 10^6, 2)
>
> 2000000
>
> 0.000021 seconds (19 allocations: 928 bytes)
>
>
>
>
> On Tuesday, September 29, 2015 at 8:47:06 AM UTC+2, Tomas Lycken wrote:
>>
>> The performance penalty from global variables comes from the fact that
>> non-const global variables are type unstable, which means that the compiler
>> has to generate very defensive, and thus slow, code. The same thing is, for
>> now, true also for anonymous functions and functions passed as parameters
>> to other functions, so switching one paradigm for the other will
>> unfortunately not help much for performance. I haven't looked closely at
>> your code, but the [FastAnonymous.jl](
>> https://github.com/timholy/FastAnonymous.jl) package can probably help
>> with making the anonymous function version faster.
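>> For illustration, usage would presumably look something like this (`@anon`
>> is the macro from that package's README; treat this as an untested sketch):
>>
>> ```
>> using FastAnonymous
>>
>> a = 3
>> squareplus = @anon x -> x^2 + a  # the current value of `a` gets baked in
>> map(squareplus, [1, 2, 3])       # usable anywhere a function is expected
>> ```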
>>
>> If the data you want to pass along doesn't change, using *const* globals
>> is fine (at least from a performance perspective, you might have other
>> reasons to avoid them too...), i.e. you can do the following and it will
>> also be faster:
>>
>> ```
>> const data = 3
>>
>> function model(x)
>> # use data somehow
>> end
>>
>> # use the model function in your optimization
>> ```
>>
>> If you need the data to change, you can still use a `const` global
>> variable, but set it to some mutable (and type-stable!) object, for example
>>
>> ```
>> type MyData
>> data::Int
>> end
>>
>> const data = MyData(3)
>>
>> function model(x)
>> data.data += 1
>> # ...
>> end
>>
>> # use the model function in your optimization
>> # it can now modify the global data on each invocation
>> ```
>>
>> This doesn't necessarily provide the cleanest interface, but it should be
>> more performant than your current solution.
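>> Applied to your setting, the idea would be roughly as follows (hypothetical
>> names, untested sketch; the concretely typed field is what keeps `LogLike`
>> type stable):
>>
>> ```
>> type SubDataHolder
>>     v::Vector{Float64}
>> end
>>
>> const subdata = SubDataHolder(Float64[])
>>
>> function LogLike(parms)
>>     L = pdf(Normal(parms[1], exp(parms[2])), subdata.v)
>>     return -sum(log(L))
>> end
>>
>> # inside the per-subject loop:
>> # subdata.v = data1[idx, 2]
>> # optimize(LogLike, parms0, method=:nelder_mead)
>> ```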
>>
>> // T
>>
>>
>> On Monday, September 28, 2015 at 6:53:22 PM UTC+2, Christopher Fisher
>> wrote:
>>>
>>> Thanks for your willingness to investigate the matter more closely. I
>>> cannot post the exact code I am using (besides, it's rather long). However, I
>>> posted a toy example that follows the same basic operations. Essentially,
>>> my code involves an outer function (SubLoop) that loops through a data set
>>> with multiple subjects. The model is fit to each subject's data. The other
>>> function (LogLike) computes the log likelihood and is called by optimize.
>>> The first set of code corresponds to the closure method and the second set
>>> of code corresponds to the global variable method. In both cases, the code
>>> executed in about 0.85 seconds over several runs on my computer, with
>>> about 1.9% gc time. I'm also aware that my code is probably not optimized
>>> in other regards, so I would be receptive to any other advice you might
>>> have.
>>>
>>>
>>>
>>>
>>> using Distributions,Optim
>>>
>>> function SubLoop1(data1)
>>>
>>> function LogLike1(parms)
>>>
>>> L = pdf(Normal(parms[1],exp(parms[2])),SubData)
>>>
>>> LL = -sum(log(L))
>>>
>>> end
>>>
>>> #Number of Subjects
>>>
>>> Nsub = size(unique(data1[:,1],1),1)
>>>
>>> #Initialize per subject Data
>>>
>>> SubData = []
>>>
>>> for i = 1:Nsub
>>>
>>> idx = data1[:,1] .== i
>>>
>>> SubData = data1[idx,2]
>>>
>>> parms0 = [1.0;1.0]
>>>
>>> optimize(LogLike1,parms0,method=:nelder_mead)
>>>
>>> end
>>>
>>> end
>>>
>>>
>>>
>>> N = 10^5
>>>
>>> #Column 1 subject index, column 2 value
>>>
>>> Data = zeros(N*2,2)
>>>
>>> for sub = 1:2
>>>
>>> Data[(N*(sub-1)+1):(N*sub),:] = [sub*ones(N) rand(Normal(10,2),N)]
>>>
>>> end
>>>
>>> @time SubLoop1(Data)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> using Distributions, Optim
>>>
>>> function SubLoop2(data1)
>>>
>>> global SubData
>>>
>>> #Number of subjects
>>>
>>> Nsub = size(unique(data1[:,1],1),1)
>>>
>>> #Initialize per subject data
>>>
>>> SubData = []
>>>
>>> for i = 1:Nsub
>>>
>>> idx = data1[:,1] .== i
>>>
>>> SubData = data1[idx,2]
>>>
>>> parms0 = [1.0;1.0]
>>>
>>> optimize(LogLike2,parms0,method=:nelder_mead)
>>>
>>> end
>>>
>>> end
>>>
>>>
>>>
>>> function LogLike2(parms)
>>>
>>> L = pdf(Normal(parms[1],exp(parms[2])),SubData)
>>>
>>> LL = -sum(log(L))
>>>
>>> end
>>>
>>>
>>>
>>> N = 10^5
>>>
>>> #Column 1 subject index, column 2 value
>>>
>>> Data = zeros(N*2,2)
>>>
>>> for sub = 1:2
>>>
>>> Data[(N*(sub-1)+1):(N*sub),:] = [sub*ones(N) rand(Normal(10,2),N)]
>>>
>>> end
>>>
>>> @time SubLoop2(Data)
>>>
>>>
>>>
>>> On Monday, September 28, 2015 at 11:24:13 AM UTC-4, Kristoffer Carlsson
>>> wrote:
>>>>
>>>> From only that comment alone it is hard to give any further advice.
>>>>
>>>> What overhead are you seeing?
>>>>
>>>> Posting runnable code is the best way to get help.
>>>>
>>>>