If you're using Julia 0.4, you can also use call overloading, which is almost as convenient as closures and as fast as const globals. An extension of Tomas's benchmark <https://gist.github.com/afniedermayer/58ca3ba2c418617d85b4> gives me this:

Non-const global    0.027968 seconds (999.84 k allocations: 15.256 MB)
Const global        0.000003 seconds (6 allocations: 192 bytes)
Closure             0.000015 seconds (19 allocations: 928 bytes)
Call overloading    0.000003 seconds (6 allocations: 192 bytes)

I'm not sure whether the current versions of Optim.jl and other libraries already support callbacks with call overloading.
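For readers who haven't seen the pattern, here is a minimal sketch of what call overloading looks like on Julia 0.4. This is an illustration in the spirit of the benchmark, not the code from the gist; `AddConst` and `g` are made-up names mirroring Kristoffer's example further down the thread:

```
# The "captured" value lives in a concretely typed field.
immutable AddConst
    a::Int
end

# Julia 0.4 call overloading: instances of AddConst become callable.
# (On Julia 0.5+ this would be written as (f::AddConst)(b, N) = ...)
function Base.call(f::AddConst, b, N)
    s = 0
    for i = 1:N
        s += f.a  # f.a has a known concrete type, so the loop stays tight
    end
    return s
end

g(f, b, N) = f(b, N)  # no ::Function annotation, so any callable works

@time g(AddConst(3), 3, 10^6)
```

Because `g` specializes on the concrete type `AddConst`, the call compiles down to the same loop as the const-global version, which is why those two rows above show identical timings.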
On Tuesday, September 29, 2015 at 11:11:42 AM UTC+2, Tomas Lycken wrote:
>
> Kristoffer, your example still uses a *non-const* global. Using this
> benchmark <https://gist.github.com/tlycken/f0dc12b7c211f73f6cc1>, I get
> the following results:
>
> Non-const global
>   0.017629 seconds (999.84 k allocations: 15.256 MB)
> Const global
>   0.000002 seconds (6 allocations: 192 bytes)
> Closure
>   0.000008 seconds (19 allocations: 928 bytes)
>
> So yes, a closure is much better than using a non-const global, but using
> a const global still beats it. However, both are fast enough that it
> probably won't be the bottleneck in your application anymore :)
>
> Although I can by no means read and understand it, it is also instructive
> to look at the dumps from @code_llvm to get a picture of the difference
> in code complexity:
>
> Const global:
>
> julia> @code_llvm f_const_glob(10^5)
>
> define i64 @julia_f_const_glob_2617(i64) {
> top:
>   %1 = icmp slt i64 %0, 1
>   br i1 %1, label %L3, label %L.preheader
>
> L.preheader:                          ; preds = %top
>   %2 = icmp sgt i64 %0, 0
>   %.op = mul i64 %0, 3
>   %3 = select i1 %2, i64 %.op, i64 0
>   br label %L3
>
> L3:                                   ; preds = %L.preheader, %top
>   %s.1 = phi i64 [ 0, %top ], [ %3, %L.preheader ]
>   ret i64 %s.1
> }
>
> Closure:
>
> julia> @code_llvm f_clos(10^5, 3)
>
> define %jl_value_t* @julia_f_clos_2616(i64, i64) {
> top:
>   %2 = alloca [5 x %jl_value_t*], align 8
>   %.sub = getelementptr inbounds [5 x %jl_value_t*]* %2, i64 0, i64 0
>   %3 = getelementptr [5 x %jl_value_t*]* %2, i64 0, i64 2
>   %4 = getelementptr [5 x %jl_value_t*]* %2, i64 0, i64 3
>   store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8
>   %5 = getelementptr [5 x %jl_value_t*]* %2, i64 0, i64 1
>   %6 = load %jl_value_t*** @jl_pgcstack, align 8
>   %.c = bitcast %jl_value_t** %6 to %jl_value_t*
>   store %jl_value_t* %.c, %jl_value_t** %5, align 8
>   store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
>   store %jl_value_t* null, %jl_value_t** %3, align 8
>   store %jl_value_t* null, %jl_value_t** %4, align 8
>   %7 = getelementptr [5 x %jl_value_t*]* %2, i64 0, i64 4
>   store %jl_value_t* null, %jl_value_t** %7, align 8
>   %8 = load %jl_value_t** inttoptr (i64 2147534600 to %jl_value_t**), align 8
>   store %jl_value_t* %8, %jl_value_t** %4, align 8
>   %9 = load %jl_value_t** inttoptr (i64 2164502072 to %jl_value_t**), align 8
>   store %jl_value_t* %9, %jl_value_t** %7, align 8
>   %10 = call %jl_value_t* @jl_f_instantiate_type(%jl_value_t* null, %jl_value_t** %4, i32 2)
>   store %jl_value_t* %10, %jl_value_t** %4, align 8
>   %11 = call %jl_value_t* @jl_f_svec(%jl_value_t* null, %jl_value_t** null, i32 0)
>   store %jl_value_t* %11, %jl_value_t** %7, align 8
>   %12 = call %jl_value_t* @jl_f_svec(%jl_value_t* null, %jl_value_t** %4, i32 2)
>   store %jl_value_t* %12, %jl_value_t** %4, align 8
>   %13 = call %jl_value_t* @jl_box_int64(i64 signext %1)
>   store %jl_value_t* %13, %jl_value_t** %7, align 8
>   %14 = call %jl_value_t* (i64, ...)* @jl_svec(i64 1, %jl_value_t* %13)
>   store %jl_value_t* %14, %jl_value_t** %7, align 8
>   %15 = call %jl_value_t* @jl_new_closure(i8* null, %jl_value_t* %14, %jl_value_t* inttoptr (i64 2221717744 to %jl_value_t*))
>   store %jl_value_t* %15, %jl_value_t** %7, align 8
>   %16 = load %jl_value_t** @jl_false, align 8
>   %17 = call %jl_value_t* @jl_method_def(%jl_value_t* inttoptr (i64 507171808 to %jl_value_t*), %jl_value_t** %3, %jl_value_t* null, %jl_value_t* null, %jl_value_t* %12, %jl_value_t* %15, %jl_value_t* %16, %jl_value_t* inttoptr (i64 2179532144 to %jl_value_t*), i32 0)
>   %18 = load %jl_value_t** %3, align 8
>   %19 = bitcast %jl_value_t* %18 to %jl_value_t* (%jl_value_t*, %jl_value_t**, i32)**
>   %20 = load %jl_value_t* (%jl_value_t*, %jl_value_t**, i32)** %19, align 8
>   %21 = call %jl_value_t* @jl_box_int64(i64 signext %0)
>   store %jl_value_t* %21, %jl_value_t** %4, align 8
>   %22 = call %jl_value_t* %20(%jl_value_t* %18, %jl_value_t** %4, i32 1)
>   %23 = load %jl_value_t** %5, align 8
>   %24 = getelementptr inbounds %jl_value_t* %23, i64 0, i32 0
>   store %jl_value_t** %24, %jl_value_t*** @jl_pgcstack, align 8
>   ret %jl_value_t* %22
> }
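(An aside on reading the dumps above: the `jl_box_int64`, `jl_new_closure`, and `jl_method_def` calls in the closure version are boxing and runtime method definition, exactly the defensive code that type instability forces. On 0.4 you can spot the same problem without reading IR; a quick sketch, assuming the `f_const_glob` and `f_clos` definitions from Tomas's gist:)

```
# `Any` annotations in @code_warntype output mark the type-unstable
# spots that force the defensive code seen in the LLVM dumps.
@code_warntype f_const_glob(10^5)
@code_warntype f_clos(10^5, 3)
```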
On Tuesday, September 29, 2015 at 9:18:29 AM UTC+2, Kristoffer Carlsson wrote:
>>
>> In my experience a closure is much faster than a function that accesses
>> global variables, even if it is passed to another function as an argument.
>>
>> Two different examples. Here we pass a function that accesses (non-const)
>> globals to another function:
>>
>> a = 3
>>
>> g(f::Function, b, N) = f(b, N)
>>
>> function f_glob(b, N)
>>     s = 0
>>     for i = 1:N
>>         s += a
>>     end
>>     return s
>> end
>>
>> @time g(f_glob, 3, 10^6)
>> 3000000
>>   0.020655 seconds (999.84 k allocations: 15.256 MB, 18.40% gc time)
>>
>> Here we instead use an outer function that generates the desired closure
>> and passes the closure as the function argument:
>>
>> function f_clos(b, N, a)
>>     function f_inner(b, N)
>>         s = 0
>>         for i = 1:N
>>             s += a
>>         end
>>         return s
>>     end
>>     g(f_inner, b, N)
>> end
>>
>> @time f_clos(3, 10^6, 2)
>> 2000000
>>   0.000021 seconds (19 allocations: 928 bytes)
>>
>> On Tuesday, September 29, 2015 at 8:47:06 AM UTC+2, Tomas Lycken wrote:
>>>
>>> The performance penalty from global variables comes from the fact that
>>> non-const global variables are type-unstable, which means that the
>>> compiler has to generate very defensive, and thus slow, code. The same
>>> thing is, for now, true also for anonymous functions and functions
>>> passed as parameters to other functions, so switching one paradigm for
>>> the other will unfortunately not help much for performance. I haven't
>>> looked closely at your code, but the
>>> [FastAnonymous.jl](https://github.com/timholy/FastAnonymous.jl) package
>>> can probably help with making the anonymous-function version faster.
>>>
>>> If the data you want to pass along doesn't change, using *const* globals
>>> is fine (at least from a performance perspective; you might have other
>>> reasons to avoid them too...), i.e. you can do the following and it will
>>> also be fast:
>>>
>>> ```
>>> const data = 3
>>>
>>> function model(x)
>>>     # use data somehow
>>> end
>>>
>>> # use the model function in your optimization
>>> ```
>>>
>>> If you need the data to change, you can still use a `const` global
>>> variable, but set it to some mutable (and type-stable!) object, for
>>> example
>>>
>>> ```
>>> type MyData
>>>     data::Int
>>> end
>>>
>>> const data = MyData(3)
>>>
>>> function model(x)
>>>     data.data += 1
>>>     # ...
>>> end
>>>
>>> # use the model function in your optimization
>>> # it can now modify the global data on each invocation
>>> ```
>>>
>>> This doesn't necessarily provide the cleanest interface, but it should
>>> be more performant than your current solution.
>>>
>>> // T
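(FastAnonymous.jl, mentioned above but not shown, is built on the same kind of trick. A minimal sketch of its documented usage on Julia 0.4; `apply_all` is a hypothetical stand-in for a library function such as `optimize` that calls its argument repeatedly:)

```
using FastAnonymous  # Pkg.add("FastAnonymous") on Julia 0.4

f = @anon x -> x^2 + 2  # @anon wraps the body in its own callable type

apply_all(f, xs) = [f(x) for x in xs]  # hypothetical caller
apply_all(f, [1.0, 2.0, 3.0])          # => [3.0, 6.0, 11.0]
```

Because each `@anon` function gets a unique concrete type, `apply_all` compiles a specialized method for it, avoiding the generic-`Function` penalty Tomas describes.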
>>> On Monday, September 28, 2015 at 6:53:22 PM UTC+2, Christopher Fisher wrote:
>>>>
>>>> Thanks for your willingness to investigate the matter more closely. I
>>>> cannot post the exact code I am using (besides, it's rather long).
>>>> However, I posted a toy example that follows the same basic operations.
>>>> Essentially, my code involves an outer function (SubLoop) that loops
>>>> through a data set with multiple subjects. The model is fit to each
>>>> subject's data. The other function (LogLike) computes the log
>>>> likelihood and is called by optimize. The first set of code corresponds
>>>> to the closure method and the second set of code corresponds to the
>>>> global-variable method. In both cases, the code executed in about 0.85
>>>> seconds over several runs on my computer, with about 1.9% gc time. I'm
>>>> also aware that my code is probably not optimized in other regards, so
>>>> I would be receptive to any other advice you might have.
>>>>
>>>> using Distributions, Optim
>>>>
>>>> function SubLoop1(data1)
>>>>     function LogLike1(parms)
>>>>         L = pdf(Normal(parms[1], exp(parms[2])), SubData)
>>>>         LL = -sum(log(L))
>>>>     end
>>>>     # Number of subjects
>>>>     Nsub = size(unique(data1[:,1], 1), 1)
>>>>     # Initialize per-subject data
>>>>     SubData = []
>>>>     for i = 1:Nsub
>>>>         idx = data1[:,1] .== i
>>>>         SubData = data1[idx, 2]
>>>>         parms0 = [1.0; 1.0]
>>>>         optimize(LogLike1, parms0, method = :nelder_mead)
>>>>     end
>>>> end
>>>>
>>>> N = 10^5
>>>> # Column 1: subject index; column 2: value
>>>> Data = zeros(N*2, 2)
>>>> for sub = 1:2
>>>>     Data[(N*(sub-1)+1):(N*sub), :] = [sub*ones(N) rand(Normal(10,2), N)]
>>>> end
>>>>
>>>> @time SubLoop1(Data)
>>>>
>>>> using Distributions, Optim
>>>>
>>>> function SubLoop2(data1)
>>>>     global SubData
>>>>     # Number of subjects
>>>>     Nsub = size(unique(data1[:,1], 1), 1)
>>>>     # Initialize per-subject data
>>>>     SubData = []
>>>>     for i = 1:Nsub
>>>>         idx = data1[:,1] .== i
>>>>         SubData = data1[idx, 2]
>>>>         parms0 = [1.0; 1.0]
>>>>         optimize(LogLike2, parms0, method = :nelder_mead)
>>>>     end
>>>> end
>>>>
>>>> function LogLike2(parms)
>>>>     L = pdf(Normal(parms[1], exp(parms[2])), SubData)
>>>>     LL = -sum(log(L))
>>>> end
>>>>
>>>> N = 10^5
>>>> # Column 1: subject index; column 2: value
>>>> Data = zeros(N*2, 2)
>>>> for sub = 1:2
>>>>     Data[(N*(sub-1)+1):(N*sub), :] = [sub*ones(N) rand(Normal(10,2), N)]
>>>> end
>>>>
>>>> @time SubLoop2(Data)
>>>>
>>>> On Monday, September 28, 2015 at 11:24:13 AM UTC-4, Kristoffer Carlsson wrote:
>>>>>
>>>>> From only that comment alone it is hard to give any further advice.
>>>>>
>>>>> What overhead are you seeing?
>>>>>
>>>>> Posting runnable code is the best way to get help.
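(Closing the loop on the question at the top of this message: if `optimize` accepted arbitrary callables, Christopher's global-variable version could be rewritten with call overloading roughly as below. This is a hypothetical sketch; whether a given Optim.jl version accepts a non-`Function` objective is exactly the open question raised above.)

```
using Distributions

# Hypothetical callable objective: the per-subject data travels in a
# concretely typed field instead of a global or a closure.
immutable LogLike
    SubData::Vector{Float64}
end

# Julia 0.4 call overloading (0.5+ would use (ll::LogLike)(parms) = ...)
Base.call(ll::LogLike, parms) =
    -sum(log(pdf(Normal(parms[1], exp(parms[2])), ll.SubData)))

# Inside the subject loop one would then write, e.g.:
#   optimize(LogLike(data1[idx, 2]), [1.0; 1.0], method = :nelder_mead)
```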
