Hi Tim
is this a concern even though I declare u1::Float64 = 0; at the beginning
of the function, in ll2?
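
(What I mean is something like this minimal sketch; the real ll2 is not
shown here, and x1/coeff are just stand-ins:

function ll2(x1::Vector{Float64}, coeff::Vector{Float64})
    u1::Float64 = 0   # typed local: assignments to u1 are converted to Float64
    for k = 1:length(coeff)
        u1 += x1[k]*coeff[k]
    end
    return u1
end

Does the typed declaration already take care of the conversion?)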
t.
On Sunday, 22 June 2014 15:57:53 UTC+1, Tim Holy wrote:
>
> If x1, ..., x6 or coeff are Float64 arrays, then the initialization
>
> u1 = 0; u2 = 0; u3 = 0; u4 = 0; u5 = 0; u6 = 0
>
> is problematic as soon as you get to
>
> for k=1:nVar
> u1 += x1[i + ni*( k-1 + nk* (t-1))]*coeff[k]
> u2 += x2[i + ni*( k-1 + nk* (t-1))]*coeff[k]
> u3 += x3[i + ni*( k-1 + nk* (t-1))]*coeff[k]
> u4 += x4[i + ni*( k-1 + nk* (t-1))]*coeff[k]
> u5 += x5[i + ni*( k-1 + nk* (t-1))]*coeff[k]
> u6 += x6[i + ni*( k-1 + nk* (t-1))]*coeff[k]
> end
>
> because you're initializing them to be integers but then they get converted
> into Float64. A more careful approach is to do something like this:
>
> T = typeof(one(eltype(x1))*one(eltype(coeff)))
> TT = typeof(one(T) + one(T))
> u1 = u2 = u3 = u4 = u5 = u6 = zero(TT)
>
> In general, code_typed is your friend: look for Union types.
>
> T = Vector{Float64}
> code_typed(ll, (T, T, T, T, T, T, T, T, T, T, T, T, T, Float64, Int, Int))
>
> and you'll see Union types all over the place. (TypeCheck also, but it
> didn't seem to pick up this error.) And see the Performance Tips section
> of the manual.
>
> --Tim
>
>
> On Sunday, June 22, 2014 04:50:16 AM Thibaut Lamadon wrote:
> > Hi guys,
> >
> > I wanted to look into this as well. The main issue, I think, is the speed
> > of the objective function. Running @time on the objective function
> > suggested a large amount of byte allocation. Checking the types revealed
> > that getting x and y from data would set their types to Any.
> >
> > So I converted the data to Array{Float64,3}, and then I changed the
> > function to store only the cumulative sum rather than the vector of
> > likelihood contributions. I ran the objective function 50 times.
> >
> > without the convert I get a total time of 9.18s
> > with the convert and original function I get 4.15s
> > with the convert and the new function I get 1.49s
> > with Matlab, I get 0.64s
> >
> > so Matlab still appears to be 2.5 times faster. But I am guessing Matlab
> > is using SIMD instructions when computing matrix multiplications. So we
> > would need to try to use BLAS in Julia for the matrix multiplication to
> > get a very good comparison.
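> >
> > Something like this sketch is what I have in mind (X and coeff are
> > made-up stand-ins for the real arrays):
> >
> > X = rand(1000, 50)   # stand-in data: nObs x nVar
> > coeff = rand(50)
> > u = X * coeff        # u[i] = sum over k of X[i,k]*coeff[k], uses BLAS gemv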
> >
> > Anyway, fixing the type of the input and just summing inside the loop
> > gives a 6x speed-up.
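> >
> > The new function follows roughly this pattern (a sketch with a made-up
> > ll_sum, not the actual code):
> >
> > function ll_sum(p::Matrix{Float64})
> >     s = 0.0                  # Float64 accumulator, type stable
> >     for j = 1:size(p,2), i = 1:size(p,1)
> >         s += log(p[i,j])     # sum inside the loop, no temporary vector
> >     end
> >     return s
> > end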
> >
> > PS: Running the full optimization,
> > with the convert and the new function I get 4.8s
> > with my Matlab I get 4s
> >
> > I could not commit to Holger's gist, so I forked it:
> > https://gist.github.com/tlamadon/58c47c115f8cf2388e89
> > Please check that I have not done anything stupid, but the output seemed
> > similar.
> >
> > Holger, I hope you are having a good time at home (or in Paris?). And a
> > World Cup note: Allez les bleus!
> >
> > very best,
> >
> > t.
> >
> > On Tuesday, 27 May 2014 14:03:30 UTC+1, Holger Stichnoth wrote:
> > > Hi John, hi Miles,
> > >
> > > Thanks to both of you. I did not have time to look into this over the
> > > weekend; I will do so in the next couple of days. I have already
> > > uploaded the Matlab files for comparison:
> > > https://gist.github.com/stichnoth/7f251ded83dcaa384273
> > >
> > > Holger
> > >
> > > On Thursday, 22 May 2014 23:03:58 UTC+1, John Myles White wrote:
> > >> Yeah, this case is tricky enough that we really need to get down to
> > >> the lowest details:
> > >>
> > >> (1) Do Julia and Matlab perform similar numbers of function
> > >> evaluations?
> > >>
> > >> (2) If they don't perform similar numbers of function evaluations, is
> > >> one of them producing a better solution? Is the one that's producing a
> > >> better solution doing more function evaluations?
> > >>
> > >> (3) If they're doing different numbers of function evaluations and the
> > >> one that does _fewer_ evaluations also produces a better solution,
> > >> what's the reason? For example, is our line search default less
> > >> effective for this problem than the Matlab line search? If you try
> > >> other line search algorithms, do the results stay the same?
> > >>
> > >> Unfortunately, answering all of these reliably may take us all pretty
> > >> far down the rabbit hole. But they're worth pushing on systematically.
> > >>
> > >> -- John
> > >>
> > >> On May 22, 2014, at 2:59 PM, Miles Lubin <[email protected]> wrote:
> > >>
> > >> I can get another 50% speedup by:
> > >>
> > >> - Running the optimization twice and timing the second run only (see
> > >> the sketch below); this is the more appropriate way to benchmark Julia
> > >> because it excludes the function compilation time
> > >> - Setting autodiff=true
> > >> - Breaking up the long chains of sums; apparently these are slow
> > >>
> > >> At this point one really needs to compare the number of function
> > >> evaluations in each method, as John suggested.
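> > >>
> > >> The timing pattern for the first point is simply (a sketch; f and x0
> > >> stand for the objective and the starting values):
> > >>
> > >> optimize(f, x0)          # first call includes JIT compilation
> > >> @time optimize(f, x0)    # second call measures execution time only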
> > >>
> > >> On Thursday, May 22, 2014 9:53:36 AM UTC-4, Holger Stichnoth wrote:
> > >>> Thanks, it's faster now (by roughly a factor of 3 on my computer),
> > >>> but still considerably slower than fminunc:
> > >>>
> > >>> Averages over 20 runs:
> > >>> Julia/Optim.optimize: 10.5s
> > >>> Matlab/fminunc: 2.6s
> > >>>
> > >>> Here are my Matlab settings:
> > >>>
> > >>> options = optimset('Display', 'iter', ...
> > >>>     'MaxIter', 2500, 'MaxFunEvals', 500000, ...
> > >>>     'TolFun', 1e-6, 'TolX', 1e-6, ...
> > >>>     'GradObj', 'off', 'DerivativeCheck', 'off');
> > >>>
> > >>> startb = ones(1,nVar)';
> > >>> [estim_clo, ll_clo] = ...
> > >>>     fminunc(@(param)clogit_ll(param,data), ...
> > >>>     startb, options);
> > >>>
> > >>> Could the speed issue be related to the following messages that I
> > >>> get when I run the Julia code?
> > >>> C:\Users\User\Documents\References\Software\Julia\mlubin>julia main.jl
> > >>> Warning: could not import Base.foldl into NumericExtensions
> > >>> Warning: could not import Base.foldr into NumericExtensions
> > >>> Warning: could not import Base.sum! into NumericExtensions
> > >>> Warning: could not import Base.maximum! into NumericExtensions
> > >>> Warning: could not import Base.minimum! into NumericExtensions
> > >>>
> > >>> On Thursday, 22 May 2014 14:18:36 UTC+1, Miles Lubin wrote:
> > >>>> I was able to get a nearly 5x speedup by avoiding the matrix
> > >>>> allocation and making the accumulators type stable:
> > >>>> https://gist.github.com/mlubin/055690ddf2466e98bba6
> > >>>>
> > >>>> How does this compare with Matlab now?
> > >>>>
> > >>>> On Thursday, May 22, 2014 6:38:44 AM UTC-4, Holger Stichnoth wrote:
> > >>>>> @John: You are right, when I specify the function as
> > >>>>> clogit_ll(beta::Vector) instead of clogit_ll(beta::Vector{Float64}),
> > >>>>> autodiff = true works fine. Thanks for your help!
> > >>>>>
> > >>>>> @Tim: I have set the rather strict default convergence criteria of
> > >>>>> Optim.optimize to Matlab's default values for fminunc, but the speed
> > >>>>> difference is still there.
> > >>>>>
> > >>>>> @Miles/John: Getting rid of the global variables through closures
> > >>>>> and devectorizing made the optimization _slower_, not faster, in my
> > >>>>> case: https://gist.github.com/stichnoth/7f251ded83dcaa384273. I was
> > >>>>> surprised to see this as I expected a speed increase myself.
> > >>>>>
> > >>>>> On Wednesday, 21 May 2014 16:48:51 UTC+1, Miles Lubin wrote:
> > >>>>>> Just to extend on what John said, I also think that if you can
> > >>>>>> restructure the code to devectorize it and avoid using global
> > >>>>>> variables, you'll see *much* better performance.
> > >>>>>>
> > >>>>>> The way to avoid globals is by using closures, for example:
> > >>>>>>
> > >>>>>> function foo(x, data)
> > >>>>>>     ...
> > >>>>>> end
> > >>>>>>
> > >>>>>> ...
> > >>>>>> data_raw = readcsv(file)
> > >>>>>> data = reshape(data_raw, nObs, nChoices*(1+nVar), T)
> > >>>>>>
> > >>>>>> Optim.optimize(x -> foo(x, data), ...)
> > >>>>>>
> > >>>>>> On Tuesday, May 20, 2014 11:47:39 AM UTC-4, John Myles White wrote:
> > >>>>>>> Glad that you were able to figure out the source of your problems.
> > >>>>>>>
> > >>>>>>> It would be good to get a sense of the amount of time spent inside
> > >>>>>>> your objective function vs. the amount of time spent in the code
> > >>>>>>> for optimize(). In general, my experience is that >>90% of the
> > >>>>>>> compute time for an optimization problem is spent in the objective
> > >>>>>>> function itself. If you instrument your objective function to
> > >>>>>>> produce timing information on each call, that would help a lot,
> > >>>>>>> since you could then get a sense of how much time is being spent
> > >>>>>>> in the code for optimize() after accounting for your function
> > >>>>>>> itself.
> > >>>>>>>
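> > >>>>>>> For instance, something like this sketch (timed_obj and f_inner
> > >>>>>>> are made-up names standing in for your actual objective):
> > >>>>>>>
> > >>>>>>> function timed_obj(beta)
> > >>>>>>>     t0 = time()
> > >>>>>>>     v = f_inner(beta)    # the actual likelihood computation
> > >>>>>>>     println("objective call: ", time() - t0, " seconds")
> > >>>>>>>     return v
> > >>>>>>> end
> > >>>>>>>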
> > >>>>>>> It's also worth keeping in mind that your use of implicit finite
> > >>>>>>> differencing means that your objective function is being called a
> > >>>>>>> lot more times than theoretically necessary, so that any minor
> > >>>>>>> performance issue in it will very substantially slow down the
> > >>>>>>> solver.
> > >>>>>>>
> > >>>>>>> Regarding your objective function's code, I suspect that the
> > >>>>>>> combination of global variables and memory-allocating vectorized
> > >>>>>>> arithmetic means that your objective function might be a good bit
> > >>>>>>> slower in Julia than in Matlab. Matlab seems to be a little better
> > >>>>>>> about garbage collection for vectorized arithmetic, and Julia is
> > >>>>>>> generally not able to optimize code involving global variables.
> > >>>>>>>
> > >>>>>>> Hope that points you in the right direction.
> > >>>>>>>
> > >>>>>>> -- John
> > >>>>>>>
> > >>>>>>> On May 20, 2014, at 8:34 AM, Holger Stichnoth <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>> Hi Andreas,
> > >>>>>>> hi John,
> > >>>>>>> hi Miles (via julia-opt, where I mistakenly also posted my
> > >>>>>>> question yesterday),
> > >>>>>>>
> > >>>>>>> Thanks for your help. Here is the link to the Gist I created:
> > >>>>>>> https://gist.github.com/anonymous/5f95ab1afd241c0a5962
> > >>>>>>>
> > >>>>>>> In the process of producing a minimal (non-)working example, I
> > >>>>>>> discovered that the unexpected results are due to the truncation
> > >>>>>>> of the logit choice probabilities in the objective function.
> > >>>>>>> Optim.optimize() is sensitive to this when method = :l_bfgs is
> > >>>>>>> used. With method = :nelder_mead, everything works fine. When I
> > >>>>>>> comment out the truncation, :l_bfgs works as well. However, I need
> > >>>>>>> to increase the xtol from its default of 1e-12 to at least 1e-10,
> > >>>>>>> otherwise I get the error that the linesearch failed to converge.
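> > >>>>>>>
> > >>>>>>> Concretely, the call looks roughly like this (a sketch, not my
> > >>>>>>> exact code):
> > >>>>>>>
> > >>>>>>> res = Optim.optimize(clogit_ll, zeros(4),
> > >>>>>>>                      method = :l_bfgs, xtol = 1e-10)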
> > >>>>>>>
> > >>>>>>> I guess I should just do without the truncation. The logit
> > >>>>>>> probabilities are between 0 and 1 by construction anyway. I had
> > >>>>>>> just copied the truncation code from a friend who had told me that
> > >>>>>>> probabilities that are too close to 0 or 1 sometimes cause
> > >>>>>>> numerical problems in his Matlab code of the same function. With
> > >>>>>>> Optim.optimize(), it seems to be the other way around, i.e. moving
> > >>>>>>> the probabilities further away from 0 or 1 (even by tiny amounts)
> > >>>>>>> reduces the stability of the (gradient-based) algorithm.
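> > >>>>>>>
> > >>>>>>> (The truncation was essentially a clamp on each probability,
> > >>>>>>> something like this sketch with a made-up cutoff:
> > >>>>>>>
> > >>>>>>> p = clamp(p, 1e-10, 1.0 - 1e-10)   # keep p away from 0 and 1
> > >>>>>>>
> > >>>>>>> which makes the objective locally flat near the cutoff, and that
> > >>>>>>> plausibly confuses the finite-difference gradient.)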
> > >>>>>>>
> > >>>>>>> So for me, the problem is solved. The problem was not with
> > >>>>>>> Optim.jl, but with my own code.
> > >>>>>>>
> > >>>>>>> The only other thing that I discovered when trying out Julia and
> > >>>>>>> Optim.jl is that the optimization is currently considerably slower
> > >>>>>>> than Matlab's fminunc. From the Gist I provided above, are there
> > >>>>>>> any potential performance improvements that I am missing out on?
> > >>>>>>>
> > >>>>>>> Best wishes,
> > >>>>>>> Holger
> > >>>>>>>
> > >>>>>>> On Monday, 19 May 2014 14:51:16 UTC+1, John Myles White wrote:
> > >>>>>>>> If you can, please do share an example of your code. Logit-style
> > >>>>>>>> models are in general numerically unstable, so it would be good
> > >>>>>>>> to see how exactly you've coded things up.
> > >>>>>>>>
> > >>>>>>>> One thing you may be able to do is use automatic differentiation
> > >>>>>>>> via the autodiff = true keyword to optimize, but that assumes
> > >>>>>>>> that your objective function is written in completely pure Julia
> > >>>>>>>> code (which means, for example, that your code must not call any
> > >>>>>>>> of the functions provided by Distributions.jl that are not
> > >>>>>>>> written in Julia).
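> > >>>>>>>>
> > >>>>>>>> For instance, something like this sketch (with clogit_ll and
> > >>>>>>>> zeros(4) standing in for your function and starting values):
> > >>>>>>>>
> > >>>>>>>> res = optimize(clogit_ll, zeros(4),
> > >>>>>>>>                method = :l_bfgs, autodiff = true)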
> > >>>>>>>>
> > >>>>>>>> — John
> > >>>>>>>>
> > >>>>>>>> On May 19, 2014, at 4:09 AM, Andreas Noack Jensen <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> What is the output of versioninfo() and Pkg.installed("Optim")?
> > >>>>>>>> Also, would it be possible to make a gist with your code?
> > >>>>>>>>
> > >>>>>>>> 2014-05-19 12:44 GMT+02:00 Holger Stichnoth <[email protected]>:
> > >>>>>>>>> Hello,
> > >>>>>>>>>
> > >>>>>>>>> I installed Julia a couple of days ago and was impressed how
> > >>>>>>>>> easy it was to make the switch from Matlab and to parallelize my
> > >>>>>>>>> code (something I had never done before in any language; I'm an
> > >>>>>>>>> economist with only limited programming experience, mainly in
> > >>>>>>>>> Stata and Matlab).
> > >>>>>>>>>
> > >>>>>>>>> However, I ran into a problem when using Optim.jl for Maximum
> > >>>>>>>>> Likelihood estimation of a conditional logit model. With the
> > >>>>>>>>> default Nelder-Mead algorithm, optimize from the Optim.jl
> > >>>>>>>>> package gave me the same result that I had obtained in Stata and
> > >>>>>>>>> Matlab.
> > >>>>>>>>>
> > >>>>>>>>> With gradient-based methods such as BFGS, however, the
> > >>>>>>>>> algorithm jumped from the starting values to parameter values
> > >>>>>>>>> that are completely different. This happened for all the
> > >>>>>>>>> starting values I tried, including the case in which I took a
> > >>>>>>>>> vector that is close to the optimum from the Nelder-Mead
> > >>>>>>>>> algorithm.
> > >>>>>>>>>
> > >>>>>>>>> The problem seems to be that the algorithm tried values so
> > >>>>>>>>> large (in absolute value) that this caused problems for the
> > >>>>>>>>> objective function, where I call exponential functions into
> > >>>>>>>>> which these parameter values enter. As a result, the
> > >>>>>>>>> optimization based on the BFGS algorithm did not produce the
> > >>>>>>>>> expected optimum.
> > >>>>>>>>>
> > >>>>>>>>> While I could try to provide the analytical gradient in this
> > >>>>>>>>> simple case, I was planning to use Julia for Maximum Likelihood
> > >>>>>>>>> or Simulated Maximum Likelihood estimation in cases where the
> > >>>>>>>>> gradient is more difficult to derive, so it would be good if I
> > >>>>>>>>> could make the optimizer run also with numerical gradients.
> > >>>>>>>>>
> > >>>>>>>>> I suspect that my problems with optimize from Optim.jl could
> > >>>>>>>>> have something to do with the gradient() function. In the
> > >>>>>>>>> example below, for instance, I do not understand why the output
> > >>>>>>>>> of the gradient function includes values such as 11470.7, given
> > >>>>>>>>> that the function values differ only minimally.
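> > >>>>>>>>>
> > >>>>>>>>> (Presumably gradient() uses central differences,
> > >>>>>>>>>
> > >>>>>>>>> g_i = (f(x + h*e_i) - f(x - h*e_i)) / (2h),
> > >>>>>>>>>
> > >>>>>>>>> with h = 6.0554e-6 here; for the third coordinate that gives
> > >>>>>>>>> (14923.634138 - 14923.495218) / (2 * 6.0554e-6) = 11470.7,
> > >>>>>>>>> which matches the printed value below.)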
> > >>>>>>>>>
> > >>>>>>>>> Best wishes,
> > >>>>>>>>> Holger
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> julia> Optim.gradient(clogit_ll,zeros(4))
> > >>>>>>>>> 60554544523933395e-22
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>>
> > >>>>>>>>> 14923.564009972584
> > >>>>>>>>> -60554544523933395e-22
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>>
> > >>>>>>>>> 14923.565228435104
> > >>>>>>>>> 0
> > >>>>>>>>> 60554544523933395e-22
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>>
> > >>>>>>>>> 14923.569064311248
> > >>>>>>>>> 0
> > >>>>>>>>> -60554544523933395e-22
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>>
> > >>>>>>>>> 14923.560174904109
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>> 60554544523933395e-22
> > >>>>>>>>> 0
> > >>>>>>>>>
> > >>>>>>>>> 14923.63413848258
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>> -60554544523933395e-22
> > >>>>>>>>> 0
> > >>>>>>>>>
> > >>>>>>>>> 14923.495218282553
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>> 60554544523933395e-22
> > >>>>>>>>>
> > >>>>>>>>> 14923.58699717058
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>> 0
> > >>>>>>>>> -60554544523933395e-22
> > >>>>>>>>>
> > >>>>>>>>> 14923.54224130672
> > >>>>>>>>>
> > >>>>>>>>> 4-element Array{Float64,1}:
> > >>>>>>>>>   -100.609
> > >>>>>>>>>    734.0
> > >>>>>>>>>  11470.7
> > >>>>>>>>>   3695.5
> > >>>>>>>>>
> > >>>>>>>>> function clogit_ll(beta::Vector)
> > >>>>>>>>>     # Print the parameters and the return value to
> > >>>>>>>>>     # check how gradient() and optimize() work.
> > >>>>>>>>>     println(beta)
> > >>>>>>>>>     println(-sum(compute_ll(beta,T,0)))
> > >>>>>>>>>
> > >>>>>>>>>     # compute_ll computes the individual likelihood contributions
> > >>>>>>>>>     # in the sample. T is the number of periods in the panel. The 0
> > >>>>>>>>>     # is not used in this simple example. In related functions, I
> > >>>>>>>>>     # pass on different values here to estimate finite mixtures of
> > >>>>>>>>>     # the conditional logit model.
> > >>>>>>>>>     return -sum(compute_ll(beta,T,0))
> > >>>>>>>>> end
> > >>>>>>>>
> > >>>>>>>> Andreas Noack Jensen