I can get another 50% speedup by:
- Running the optimization twice and timing only the second run; this is
the more appropriate way to benchmark Julia because it excludes
compilation time (see the sketch below)
- Setting autodiff=true
- Breaking up the long chains of sums, which apparently are slow
At this point one really needs to compare the number of function
evaluations in each method, as John suggested.
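For the first two points, a minimal sketch of how I benchmark (not code from
the thread; it assumes the one-argument clogit_ll and a startb starting vector
as in Holger's gist):

# First call includes JIT compilation, so run once and discard its timing.
res = Optim.optimize(clogit_ll, startb, method = :l_bfgs, autodiff = true)
# Time the second run only; this is the number to compare with fminunc.
@time res = Optim.optimize(clogit_ll, startb, method = :l_bfgs, autodiff = true)
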
On Thursday, May 22, 2014 9:53:36 AM UTC-4, Holger Stichnoth wrote:
>
> Thanks, it's faster now (by roughly a factor of 3 on my computer), but
> still considerably slower than fminunc:
>
> Averages over 20 runs:
> Julia/Optim.optimize: 10.5s
> Matlab/fminunc: 2.6s
>
> Here are my Matlab settings:
> options = optimset('Display', 'iter', ...
> 'MaxIter', 2500, 'MaxFunEvals', 500000, ...
> 'TolFun', 1e-6, 'TolX', 1e-6, ...
> 'GradObj', 'off', 'DerivativeCheck', 'off');
>
> startb = ones(1,nVar)';
> [estim_clo, ll_clo]= ...
> fminunc(@(param)clogit_ll(param,data), ...
> startb,options);
>
> Could the speed issue be related to the following messages that I get when
> I run the Julia code?
> C:\Users\User\Documents\References\Software\Julia\mlubin>julia main.jl
> Warning: could not import Base.foldl into NumericExtensions
> Warning: could not import Base.foldr into NumericExtensions
> Warning: could not import Base.sum! into NumericExtensions
> Warning: could not import Base.maximum! into NumericExtensions
> Warning: could not import Base.minimum! into NumericExtensions
>
>
>
>
> On Thursday, May 22, 2014 2:18:36 PM UTC+1, Miles Lubin wrote:
>>
>> I was able to get a nearly 5x speedup by avoiding the matrix allocation
>> and making the accumulators type stable:
>> https://gist.github.com/mlubin/055690ddf2466e98bba6
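>>
>> (An illustrative sketch of the type-stability point, with hypothetical
>> names and not taken from the gist: initialize the accumulator with a
>> Float64 zero so its type never changes inside the loop.)
>>
>> function loglik_sum(contrib::Vector{Float64})
>>     ll = 0.0              # type-stable; `ll = 0` would start out as an Int
>>     for i in 1:length(contrib)
>>         ll += contrib[i]
>>     end
>>     return ll
>> end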
>>
>> How does this compare with Matlab now?
>>
>> On Thursday, May 22, 2014 6:38:44 AM UTC-4, Holger Stichnoth wrote:
>>>
>>> @ John: You are right: when I specify the function as
>>> clogit_ll(beta::Vector) instead of clogit_ll(beta::Vector{Float64}),
>>> autodiff = true works fine. Thanks for your help!
>>>
>>> @ Tim: I have set the rather strict default convergence criteria of
>>> Optim.optimize to Matlab's default values for fminunc, but the speed
>>> difference is still there.
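>>>
>>> (For reference, the call I use now looks roughly like the sketch below; it
>>> assumes Optim's iterations/ftol/xtol keyword names as the counterparts of
>>> fminunc's MaxIter, TolFun and TolX:)
>>>
>>> Optim.optimize(clogit_ll, ones(nVar), method = :l_bfgs,
>>>                iterations = 2500, ftol = 1e-6, xtol = 1e-6)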
>>>
>>> @ Miles/John: Getting rid of the global variables through closures and
>>> devectorizing made the optimization _slower_, not faster, in my case:
>>> https://gist.github.com/stichnoth/7f251ded83dcaa384273. I was surprised
>>> to see this, as I had expected a speed increase myself.
>>>
>>>
>>> On Wednesday, May 21, 2014 4:48:51 PM UTC+1, Miles Lubin wrote:
>>>>
>>>> Just to expand on what John said, I also think that if you can
>>>> restructure the code to devectorize it and avoid using global variables,
>>>> you'll see *much* better performance.
>>>>
>>>> The way to avoid globals is by using closures, for example:
>>>>
>>>> function foo(x, data)
>>>> ...
>>>> end
>>>>
>>>> ...
>>>> data_raw = readcsv(file)
>>>> data = reshape(data_raw, nObs, nChoices*(1+nVar), T)
>>>>
>>>> Optim.optimize(x-> foo(x,data), ...)
>>>>
>>>>
>>>>
>>>> On Tuesday, May 20, 2014 11:47:39 AM UTC-4, John Myles White wrote:
>>>>>
>>>>> Glad that you were able to figure out the source of your problems.
>>>>>
>>>>> It would be good to get a sense of the amount of time spent inside your
>>>>> objective function vs. the amount of time spent in the code for
>>>>> optimize(). In general, my experience is that >>90% of the compute time
>>>>> for an optimization problem is spent in the objective function itself. If
>>>>> you instrument your objective function to produce timing information on
>>>>> each call, that would help a lot, since you could then get a sense of how
>>>>> much time is being spent in the code for optimize() after accounting for
>>>>> your function itself.
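>>>>>
>>>>> (A minimal sketch of such instrumentation, not part of the original
>>>>> message: wrap the objective so each call is counted and timed, then
>>>>> compare the accumulated time with the total run time of optimize().)
>>>>>
>>>>> # Hypothetical wrapper around the one-argument clogit_ll from the gist;
>>>>> # globals are fine here because this is only for instrumentation.
>>>>> ncalls = 0
>>>>> obj_time = 0.0
>>>>> function clogit_ll_timed(beta::Vector)
>>>>>     global ncalls, obj_time
>>>>>     t0 = time()
>>>>>     val = clogit_ll(beta)
>>>>>     obj_time += time() - t0
>>>>>     ncalls += 1
>>>>>     return val
>>>>> end
>>>>> # Pass clogit_ll_timed to Optim.optimize in place of clogit_ll.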
>>>>>
>>>>> It’s also worth keeping in mind that your use of implicit finite
>>>>> differencing means that your objective function is being called a lot
>>>>> more times than theoretically necessary, so that any minor performance
>>>>> issue in it will very substantially slow down the solver.
>>>>>
>>>>> Regarding your objective function’s code, I suspect that the combination
>>>>> of global variables and memory-allocating vectorized arithmetic means
>>>>> that your objective function might be a good bit slower in Julia than in
>>>>> Matlab. Matlab seems to be a little better about garbage collection for
>>>>> vectorized arithmetic, and Julia is generally not able to optimize code
>>>>> involving global variables.
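>>>>>
>>>>> (An illustrative contrast, not from the original message: the vectorized
>>>>> form allocates a fresh array on every call, while the devectorized loop
>>>>> writes into a preallocated output.)
>>>>>
>>>>> x = rand(1000)
>>>>>
>>>>> # vectorized: allocates a new result array (and, in 2014-era Julia,
>>>>> # intermediate temporaries) on every call
>>>>> y = 2 .* x .+ 1
>>>>>
>>>>> # devectorized: one preallocated output, no temporaries inside the loop
>>>>> y = similar(x)
>>>>> for i in 1:length(x)
>>>>>     y[i] = 2 * x[i] + 1
>>>>> end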
>>>>>
>>>>> Hope that points you in the right direction.
>>>>>
>>>>> — John
>>>>>
>>>>> On May 20, 2014, at 8:34 AM, Holger Stichnoth <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi Andreas,
>>>>> hi John,
>>>>> hi Miles (via julia-opt, where I mistakenly also posted my question
>>>>> yesterday),
>>>>>
>>>>> Thanks for your help. Here is the link to the Gist I created:
>>>>> https://gist.github.com/anonymous/5f95ab1afd241c0a5962
>>>>>
>>>>> In the process of producing a minimal (non-)working example, I
>>>>> discovered that the unexpected results are due to the truncation of the
>>>>> logit choice probabilities in the objective function. Optim.optimize() is
>>>>> sensitive to this when method = :l_bfgs is used. With method =
>>>>> :nelder_mead, everything works fine. When I comment out the truncation,
>>>>> :l_bfgs works as well. However, I need to increase the xtol from its
>>>>> default of 1e-12 to at least 1e-10, otherwise I get the error that the
>>>>> linesearch failed to converge.
>>>>>
>>>>> I guess I should just do without the truncation. The logit probabilities
>>>>> are between 0 and 1 by construction anyway. I had just copied the
>>>>> truncation code from a friend who had told me that probabilities that are
>>>>> too close to 0 or 1 sometimes cause numerical problems in his Matlab code
>>>>> of the same function. With Optim.optimize(), it seems to be the other way
>>>>> around, i.e. moving the probabilities further away from 0 or 1 (even by
>>>>> tiny amounts) means that the stability of the (gradient-based) algorithm
>>>>> is reduced.
>>>>>
>>>>> So for me, the problem is solved. The problem was not with Optim.jl,
>>>>> but with my own code.
>>>>>
>>>>> The only other thing that I discovered when trying out Julia and
>>>>> Optim.jl is that the optimization is currently considerably slower than
>>>>> Matlab's fminunc. From the Gist I provided above, are there any potential
>>>>> performance improvements that I am missing out on?
>>>>>
>>>>> Best wishes,
>>>>> Holger
>>>>>
>>>>>
>>>>> On Monday, 19 May 2014 14:51:16 UTC+1, John Myles White wrote:
>>>>>>
>>>>>> If you can, please do share an example of your code. Logit-style models
>>>>>> are in general numerically unstable, so it would be good to see how
>>>>>> exactly you’ve coded things up.
>>>>>>
>>>>>> One thing you may be able to do is use automatic differentiation via the
>>>>>> autodiff = true keyword to optimize, but that assumes that your objective
>>>>>> function is written in completely pure Julia code (which means, for
>>>>>> example, that your code must not call any functions that are not written
>>>>>> in Julia, such as those provided by Distributions.jl).
>>>>>>
>>>>>> — John
>>>>>>
>>>>>> On May 19, 2014, at 4:09 AM, Andreas Noack Jensen <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>> What is the output of versioninfo() and Pkg.installed("Optim")? Also,
>>>>>> would it be possible to make a gist with your code?
>>>>>>
>>>>>>
>>>>>> 2014-05-19 12:44 GMT+02:00 Holger Stichnoth <[email protected]>:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I installed Julia a couple of days ago and was impressed how easy it
>>>>>>> was to make the switch from Matlab and to parallelize my code
>>>>>>> (something I had never done before in any language; I'm an economist
>>>>>>> with only limited programming experience, mainly in Stata and Matlab).
>>>>>>>
>>>>>>> However, I ran into a problem when using Optim.jl for Maximum
>>>>>>> Likelihood estimation of a conditional logit model. With the default
>>>>>>> Nelder-Mead algorithm, optimize from the Optim.jl package gave me the
>>>>>>> same result that I had obtained in Stata and Matlab.
>>>>>>>
>>>>>>> With gradient-based methods such as BFGS, however, the algorithm jumped
>>>>>>> from the starting values to parameter values that are completely
>>>>>>> different. This happened for all the starting values I tried, including
>>>>>>> the case in which I took a vector that is close to the optimum from the
>>>>>>> Nelder-Mead algorithm.
>>>>>>>
>>>>>>> The problem seems to be that the algorithm tried values so large (in
>>>>>>> absolute value) that this caused problems for the objective
>>>>>>> function, where I call exponential functions into which these
>>>>>>> parameter values enter. As a result, the optimization based on the BFGS
>>>>>>> algorithm did not produce the expected optimum.
>>>>>>>
>>>>>>> While I could try to provide the analytical gradient in this simple
>>>>>>> case, I was planning to use Julia for Maximum Likelihood or Simulated
>>>>>>> Maximum Likelihood estimation in cases where the gradient is more
>>>>>>> difficult to derive, so it would be good if I could make the optimizer
>>>>>>> also run with numerical gradients.
>>>>>>>
>>>>>>> I suspect that my problems with optimize from Optim.jl could have
>>>>>>> something to do with the gradient() function. In the example below, for
>>>>>>> instance, I do not understand why the output of the gradient function
>>>>>>> includes values such as 11470.7, given that the function values differ
>>>>>>> only minimally.
>>>>>>>
>>>>>>> Best wishes,
>>>>>>> Holger
>>>>>>>
>>>>>>>
>>>>>>> julia> Optim.gradient(clogit_ll,zeros(4))
>>>>>>> 60554544523933395e-22
>>>>>>> 0
>>>>>>> 0
>>>>>>> 0
>>>>>>>
>>>>>>> 14923.564009972584
>>>>>>> -60554544523933395e-22
>>>>>>> 0
>>>>>>> 0
>>>>>>> 0
>>>>>>>
>>>>>>> 14923.565228435104
>>>>>>> 0
>>>>>>> 60554544523933395e-22
>>>>>>> 0
>>>>>>> 0
>>>>>>>
>>>>>>> 14923.569064311248
>>>>>>> 0
>>>>>>> -60554544523933395e-22
>>>>>>> 0
>>>>>>> 0
>>>>>>>
>>>>>>> 14923.560174904109
>>>>>>> 0
>>>>>>> 0
>>>>>>> 60554544523933395e-22
>>>>>>> 0
>>>>>>>
>>>>>>> 14923.63413848258
>>>>>>> 0
>>>>>>> 0
>>>>>>> -60554544523933395e-22
>>>>>>> 0
>>>>>>>
>>>>>>> 14923.495218282553
>>>>>>> 0
>>>>>>> 0
>>>>>>> 0
>>>>>>> 60554544523933395e-22
>>>>>>>
>>>>>>> 14923.58699717058
>>>>>>> 0
>>>>>>> 0
>>>>>>> 0
>>>>>>> -60554544523933395e-22
>>>>>>>
>>>>>>> 14923.54224130672
>>>>>>> 4-element Array{Float64,1}:
>>>>>>> -100.609
>>>>>>> 734.0
>>>>>>> 11470.7
>>>>>>> 3695.5
>>>>>>>
>>>>>>> function clogit_ll(beta::Vector)
>>>>>>>
>>>>>>> # Print the parameters and the return value to
>>>>>>> # check how gradient() and optimize() work.
>>>>>>> println(beta)
>>>>>>> println(-sum(compute_ll(beta,T,0)))
>>>>>>>
>>>>>>> # compute_ll computes the individual likelihood contributions
>>>>>>> # in the sample. T is the number of periods in the panel. The 0
>>>>>>> # is not used in this simple example. In related functions, I
>>>>>>> # pass on different values here to estimate finite mixtures of
>>>>>>> # the conditional logit model.
>>>>>>> return -sum(compute_ll(beta,T,0))
>>>>>>> end
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards
>>>>>>
>>>>>> Andreas Noack Jensen
>>>>>>
>>>>>>
>>>>>>
>>>>>