Re: [julia-users] Optim.jl: unexpected results when using gradient-based methods

Holger Stichnoth Tue, 27 May 2014 06:04:01 -0700

Hi John, hi Miles,

Thanks to both of you. I did not have time to look into this over the 
weekend; I will do so in the next couple of days. I have already uploaded 
the Matlab files for comparison: 
https://gist.github.com/stichnoth/7f251ded83dcaa384273


Holger


On Thursday, 22 May 2014 23:03:58 UTC+1, John Myles White wrote:
>
> Yeah, this case is tricky enough that we really need to get down to the 
> lowest details:
>
> (1) Do Julia and Matlab perform similar numbers of function evaluations?
>
> (2) If they don't perform similar numbers of function evaluations, is one 
> of them producing a better solution? Is the one that's producing a better 
> solution doing more function evaluations?
>
> (3) If they're doing different numbers of function evaluations and the one 
> that does _fewer_ evaluations also produces a better solution, what's the 
> reason? For example, is our line search default less effective for this 
> problem than the Matlab line search? If you try other line search 
> algorithms, do the results stay the same?
>
> Unfortunately, answering all of these reliably make take us all pretty far 
> down the rabbit hole. But they're worth pushing on systematically.
>
>  -- John
>
> On May 22, 2014, at 2:59 PM, Miles Lubin <[email protected] <javascript:>> 
> wrote:
>
> I can get another 50% speedup by:
>
> - Running the optimization twice and timing the second run only, this is 
> the more appropriate way to benchmark julia because it excludes the 
> function compilation time
> - Setting autodiff=true
> - Breaking up the long chains of sums, apparently these seem to be slow
>
> At this point one really needs to compare the number of function 
> evaluations in each method, as John suggested.
>
> On Thursday, May 22, 2014 9:53:36 AM UTC-4, Holger Stichnoth wrote:
>>
>> Thanks, it's faster now (by roughly a factor of 3 on my computer), but 
>> still considerably slower than fminunc:
>>
>> Averages over 20 runs:
>> Julia/Optim.optimize: 10.5s
>> Matlab/fminunc: 2.6s
>>
>> Here are my Matlab settings:
>> options = optimset('Display', 'iter', ...
>>      'MaxIter', 2500, 'MaxFunEvals', 500000, ...
>>      'TolFun', 1e-6, 'TolX', 1e-6, ...
>>      'GradObj', 'off', 'DerivativeCheck', 'off');
>>
>> startb      = ones(1,nVar)';
>> [estim_clo, ll_clo]= ...
>>      fminunc(@(param)clogit_ll(param,data), ...
>>      startb,options);
>>
>> Could the speed issue be related to the following messages that I get 
>> when I run the Julia code?
>> C:\Users\User\Documents\References\Software\Julia\mlubin>julia main.jl
>> Warning: could not import Base.foldl into NumericExtensions
>> Warning: could not import Base.foldr into NumericExtensions
>> Warning: could not import Base.sum! into NumericExtensions
>> Warning: could not import Base.maximum! into NumericExtensions
>> Warning: could not import Base.minimum! into NumericExtensions
>>
>>
>>
>>
>> Am Donnerstag, 22. Mai 2014 14:18:36 UTC+1 schrieb Miles Lubin:
>>>
>>> I was able to get a nearly 5x speedup by avoiding the matrix allocation 
>>> and making the accumulators type stable: 
>>> https://gist.github.com/mlubin/055690ddf2466e98bba6
>>>
>>> How does this compare with Matlab now?
>>>
>>> On Thursday, May 22, 2014 6:38:44 AM UTC-4, Holger Stichnoth wrote:
>>>>
>>>> @ John: You are right, when I specify the function as 
>>>> clogit_ll(beta::Vector) instead of clogit_ll(beta::Vector{Float64}), 
>>>> autodiff = true works fine. Thanks for your help!
>>>>
>>>> @ Tim: I have set the rather strict default convergence criteria of 
>>>> Optim.optimize to Matlab's default values for fminunc, but the speed 
>>>> difference is still there.
>>>>
>>>> @ Miles/John: Getting rid of the global variables through closures and 
>>>> devectorizing made the optimization _slower_ not faster in my case: 
>>>> https://gist.github.com/stichnoth/7f251ded83dcaa384273. I was 
>>>> surprised to see this as I expected a speed increase myself. 
>>>>  
>>>>
>>>> Am Mittwoch, 21. Mai 2014 16:48:51 UTC+1 schrieb Miles Lubin:
>>>>>
>>>>> Just to extend on what John said, also think that if you can 
>>>>> restructure the code to devectorize it and avoid using global variables, 
>>>>> you'll see *much* better performance. 
>>>>>
>>>>> The way to avoid globals is by using closures, for example:
>>>>> function foo(x, data)
>>>>>     ...
>>>>> end
>>>>>
>>>>>
>>>>> ...
>>>>> data_raw = readcsv(file)
>>>>> data = reshape(data_raw, nObs, nChoices*(1+nVar), T)
>>>>>
>>>>>
>>>>>
>>>>> Optim.optimize(x-> foo(x,data), ...)
>>>>>
>>>>>
>>>>>
>>>>> On Tuesday, May 20, 2014 11:47:39 AM UTC-4, John Myles White wrote:
>>>>>>
>>>>>> Glad that you were able to figure out the source of your problems.
>>>>>>
>>>>>> It would be good to get a sense of the amount of time spent inside 
>>>>>> your objective function vs. the amount of time spent in the code for 
>>>>>> optimize(). In general, my experience is that >>90% of the compute time 
>>>>>> for 
>>>>>> an optimization problem is spent in the objective function itself. If 
>>>>>> you 
>>>>>> instrument your objective function to produce timing information on each 
>>>>>> call, that would help a lot since you could then get a sense of how much 
>>>>>> time is being spent in the code for optimize() after accounting for your 
>>>>>> function itself.
>>>>>>
>>>>>> It’s also worth keeping in mind that your use of implicit finite 
>>>>>> differencing means that your objective function is being called a lot 
>>>>>> more 
>>>>>> times than theoretically necessary, so that any minor performance issue 
>>>>>> in 
>>>>>> it will very substantially slow down the solver.
>>>>>>
>>>>>> Regarding you objective function’s code, I suspect that the 
>>>>>> combination of global variables and memory-allocating vectorized 
>>>>>> arithmetic 
>>>>>> means that your objective function might be a good bit slower in Julia 
>>>>>> than 
>>>>>> in Matlab. Matlab seems to be a little better about garbage collection 
>>>>>> for 
>>>>>> vectorized arithmetic and Julia is generally not able to optimize code 
>>>>>> involving global variables.
>>>>>>
>>>>>> Hope that points you in the right direction.
>>>>>>
>>>>>>  — John
>>>>>>
>>>>>> On May 20, 2014, at 8:34 AM, Holger Stichnoth <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>> Hi Andreas,
>>>>>> hi John,
>>>>>> hi Miles (via julia-opt, where I mistakenly also posted my question 
>>>>>> yesterday),
>>>>>>
>>>>>> Thanks for your help. Here is the link to the Gist I created: 
>>>>>> https://gist.github.com/anonymous/5f95ab1afd241c0a5962
>>>>>>
>>>>>> In the process of producing a minimal (non-)working example, I 
>>>>>> discovered that the unexpected results are due to the truncation of the 
>>>>>> logit choice probabilities in the objective function. Optim.optimize() 
>>>>>> is 
>>>>>> sensitive to this when method = :l_bfgs is used. With method = 
>>>>>> :nelder_mead, everything works fine. When I comment out the truncation, 
>>>>>> :l_bfgs works as well. However, I need to increase the xtol from its 
>>>>>> default of 1e-12 to at least 1e-10, otherwise I get the error that the 
>>>>>> linesearch failed to converge.
>>>>>>
>>>>>> I guess I should just do without the truncation. The logit 
>>>>>> probabilities are between 0 and 1 by construction anyway. I had just 
>>>>>> copied 
>>>>>> the truncation code from a friend who had told me that probabilities 
>>>>>> that 
>>>>>> are too close to 0 or 1 sometimes cause numerical problems in his Matlab 
>>>>>> code of the same function. With Optim.optimize(), it seems to be the 
>>>>>> other 
>>>>>> way around, i.e. moving the probabilities further away from 0 or 1 (even 
>>>>>> by 
>>>>>> tiny amounts) means that the stability of the (gradient-based) algorithm 
>>>>>> is 
>>>>>> reduced.
>>>>>>
>>>>>> So for me, the problem is solved. The problem was not with Optim.jl, 
>>>>>> but with my own code.
>>>>>>
>>>>>> The only other thing that I discovered when trying out Julia and 
>>>>>> Optim.jl is that the optimization is currently considerably slower than 
>>>>>> Matlab's fminunc. From the Gist I provided above, are there any 
>>>>>> potential 
>>>>>> performance improvements that I am missing out on?
>>>>>>
>>>>>> Best wishes,
>>>>>> Holger
>>>>>>
>>>>>>
>>>>>> On Monday, 19 May 2014 14:51:16 UTC+1, John Myles White wrote:
>>>>>>>
>>>>>>> If you can, please do share an example of your code. Logit-style 
>>>>>>> models are in general numerically unstable, so it would be good to see 
>>>>>>> how 
>>>>>>> exactly you’ve coded things up.
>>>>>>>
>>>>>>> One thing you may be able to do is use automatic differentiation via 
>>>>>>> the autodiff = true keyword to optimize, but that assumes that your 
>>>>>>> objective function is written in completely pure Julia code (which 
>>>>>>> means, 
>>>>>>> for example, that your code must not call any of functions not written 
>>>>>>> in 
>>>>>>> Julia provided by Distributions.jl).
>>>>>>>
>>>>>>>  — John
>>>>>>>
>>>>>>> On May 19, 2014, at 4:09 AM, Andreas Noack Jensen <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>> What is the output of versioninfo() and Pkg.installed("Optim")? 
>>>>>>> Also, would it be possible to make a gist with your code?
>>>>>>>
>>>>>>>
>>>>>>> 2014-05-19 12:44 GMT+02:00 Holger Stichnoth <[email protected]>:
>>>>>>>
>>>>>>>>  Hello,
>>>>>>>>
>>>>>>>> I installed Julia a couple of days ago and was impressed how easy 
>>>>>>>> it was to make the switch from Matlab and to parallelize my code
>>>>>>>> (something I had never done before in any language; I'm an 
>>>>>>>> economist with only limited programming experience, mainly in Stata 
>>>>>>>> and 
>>>>>>>> Matlab).
>>>>>>>>
>>>>>>>> However, I ran into a problem when using Optim.jl for Maximum 
>>>>>>>> Likelihood estimation of a conditional logit model. With the default 
>>>>>>>> Nelder-Mead algorithm, optimize from the Optim.jl package gave me the 
>>>>>>>> same 
>>>>>>>> result that I had obtained in Stata and Matlab.
>>>>>>>>
>>>>>>>> With gradient-based methods such as BFGS, however, the algorithm 
>>>>>>>> jumped from the starting values to parameter values that are 
>>>>>>>> completely 
>>>>>>>> different. This happened for all thr starting values I tried, 
>>>>>>>> including the 
>>>>>>>> case in which I took a vector that is closed to the optimum from the 
>>>>>>>> Nelder-Mead algorithm.  
>>>>>>>>
>>>>>>>> The problem seems to be that the algorithm tried values so large 
>>>>>>>> (in absolute value) that this caused problems for the objective
>>>>>>>> function, where I call exponential functions into which these 
>>>>>>>> parameter values enter. As a result, the optimization based on the 
>>>>>>>> BFGS 
>>>>>>>> algorithm did not produce the expected optimum.
>>>>>>>>
>>>>>>>> While I could try to provide the analytical gradient in this simple 
>>>>>>>> case, I was planning to use Julia for Maximum Likelihood or Simulated 
>>>>>>>> Maximum Likelihood estimation in cases where the gradient is more 
>>>>>>>> difficult 
>>>>>>>> to derive, so it would be good if I could make the optimizer run also 
>>>>>>>> with 
>>>>>>>> numerical gradients.
>>>>>>>>
>>>>>>>> I suspect that my problems with optimize from Optim.jl could have 
>>>>>>>> something to do with the gradient() function. In the example below, 
>>>>>>>> for 
>>>>>>>> instance, I do not understand why the output of the gradient function 
>>>>>>>> includes values such as 11470.7, given that the function values differ 
>>>>>>>> only 
>>>>>>>> minimally.
>>>>>>>>
>>>>>>>> Best wishes,
>>>>>>>> Holger
>>>>>>>>
>>>>>>>>
>>>>>>>> julia> Optim.gradient(clogit_ll,zeros(4))
>>>>>>>> 60554544523933395e-22
>>>>>>>> 0Op
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>>
>>>>>>>> 14923.564009972584
>>>>>>>> -60554544523933395e-22
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>>
>>>>>>>> 14923.565228435104
>>>>>>>> 0
>>>>>>>> 60554544523933395e-22
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>>
>>>>>>>> 14923.569064311248
>>>>>>>> 0
>>>>>>>> -60554544523933395e-22
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>>
>>>>>>>> 14923.560174904109
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>> 60554544523933395e-22
>>>>>>>> 0
>>>>>>>>
>>>>>>>> 14923.63413848258
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>> -60554544523933395e-22
>>>>>>>> 0
>>>>>>>>
>>>>>>>> 14923.495218282553
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>> 60554544523933395e-22
>>>>>>>>
>>>>>>>> 14923.58699717058
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>> 0
>>>>>>>> -60554544523933395e-22
>>>>>>>>
>>>>>>>> 14923.54224130672
>>>>>>>> 4-element Array{Float64,1}:
>>>>>>>>   -100.609
>>>>>>>>    734.0
>>>>>>>>  11470.7
>>>>>>>>   3695.5
>>>>>>>>
>>>>>>>> function clogit_ll(beta::Vector)
>>>>>>>>
>>>>>>>>     # Print the parameters and the return value to
>>>>>>>>     # check how gradient() and optimize() work.
>>>>>>>>     println(beta) 
>>>>>>>>     println(-sum(compute_ll(beta,T,0)))
>>>>>>>>
>>>>>>>>     # compute_ll computes the individual likelihood contributions
>>>>>>>>     # in the sample. T is the number of periods in the panel. The 0
>>>>>>>>     # is not used in this simple example. In related functions, I
>>>>>>>>     # pass on different values here to estimate finite mixtures of
>>>>>>>>     # the conditional logit model.
>>>>>>>>     return -sum(compute_ll(beta,T,0))
>>>>>>>> end
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Med venlig hilsen
>>>>>>>
>>>>>>> Andreas Noack Jensen
>>>>>>>  
>>>>>>>
>>>>>>>
>>>>>>
>

Re: [julia-users] Optim.jl: unexpected results when using gradient-based methods

Reply via email to