Re: [julia-users] Optim.jl: unexpected results when using gradient-based methods

Thibaut Lamadon Sun, 22 Jun 2014 04:51:13 -0700

Hi guys, 

I wanted to look into this as well. The main issue I think is in the speed 
of the objective function. Running @time on the objective function 
suggested a large amount of byte allocation. Checking the type revealed 
that getting x and y from data would set their types to Any.


So I convert the data to {Float64,3}, and then I changed to only store 
cumulative sum, not the vector of likelihood. I run the objective function 
50 times.

without the convert I get a total time of 9.18s 
with the convert and original function I get 4.15s
with the convert and the new function I get 1.49s
with matlab, I get 0.64s 

so matlab still appears to be 2.5 times faster. But I am guessing matlab is 
using SIMD instructions when computing matrix multiplications. So we would 
need to try to use BLAS in julia with matrix multiplication to get a very 
good comparison.

Anyway, fixing the type of the input, and just summing inside the loop 
gives a 6x speed up. 

PS: Running the full optimization 
with the convert and the new function I get 4.8s 
with my matlab I get 4s 

I could not commit to Holger gist, so I forked 
it: https://gist.github.com/tlamadon/58c47c115f8cf2388e89
please check that I have not done anything stupid, but output seemed 
similar.

Holger, I hope you are having a good time at home (or in Paris?),  And a 
world cup note: Allez les bleus!

very best,

t.


On Tuesday, 27 May 2014 14:03:30 UTC+1, Holger Stichnoth wrote:
>
> Hi John, hi Miles,
>
> Thanks to both of you. I did not have time to look into this over the 
> weekend; I will do so in the next couple of days. I have already uploaded 
> the Matlab files for comparison: 
> https://gist.github.com/stichnoth/7f251ded83dcaa384273
>
> Holger
>
>
> On Thursday, 22 May 2014 23:03:58 UTC+1, John Myles White wrote:
>>
>> Yeah, this case is tricky enough that we really need to get down to the 
>> lowest details:
>>
>> (1) Do Julia and Matlab perform similar numbers of function evaluations?
>>
>> (2) If they don't perform similar numbers of function evaluations, is one 
>> of them producing a better solution? Is the one that's producing a better 
>> solution doing more function evaluations?
>>
>> (3) If they're doing different numbers of function evaluations and the 
>> one that does _fewer_ evaluations also produces a better solution, what's 
>> the reason? For example, is our line search default less effective for this 
>> problem than the Matlab line search? If you try other line search 
>> algorithms, do the results stay the same?
>>
>> Unfortunately, answering all of these reliably make take us all pretty 
>> far down the rabbit hole. But they're worth pushing on systematically.
>>
>>  -- John
>>
>> On May 22, 2014, at 2:59 PM, Miles Lubin <[email protected]> wrote:
>>
>> I can get another 50% speedup by:
>>
>> - Running the optimization twice and timing the second run only, this is 
>> the more appropriate way to benchmark julia because it excludes the 
>> function compilation time
>> - Setting autodiff=true
>> - Breaking up the long chains of sums, apparently these seem to be slow
>>
>> At this point one really needs to compare the number of function 
>> evaluations in each method, as John suggested.
>>
>> On Thursday, May 22, 2014 9:53:36 AM UTC-4, Holger Stichnoth wrote:
>>>
>>> Thanks, it's faster now (by roughly a factor of 3 on my computer), but 
>>> still considerably slower than fminunc:
>>>
>>> Averages over 20 runs:
>>> Julia/Optim.optimize: 10.5s
>>> Matlab/fminunc: 2.6s
>>>
>>> Here are my Matlab settings:
>>> options = optimset('Display', 'iter', ...
>>>      'MaxIter', 2500, 'MaxFunEvals', 500000, ...
>>>      'TolFun', 1e-6, 'TolX', 1e-6, ...
>>>      'GradObj', 'off', 'DerivativeCheck', 'off');
>>>
>>> startb      = ones(1,nVar)';
>>> [estim_clo, ll_clo]= ...
>>>      fminunc(@(param)clogit_ll(param,data), ...
>>>      startb,options);
>>>
>>> Could the speed issue be related to the following messages that I get 
>>> when I run the Julia code?
>>> C:\Users\User\Documents\References\Software\Julia\mlubin>julia main.jl
>>> Warning: could not import Base.foldl into NumericExtensions
>>> Warning: could not import Base.foldr into NumericExtensions
>>> Warning: could not import Base.sum! into NumericExtensions
>>> Warning: could not import Base.maximum! into NumericExtensions
>>> Warning: could not import Base.minimum! into NumericExtensions
>>>
>>>
>>>
>>>
>>> Am Donnerstag, 22. Mai 2014 14:18:36 UTC+1 schrieb Miles Lubin:
>>>>
>>>> I was able to get a nearly 5x speedup by avoiding the matrix allocation 
>>>> and making the accumulators type stable: 
>>>> https://gist.github.com/mlubin/055690ddf2466e98bba6
>>>>
>>>> How does this compare with Matlab now?
>>>>
>>>> On Thursday, May 22, 2014 6:38:44 AM UTC-4, Holger Stichnoth wrote:
>>>>>
>>>>> @ John: You are right, when I specify the function as 
>>>>> clogit_ll(beta::Vector) instead of clogit_ll(beta::Vector{Float64}), 
>>>>> autodiff = true works fine. Thanks for your help!
>>>>>
>>>>> @ Tim: I have set the rather strict default convergence criteria of 
>>>>> Optim.optimize to Matlab's default values for fminunc, but the speed 
>>>>> difference is still there.
>>>>>
>>>>> @ Miles/John: Getting rid of the global variables through closures and 
>>>>> devectorizing made the optimization _slower_ not faster in my case: 
>>>>> https://gist.github.com/stichnoth/7f251ded83dcaa384273. I was 
>>>>> surprised to see this as I expected a speed increase myself. 
>>>>>  
>>>>>
>>>>> Am Mittwoch, 21. Mai 2014 16:48:51 UTC+1 schrieb Miles Lubin:
>>>>>>
>>>>>> Just to extend on what John said, also think that if you can 
>>>>>> restructure the code to devectorize it and avoid using global variables, 
>>>>>> you'll see *much* better performance. 
>>>>>>
>>>>>> The way to avoid globals is by using closures, for example:
>>>>>> function foo(x, data)
>>>>>>     ...
>>>>>> end
>>>>>>
>>>>>>
>>>>>> ...
>>>>>> data_raw = readcsv(file)
>>>>>> data = reshape(data_raw, nObs, nChoices*(1+nVar), T)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Optim.optimize(x-> foo(x,data), ...)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tuesday, May 20, 2014 11:47:39 AM UTC-4, John Myles White wrote:
>>>>>>>
>>>>>>> Glad that you were able to figure out the source of your problems.
>>>>>>>
>>>>>>> It would be good to get a sense of the amount of time spent inside 
>>>>>>> your objective function vs. the amount of time spent in the code for 
>>>>>>> optimize(). In general, my experience is that >>90% of the compute time 
>>>>>>> for 
>>>>>>> an optimization problem is spent in the objective function itself. If 
>>>>>>> you 
>>>>>>> instrument your objective function to produce timing information on 
>>>>>>> each 
>>>>>>> call, that would help a lot since you could then get a sense of how 
>>>>>>> much 
>>>>>>> time is being spent in the code for optimize() after accounting for 
>>>>>>> your 
>>>>>>> function itself.
>>>>>>>
>>>>>>> It’s also worth keeping in mind that your use of implicit finite 
>>>>>>> differencing means that your objective function is being called a lot 
>>>>>>> more 
>>>>>>> times than theoretically necessary, so that any minor performance issue 
>>>>>>> in 
>>>>>>> it will very substantially slow down the solver.
>>>>>>>
>>>>>>> Regarding you objective function’s code, I suspect that the 
>>>>>>> combination of global variables and memory-allocating vectorized 
>>>>>>> arithmetic 
>>>>>>> means that your objective function might be a good bit slower in Julia 
>>>>>>> than 
>>>>>>> in Matlab. Matlab seems to be a little better about garbage collection 
>>>>>>> for 
>>>>>>> vectorized arithmetic and Julia is generally not able to optimize code 
>>>>>>> involving global variables.
>>>>>>>
>>>>>>> Hope that points you in the right direction.
>>>>>>>
>>>>>>>  — John
>>>>>>>
>>>>>>> On May 20, 2014, at 8:34 AM, Holger Stichnoth <[email protected]> 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Andreas,
>>>>>>> hi John,
>>>>>>> hi Miles (via julia-opt, where I mistakenly also posted my question 
>>>>>>> yesterday),
>>>>>>>
>>>>>>> Thanks for your help. Here is the link to the Gist I created: 
>>>>>>> https://gist.github.com/anonymous/5f95ab1afd241c0a5962
>>>>>>>
>>>>>>> In the process of producing a minimal (non-)working example, I 
>>>>>>> discovered that the unexpected results are due to the truncation of the 
>>>>>>> logit choice probabilities in the objective function. Optim.optimize() 
>>>>>>> is 
>>>>>>> sensitive to this when method = :l_bfgs is used. With method = 
>>>>>>> :nelder_mead, everything works fine. When I comment out the truncation, 
>>>>>>> :l_bfgs works as well. However, I need to increase the xtol from its 
>>>>>>> default of 1e-12 to at least 1e-10, otherwise I get the error that the 
>>>>>>> linesearch failed to converge.
>>>>>>>
>>>>>>> I guess I should just do without the truncation. The logit 
>>>>>>> probabilities are between 0 and 1 by construction anyway. I had just 
>>>>>>> copied 
>>>>>>> the truncation code from a friend who had told me that probabilities 
>>>>>>> that 
>>>>>>> are too close to 0 or 1 sometimes cause numerical problems in his 
>>>>>>> Matlab 
>>>>>>> code of the same function. With Optim.optimize(), it seems to be the 
>>>>>>> other 
>>>>>>> way around, i.e. moving the probabilities further away from 0 or 1 
>>>>>>> (even by 
>>>>>>> tiny amounts) means that the stability of the (gradient-based) 
>>>>>>> algorithm is 
>>>>>>> reduced.
>>>>>>>
>>>>>>> So for me, the problem is solved. The problem was not with Optim.jl, 
>>>>>>> but with my own code.
>>>>>>>
>>>>>>> The only other thing that I discovered when trying out Julia and 
>>>>>>> Optim.jl is that the optimization is currently considerably slower than 
>>>>>>> Matlab's fminunc. From the Gist I provided above, are there any 
>>>>>>> potential 
>>>>>>> performance improvements that I am missing out on?
>>>>>>>
>>>>>>> Best wishes,
>>>>>>> Holger
>>>>>>>
>>>>>>>
>>>>>>> On Monday, 19 May 2014 14:51:16 UTC+1, John Myles White wrote:
>>>>>>>>
>>>>>>>> If you can, please do share an example of your code. Logit-style 
>>>>>>>> models are in general numerically unstable, so it would be good to see 
>>>>>>>> how 
>>>>>>>> exactly you’ve coded things up.
>>>>>>>>
>>>>>>>> One thing you may be able to do is use automatic differentiation 
>>>>>>>> via the autodiff = true keyword to optimize, but that assumes that 
>>>>>>>> your 
>>>>>>>> objective function is written in completely pure Julia code (which 
>>>>>>>> means, 
>>>>>>>> for example, that your code must not call any of functions not written 
>>>>>>>> in 
>>>>>>>> Julia provided by Distributions.jl).
>>>>>>>>
>>>>>>>>  — John
>>>>>>>>
>>>>>>>> On May 19, 2014, at 4:09 AM, Andreas Noack Jensen <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>> What is the output of versioninfo() and Pkg.installed("Optim")? 
>>>>>>>> Also, would it be possible to make a gist with your code?
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-05-19 12:44 GMT+02:00 Holger Stichnoth <[email protected]>:
>>>>>>>>
>>>>>>>>>  Hello,
>>>>>>>>>
>>>>>>>>> I installed Julia a couple of days ago and was impressed how easy 
>>>>>>>>> it was to make the switch from Matlab and to parallelize my code
>>>>>>>>> (something I had never done before in any language; I'm an 
>>>>>>>>> economist with only limited programming experience, mainly in Stata 
>>>>>>>>> and 
>>>>>>>>> Matlab).
>>>>>>>>>
>>>>>>>>> However, I ran into a problem when using Optim.jl for Maximum 
>>>>>>>>> Likelihood estimation of a conditional logit model. With the default 
>>>>>>>>> Nelder-Mead algorithm, optimize from the Optim.jl package gave me the 
>>>>>>>>> same 
>>>>>>>>> result that I had obtained in Stata and Matlab.
>>>>>>>>>
>>>>>>>>> With gradient-based methods such as BFGS, however, the algorithm 
>>>>>>>>> jumped from the starting values to parameter values that are 
>>>>>>>>> completely 
>>>>>>>>> different. This happened for all thr starting values I tried, 
>>>>>>>>> including the 
>>>>>>>>> case in which I took a vector that is closed to the optimum from the 
>>>>>>>>> Nelder-Mead algorithm.  
>>>>>>>>>
>>>>>>>>> The problem seems to be that the algorithm tried values so large 
>>>>>>>>> (in absolute value) that this caused problems for the objective
>>>>>>>>> function, where I call exponential functions into which these 
>>>>>>>>> parameter values enter. As a result, the optimization based on the 
>>>>>>>>> BFGS 
>>>>>>>>> algorithm did not produce the expected optimum.
>>>>>>>>>
>>>>>>>>> While I could try to provide the analytical gradient in this 
>>>>>>>>> simple case, I was planning to use Julia for Maximum Likelihood or 
>>>>>>>>> Simulated Maximum Likelihood estimation in cases where the gradient 
>>>>>>>>> is more 
>>>>>>>>> difficult to derive, so it would be good if I could make the 
>>>>>>>>> optimizer run 
>>>>>>>>> also with numerical gradients.
>>>>>>>>>
>>>>>>>>> I suspect that my problems with optimize from Optim.jl could have 
>>>>>>>>> something to do with the gradient() function. In the example below, 
>>>>>>>>> for 
>>>>>>>>> instance, I do not understand why the output of the gradient function 
>>>>>>>>> includes values such as 11470.7, given that the function values 
>>>>>>>>> differ only 
>>>>>>>>> minimally.
>>>>>>>>>
>>>>>>>>> Best wishes,
>>>>>>>>> Holger
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> julia> Optim.gradient(clogit_ll,zeros(4))
>>>>>>>>> 60554544523933395e-22
>>>>>>>>> 0Op
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>>
>>>>>>>>> 14923.564009972584
>>>>>>>>> -60554544523933395e-22
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>>
>>>>>>>>> 14923.565228435104
>>>>>>>>> 0
>>>>>>>>> 60554544523933395e-22
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>>
>>>>>>>>> 14923.569064311248
>>>>>>>>> 0
>>>>>>>>> -60554544523933395e-22
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>>
>>>>>>>>> 14923.560174904109
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>> 60554544523933395e-22
>>>>>>>>> 0
>>>>>>>>>
>>>>>>>>> 14923.63413848258
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>> -60554544523933395e-22
>>>>>>>>> 0
>>>>>>>>>
>>>>>>>>> 14923.495218282553
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>> 60554544523933395e-22
>>>>>>>>>
>>>>>>>>> 14923.58699717058
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>> 0
>>>>>>>>> -60554544523933395e-22
>>>>>>>>>
>>>>>>>>> 14923.54224130672
>>>>>>>>> 4-element Array{Float64,1}:
>>>>>>>>>   -100.609
>>>>>>>>>    734.0
>>>>>>>>>  11470.7
>>>>>>>>>   3695.5
>>>>>>>>>
>>>>>>>>> function clogit_ll(beta::Vector)
>>>>>>>>>
>>>>>>>>>     # Print the parameters and the return value to
>>>>>>>>>     # check how gradient() and optimize() work.
>>>>>>>>>     println(beta) 
>>>>>>>>>     println(-sum(compute_ll(beta,T,0)))
>>>>>>>>>
>>>>>>>>>     # compute_ll computes the individual likelihood contributions
>>>>>>>>>     # in the sample. T is the number of periods in the panel. The 
>>>>>>>>> 0
>>>>>>>>>     # is not used in this simple example. In related functions, I
>>>>>>>>>     # pass on different values here to estimate finite mixtures of
>>>>>>>>>     # the conditional logit model.
>>>>>>>>>     return -sum(compute_ll(beta,T,0))
>>>>>>>>> end
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Med venlig hilsen
>>>>>>>>
>>>>>>>> Andreas Noack Jensen
>>>>>>>>  
>>>>>>>>
>>>>>>>>
>>>>>>>
>>

Re: [julia-users] Optim.jl: unexpected results when using gradient-based methods

Reply via email to