Yeah, this case is tricky enough that we really need to get down to the lowest details:
(1) Do Julia and Matlab perform similar numbers of function evaluations?

(2) If they don't perform similar numbers of function evaluations, is one of them producing a better solution? Is the one that's producing a better solution doing more function evaluations?

(3) If they're doing different numbers of function evaluations and the one that does _fewer_ evaluations also produces a better solution, what's the reason? For example, is our line search default less effective for this problem than the Matlab line search? If you try other line search algorithms, do the results stay the same?

Unfortunately, answering all of these reliably may take us all pretty far down the rabbit hole. But they're worth pushing on systematically.

 -- John

On May 22, 2014, at 2:59 PM, Miles Lubin <[email protected]> wrote:

> I can get another 50% speedup by:
>
> - Running the optimization twice and timing the second run only; this is the
>   more appropriate way to benchmark Julia because it excludes the function
>   compilation time
> - Setting autodiff=true
> - Breaking up the long chains of sums, which apparently are slow
>
> At this point one really needs to compare the number of function evaluations
> in each method, as John suggested.
>
> On Thursday, May 22, 2014 9:53:36 AM UTC-4, Holger Stichnoth wrote:
> Thanks, it's faster now (by roughly a factor of 3 on my computer), but still
> considerably slower than fminunc:
>
> Averages over 20 runs:
> Julia/Optim.optimize: 10.5s
> Matlab/fminunc: 2.6s
>
> Here are my Matlab settings:
>
> options = optimset('Display', 'iter', ...
>     'MaxIter', 2500, 'MaxFunEvals', 500000, ...
>     'TolFun', 1e-6, 'TolX', 1e-6, ...
>     'GradObj', 'off', 'DerivativeCheck', 'off');
>
> startb = ones(1,nVar)';
> [estim_clo, ll_clo] = ...
>     fminunc(@(param)clogit_ll(param,data), ...
>     startb, options);
>
> Could the speed issue be related to the following messages that I get when I
> run the Julia code?
> C:\Users\User\Documents\References\Software\Julia\mlubin>julia main.jl
> Warning: could not import Base.foldl into NumericExtensions
> Warning: could not import Base.foldr into NumericExtensions
> Warning: could not import Base.sum! into NumericExtensions
> Warning: could not import Base.maximum! into NumericExtensions
> Warning: could not import Base.minimum! into NumericExtensions
>
> On Thursday, 22 May 2014 14:18:36 UTC+1, Miles Lubin wrote:
> I was able to get a nearly 5x speedup by avoiding the matrix allocation and
> making the accumulators type stable:
> https://gist.github.com/mlubin/055690ddf2466e98bba6
>
> How does this compare with Matlab now?
>
> On Thursday, May 22, 2014 6:38:44 AM UTC-4, Holger Stichnoth wrote:
> @John: You are right: when I specify the function as clogit_ll(beta::Vector)
> instead of clogit_ll(beta::Vector{Float64}), autodiff = true works fine.
> Thanks for your help!
>
> @Tim: I have set the rather strict default convergence criteria of
> Optim.optimize to Matlab's default values for fminunc, but the speed
> difference is still there.
>
> @Miles/John: Getting rid of the global variables through closures and
> devectorizing made the optimization _slower_, not faster, in my case:
> https://gist.github.com/stichnoth/7f251ded83dcaa384273. I was surprised to
> see this, as I expected a speed increase myself.
>
> On Wednesday, 21 May 2014 16:48:51 UTC+1, Miles Lubin wrote:
> Just to extend on what John said, I also think that if you can restructure
> the code to devectorize it and avoid using global variables, you'll see
> *much* better performance.
>
> The way to avoid globals is by using closures, for example:
>
> function foo(x, data)
>     ...
> end
>
> ...
> data_raw = readcsv(file)
> data = reshape(data_raw, nObs, nChoices*(1+nVar), T)
>
> Optim.optimize(x -> foo(x, data), ...)
>
> On Tuesday, May 20, 2014 11:47:39 AM UTC-4, John Myles White wrote:
> Glad that you were able to figure out the source of your problems.
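[Editorial sketch] Miles's closure pattern quoted above can be made runnable. The dimensions and the objective body below are placeholders invented for illustration (the thread's real clogit_ll and data are not shown in full here):

```julia
# Sketch of the closure pattern Miles describes: `data` is captured by an
# anonymous function, so the objective needs no global variables. The
# dimensions and the objective body are placeholders, not the thread's model.
nObs, nChoices, nVar, T = 4, 3, 2, 5
data = reshape(collect(1.0:(nObs * nChoices * (1 + nVar) * T)),
               nObs, nChoices * (1 + nVar), T)

foo(x, data) = sum(data) * sum(abs2, x)   # stand-in objective

obj = x -> foo(x, data)   # this closure is what gets passed to the optimizer
obj([1.0, 2.0])
```

Because `obj` closes over `data`, the optimizer only ever sees a one-argument function, and no global lookup happens inside the hot loop.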
>
> It would be good to get a sense of the amount of time spent inside your
> objective function vs. the amount of time spent in the code for optimize().
> In general, my experience is that >>90% of the compute time for an
> optimization problem is spent in the objective function itself. If you
> instrument your objective function to produce timing information on each
> call, that would help a lot, since you could then get a sense of how much
> time is being spent in the code for optimize() after accounting for your
> function itself.
>
> It's also worth keeping in mind that your use of implicit finite
> differencing means that your objective function is being called a lot more
> times than theoretically necessary, so any minor performance issue in it
> will very substantially slow down the solver.
>
> Regarding your objective function's code, I suspect that the combination of
> global variables and memory-allocating vectorized arithmetic means that your
> objective function might be a good bit slower in Julia than in Matlab.
> Matlab seems to be a little better about garbage collection for vectorized
> arithmetic, and Julia is generally not able to optimize code involving
> global variables.
>
> Hope that points you in the right direction.
>
> -- John
>
> On May 20, 2014, at 8:34 AM, Holger Stichnoth <[email protected]> wrote:
>
>> Hi Andreas, hi John, hi Miles (via julia-opt, where I mistakenly also
>> posted my question yesterday),
>>
>> Thanks for your help. Here is the link to the Gist I created:
>> https://gist.github.com/anonymous/5f95ab1afd241c0a5962
>>
>> In the process of producing a minimal (non-)working example, I discovered
>> that the unexpected results are due to the truncation of the logit choice
>> probabilities in the objective function. Optim.optimize() is sensitive to
>> this when method = :l_bfgs is used. With method = :nelder_mead, everything
>> works fine. When I comment out the truncation, :l_bfgs works as well.
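[Editorial sketch] John's two suggestions — instrumenting the objective for timing, and watching how often finite differencing calls it — can both be done with a small wrapper closure. The helper names below are invented for illustration; they are not part of Optim.jl:

```julia
# Sketch: wrap an objective in a closure that counts calls and accumulates
# the time spent inside it, so total runtime can be split into "inside the
# objective" vs. "inside optimize() itself".
function make_instrumented(f)
    calls = [0]          # number of evaluations
    total = [0.0]        # seconds spent inside f
    function instrumented(x)
        t0 = time()
        v = f(x)
        total[1] += time() - t0
        calls[1] += 1
        return v
    end
    return instrumented, calls, total
end

slow_obj(x) = sum(abs2, x)            # placeholder for the thread's clogit_ll
f, calls, total = make_instrumented(slow_obj)
f([1.0, 2.0, 3.0])
# calls[1] now holds the evaluation count; comparing it between Julia and
# Matlab answers John's earlier question about function evaluations, and
# subtracting total[1] from the wall-clock time of the whole optimization
# gives the overhead of the solver itself.
```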
>> However, I need to increase xtol from its default of 1e-12 to at least
>> 1e-10; otherwise I get the error that the line search failed to converge.
>>
>> I guess I should just do without the truncation. The logit probabilities
>> are between 0 and 1 by construction anyway. I had just copied the
>> truncation code from a friend who had told me that probabilities that are
>> too close to 0 or 1 sometimes cause numerical problems in his Matlab code
>> of the same function. With Optim.optimize(), it seems to be the other way
>> around, i.e. moving the probabilities further away from 0 or 1 (even by
>> tiny amounts) reduces the stability of the (gradient-based) algorithm.
>>
>> So for me, the problem is solved. The problem was not with Optim.jl, but
>> with my own code.
>>
>> The only other thing I discovered when trying out Julia and Optim.jl is
>> that the optimization is currently considerably slower than Matlab's
>> fminunc. From the Gist I provided above, are there any potential
>> performance improvements that I am missing out on?
>>
>> Best wishes,
>> Holger
>>
>> On Monday, 19 May 2014 14:51:16 UTC+1, John Myles White wrote:
>> If you can, please do share an example of your code. Logit-style models
>> are in general numerically unstable, so it would be good to see how
>> exactly you've coded things up.
>>
>> One thing you may be able to do is use automatic differentiation via the
>> autodiff = true keyword to optimize, but that assumes that your objective
>> function is written in completely pure Julia code (which means, for
>> example, that your code must not call any of the functions provided by
>> Distributions.jl that are not written in Julia).
>>
>> -- John
>>
>> On May 19, 2014, at 4:09 AM, Andreas Noack Jensen <[email protected]>
>> wrote:
>>
>>> What is the output of versioninfo() and Pkg.installed("Optim")? Also,
>>> would it be possible to make a gist with your code?
>>> >>> >>> 2014-05-19 12:44 GMT+02:00 Holger Stichnoth <[email protected]>: >>> Hello, >>> >>> I installed Julia a couple of days ago and was impressed how easy it was to >>> make the switch from Matlab and to parallelize my code >>> (something I had never done before in any language; I'm an economist with >>> only limited programming experience, mainly in Stata and Matlab). >>> >>> However, I ran into a problem when using Optim.jl for Maximum Likelihood >>> estimation of a conditional logit model. With the default Nelder-Mead >>> algorithm, optimize from the Optim.jl package gave me the same result that >>> I had obtained in Stata and Matlab. >>> >>> With gradient-based methods such as BFGS, however, the algorithm jumped >>> from the starting values to parameter values that are completely different. >>> This happened for all thr starting values I tried, including the case in >>> which I took a vector that is closed to the optimum from the Nelder-Mead >>> algorithm. >>> >>> The problem seems to be that the algorithm tried values so large (in >>> absolute value) that this caused problems for the objective >>> function, where I call exponential functions into which these parameter >>> values enter. As a result, the optimization based on the BFGS algorithm did >>> not produce the expected optimum. >>> >>> While I could try to provide the analytical gradient in this simple case, I >>> was planning to use Julia for Maximum Likelihood or Simulated Maximum >>> Likelihood estimation in cases where the gradient is more difficult to >>> derive, so it would be good if I could make the optimizer run also with >>> numerical gradients. >>> >>> I suspect that my problems with optimize from Optim.jl could have something >>> to do with the gradient() function. In the example below, for instance, I >>> do not understand why the output of the gradient function includes values >>> such as 11470.7, given that the function values differ only minimally. 
>>>
>>> Best wishes,
>>> Holger
>>>
>>> julia> Optim.gradient(clogit_ll, zeros(4))
>>> 6.0554544523933395e-6
>>> 0
>>> 0
>>> 0
>>>
>>> 14923.564009972584
>>> -6.0554544523933395e-6
>>> 0
>>> 0
>>> 0
>>>
>>> 14923.565228435104
>>> 0
>>> 6.0554544523933395e-6
>>> 0
>>> 0
>>>
>>> 14923.569064311248
>>> 0
>>> -6.0554544523933395e-6
>>> 0
>>> 0
>>>
>>> 14923.560174904109
>>> 0
>>> 0
>>> 6.0554544523933395e-6
>>> 0
>>>
>>> 14923.63413848258
>>> 0
>>> 0
>>> -6.0554544523933395e-6
>>> 0
>>>
>>> 14923.495218282553
>>> 0
>>> 0
>>> 0
>>> 6.0554544523933395e-6
>>>
>>> 14923.58699717058
>>> 0
>>> 0
>>> 0
>>> -6.0554544523933395e-6
>>>
>>> 14923.54224130672
>>> 4-element Array{Float64,1}:
>>>   -100.609
>>>    734.0
>>>  11470.7
>>>   3695.5
>>>
>>> function clogit_ll(beta::Vector)
>>>
>>>     # Print the parameters and the return value to
>>>     # check how gradient() and optimize() work.
>>>     println(beta)
>>>     println(-sum(compute_ll(beta,T,0)))
>>>
>>>     # compute_ll computes the individual likelihood contributions
>>>     # in the sample. T is the number of periods in the panel. The 0
>>>     # is not used in this simple example. In related functions, I
>>>     # pass on different values here to estimate finite mixtures of
>>>     # the conditional logit model.
>>>     return -sum(compute_ll(beta,T,0))
>>> end
>>>
>>> --
>>> Kind regards,
>>>
>>> Andreas Noack Jensen
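[Editorial sketch] The printout above is consistent with central finite differencing using step h = 6.0554544523933395e-6: for the third coordinate, (14923.63413848258 - 14923.495218282553) / (2h) ≈ 11470.7, so a function-value change of only about 0.14 legitimately yields a gradient entry in the tens of thousands. A sketch of that computation (not Optim.jl's actual implementation):

```julia
# Sketch: central finite-difference gradient with the step size seen in
# the printout above. Dividing a small function-value difference by 2h
# is what turns a change of ~0.14 into the 11470.7 reported above.
function fd_gradient(f, x; h = 6.0554544523933395e-6)
    g = similar(x)
    for i in eachindex(x)
        xp = copy(x); xp[i] += h
        xm = copy(x); xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2h)
    end
    return g
end

fd_gradient(x -> sum(abs2, x), [1.0, 2.0])   # close to the exact gradient [2.0, 4.0]
```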
