If x1, ..., x6 or coeff are Float64 arrays, then the initialization

    u1 = 0; u2 = 0; u3 = 0; u4 = 0; u5 = 0; u6 = 0

is problematic as soon as you get to

        for k=1:nVar
            u1 += x1[i + ni*( k-1 + nk* (t-1))]*coeff[k]
            u2 += x2[i + ni*( k-1 + nk* (t-1))]*coeff[k]
            u3 += x3[i + ni*( k-1 + nk* (t-1))]*coeff[k]
            u4 += x4[i + ni*( k-1 + nk* (t-1))]*coeff[k]
            u5 += x5[i + ni*( k-1 + nk* (t-1))]*coeff[k]
            u6 += x6[i + ni*( k-1 + nk* (t-1))]*coeff[k]
        end

because you're initializing them as integers, but after the first iteration they
become Float64, so their inferred type is a Union and the loop is no longer type
stable. A more careful approach is to do something like this:

    T = typeof(one(eltype(x1))*one(eltype(coeff)))
    TT = typeof(one(T) + one(T))
    u1 = u2 = u3 = u4 = u5 = u6 = zero(TT)
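
As a minimal illustration (the function names here are made up for the example,
not taken from the original code):

    # Type-unstable: `s` starts as an Int and becomes a Float64 on the first
    # iteration, so its inferred type is a Union.
    function unstable_sum(x::Vector{Float64})
        s = 0
        for v in x
            s += v
        end
        return s
    end

    # Type-stable: initialize the accumulator with the element type's zero.
    function stable_sum(x::Vector{Float64})
        s = zero(eltype(x))
        for v in x
            s += v
        end
        return s
    end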

In general, code_typed is your friend: look for Union types.

    T = Vector{Float64}
    code_typed(ll, (T, T, T, T, T, T, T, T, T, T, T, T, T, Float64, Int, Int))

and you'll see Union types all over the place. (TypeCheck can help too, but it
didn't seem to pick up this error.) And see the Performance Tips section of the
manual.

--Tim


On Sunday, June 22, 2014 04:50:16 AM Thibaut Lamadon wrote:
> Hi guys,
> 
> I wanted to look into this as well. The main issue I think is in the speed
> of the objective function. Running @time on the objective function
> suggested a large amount of byte allocation. Checking the type revealed
> that getting x and y from data would set their types to Any.
> 
> So I converted the data to Array{Float64,3}, and then I changed the function to
> only store the cumulative sum rather than the vector of likelihood contributions.
> I ran the objective function 50 times.
> 
> without the convert I get a total time of 9.18s
> with the convert and original function I get 4.15s
> with the convert and the new function I get 1.49s
> with matlab, I get 0.64s
> 
> so matlab still appears to be 2.5 times faster. But I am guessing matlab is
> using SIMD instructions when computing matrix multiplications. So we would
> need to try to use BLAS in julia with matrix multiplication to get a very
> good comparison.
> 
> Anyway, fixing the type of the input, and just summing inside the loop
> gives a 6x speed up.
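> 
> (For concreteness, a rough sketch of the two changes; the variable names are
> made up here, the real code is in the gist below:)
> 
>     # 1. fix the element type up front instead of leaving it as Any
>     data = convert(Array{Float64,3}, data)
> 
>     # 2. accumulate the log-likelihood directly instead of storing a vector
>     ll = 0.0
>     for i = 1:nObs
>         ll += loglik_i(data, beta, i)   # hypothetical per-observation term
>     end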
> 
> PS: Running the full optimization
> with the convert and the new function I get 4.8s
> with my matlab I get 4s
> 
> I could not commit to Holger's gist, so I forked
> it: https://gist.github.com/tlamadon/58c47c115f8cf2388e89
> please check that I have not done anything stupid, but output seemed
> similar.
> 
> Holger, I hope you are having a good time at home (or in Paris?). And a
> world cup note: Allez les bleus!
> 
> very best,
> 
> t.
> 
> On Tuesday, 27 May 2014 14:03:30 UTC+1, Holger Stichnoth wrote:
> > Hi John, hi Miles,
> > 
> > Thanks to both of you. I did not have time to look into this over the
> > weekend; I will do so in the next couple of days. I have already uploaded
> > the Matlab files for comparison:
> > https://gist.github.com/stichnoth/7f251ded83dcaa384273
> > 
> > Holger
> > 
> > On Thursday, 22 May 2014 23:03:58 UTC+1, John Myles White wrote:
> >> Yeah, this case is tricky enough that we really need to get down to the
> >> lowest details:
> >> 
> >> (1) Do Julia and Matlab perform similar numbers of function evaluations?
> >> 
> >> (2) If they don't perform similar numbers of function evaluations, is one
> >> of them producing a better solution? Is the one that's producing a better
> >> solution doing more function evaluations?
> >> 
> >> (3) If they're doing different numbers of function evaluations and the
> >> one that does _fewer_ evaluations also produces a better solution, what's
> >> the reason? For example, is our line search default less effective for
> >> this
> >> problem than the Matlab line search? If you try other line search
> >> algorithms, do the results stay the same?
> >> 
> >> Unfortunately, answering all of these reliably may take us all pretty
> >> far down the rabbit hole. But they're worth pushing on systematically.
> >> 
> >>  -- John
> >> 
> >> On May 22, 2014, at 2:59 PM, Miles Lubin <[email protected]> wrote:
> >> 
> >> I can get another 50% speedup by:
> >> 
> >> - Running the optimization twice and timing the second run only, this is
> >> the more appropriate way to benchmark julia because it excludes the
> >> function compilation time
> >> - Setting autodiff=true
> >> - Breaking up the long chains of sums, apparently these seem to be slow
> >> 
> >> At this point one really needs to compare the number of function
> >> evaluations in each method, as John suggested.
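> >> 
> >> (On the first point, the pattern I mean is roughly this, with `obj` standing
> >> in for whatever closure wraps the objective:)
> >> 
> >>     obj(startb)         # first call includes JIT compilation
> >>     @time obj(startb)   # second call gives a representative timing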
> >> 
> >> On Thursday, May 22, 2014 9:53:36 AM UTC-4, Holger Stichnoth wrote:
> >>> Thanks, it's faster now (by roughly a factor of 3 on my computer), but
> >>> still considerably slower than fminunc:
> >>> 
> >>> Averages over 20 runs:
> >>> Julia/Optim.optimize: 10.5s
> >>> Matlab/fminunc: 2.6s
> >>> 
> >>> Here are my Matlab settings:
> >>> options = optimset('Display', 'iter', ...
> >>>     'MaxIter', 2500, 'MaxFunEvals', 500000, ...
> >>>     'TolFun', 1e-6, 'TolX', 1e-6, ...
> >>>     'GradObj', 'off', 'DerivativeCheck', 'off');
> >>> 
> >>> startb = ones(1,nVar)';
> >>> [estim_clo, ll_clo] = ...
> >>>     fminunc(@(param)clogit_ll(param,data), ...
> >>>     startb, options);
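> >>> 
> >>> (For reference, a rough Julia counterpart would be something like the sketch
> >>> below; method = :l_bfgs and xtol are mentioned elsewhere in this thread, but
> >>> the other keyword names are my assumption:)
> >>> 
> >>>     startb = ones(nVar)
> >>>     res = Optim.optimize(clogit_ll, startb,
> >>>                          method = :l_bfgs,    # gradient-based, like fminunc
> >>>                          iterations = 2500,   # MaxIter
> >>>                          ftol = 1e-6,         # TolFun
> >>>                          xtol = 1e-6)         # TolX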
> >>> 
> >>> Could the speed issue be related to the following messages that I get
> >>> when I run the Julia code?
> >>> C:\Users\User\Documents\References\Software\Julia\mlubin>julia main.jl
> >>> Warning: could not import Base.foldl into NumericExtensions
> >>> Warning: could not import Base.foldr into NumericExtensions
> >>> Warning: could not import Base.sum! into NumericExtensions
> >>> Warning: could not import Base.maximum! into NumericExtensions
> >>> Warning: could not import Base.minimum! into NumericExtensions
> >>> 
> >>> Am Donnerstag, 22. Mai 2014 14:18:36 UTC+1 schrieb Miles Lubin:
> >>>> I was able to get a nearly 5x speedup by avoiding the matrix allocation
> >>>> and making the accumulators type stable:
> >>>> https://gist.github.com/mlubin/055690ddf2466e98bba6
> >>>> 
> >>>> How does this compare with Matlab now?
> >>>> 
> >>>> On Thursday, May 22, 2014 6:38:44 AM UTC-4, Holger Stichnoth wrote:
> >>>>> @ John: You are right, when I specify the function as
> >>>>> clogit_ll(beta::Vector) instead of clogit_ll(beta::Vector{Float64}),
> >>>>> autodiff = true works fine. Thanks for your help!
> >>>>> 
> >>>>> @ Tim: I have set the rather strict default convergence criteria of
> >>>>> Optim.optimize to Matlab's default values for fminunc, but the speed
> >>>>> difference is still there.
> >>>>> 
> >>>>> @ Miles/John: Getting rid of the global variables through closures and
> >>>>> devectorizing made the optimization _slower_ not faster in my case:
> >>>>> https://gist.github.com/stichnoth/7f251ded83dcaa384273. I was
> >>>>> surprised to see this as I expected a speed increase myself.
> >>>>> 
> >>>>> Am Mittwoch, 21. Mai 2014 16:48:51 UTC+1 schrieb Miles Lubin:
> >>>>>> Just to extend on what John said, also think that if you can
> >>>>>> restructure the code to devectorize it and avoid using global
> >>>>>> variables,
> >>>>>> you'll see *much* better performance.
> >>>>>> 
> >>>>>> The way to avoid globals is by using closures, for example:
> >>>>>> function foo(x, data)
> >>>>>>     ...
> >>>>>> end
> >>>>>> 
> >>>>>> ...
> >>>>>> data_raw = readcsv(file)
> >>>>>> data = reshape(data_raw, nObs, nChoices*(1+nVar), T)
> >>>>>> 
> >>>>>> Optim.optimize(x -> foo(x, data), ...)
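> >>>>>> 
> >>>>>> (And on the devectorizing point, a toy illustration of the idea rather than
> >>>>>> the actual likelihood code: replace an allocating vectorized expression with
> >>>>>> a loop that accumulates a scalar.)
> >>>>>> 
> >>>>>>     # vectorized: log(p) .* y allocates temporary arrays on every call
> >>>>>>     ll_vec(p, y) = -sum(log(p) .* y)
> >>>>>> 
> >>>>>>     # devectorized: one pass, no temporaries
> >>>>>>     function ll_devec(p, y)
> >>>>>>         s = 0.0
> >>>>>>         for j = 1:length(p)
> >>>>>>             s += log(p[j]) * y[j]
> >>>>>>         end
> >>>>>>         return -s
> >>>>>>     end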
> >>>>>> 
> >>>>>> On Tuesday, May 20, 2014 11:47:39 AM UTC-4, John Myles White wrote:
> >>>>>>> Glad that you were able to figure out the source of your problems.
> >>>>>>> 
> >>>>>>> It would be good to get a sense of the amount of time spent inside
> >>>>>>> your objective function vs. the amount of time spent in the code for
> >>>>>>> optimize(). In general, my experience is that >>90% of the compute
> >>>>>>> time for
> >>>>>>> an optimization problem is spent in the objective function itself.
> >>>>>>> If you
> >>>>>>> instrument your objective function to produce timing information on
> >>>>>>> each
> >>>>>>> call, that would help a lot since you could then get a sense of how
> >>>>>>> much
> >>>>>>> time is being spent in the code for optimize() after accounting for
> >>>>>>> your
> >>>>>>> function itself.
> >>>>>>> 
> >>>>>>> It’s also worth keeping in mind that your use of implicit finite
> >>>>>>> differencing means that your objective function is being called a
> >>>>>>> lot more
> >>>>>>> times than theoretically necessary, so that any minor performance
> >>>>>>> issue in
> >>>>>>> it will very substantially slow down the solver.
> >>>>>>> 
> >>>>>>> Regarding your objective function’s code, I suspect that the
> >>>>>>> combination of global variables and memory-allocating vectorized
> >>>>>>> arithmetic
> >>>>>>> means that your objective function might be a good bit slower in
> >>>>>>> Julia than
> >>>>>>> in Matlab. Matlab seems to be a little better about garbage
> >>>>>>> collection for
> >>>>>>> vectorized arithmetic and Julia is generally not able to optimize
> >>>>>>> code
> >>>>>>> involving global variables.
> >>>>>>> 
> >>>>>>> Hope that points you in the right direction.
> >>>>>>> 
> >>>>>>>  — John
> >>>>>>> 
> >>>>>>> On May 20, 2014, at 8:34 AM, Holger Stichnoth <[email protected]>
> >>>>>>> wrote:
> >>>>>>> 
> >>>>>>> Hi Andreas,
> >>>>>>> hi John,
> >>>>>>> hi Miles (via julia-opt, where I mistakenly also posted my question
> >>>>>>> yesterday),
> >>>>>>> 
> >>>>>>> Thanks for your help. Here is the link to the Gist I created:
> >>>>>>> https://gist.github.com/anonymous/5f95ab1afd241c0a5962
> >>>>>>> 
> >>>>>>> In the process of producing a minimal (non-)working example, I
> >>>>>>> discovered that the unexpected results are due to the truncation of
> >>>>>>> the
> >>>>>>> logit choice probabilities in the objective function.
> >>>>>>> Optim.optimize() is
> >>>>>>> sensitive to this when method = :l_bfgs is used. With method =
> >>>>>>> :nelder_mead, everything works fine. When I comment out the truncation,
> >>>>>>> :l_bfgs works as well. However, I need to increase the xtol from its
> >>>>>>> default of 1e-12 to at least 1e-10, otherwise I get the error that the
> >>>>>>> linesearch failed to converge.
> >>>>>>> 
> >>>>>>> I guess I should just do without the truncation. The logit
> >>>>>>> probabilities are between 0 and 1 by construction anyway. I had just
> >>>>>>> copied
> >>>>>>> the truncation code from a friend who had told me that probabilities
> >>>>>>> that
> >>>>>>> are too close to 0 or 1 sometimes cause numerical problems in his
> >>>>>>> Matlab
> >>>>>>> code of the same function. With Optim.optimize(), it seems to be the
> >>>>>>> other
> >>>>>>> way around, i.e. moving the probabilities further away from 0 or 1
> >>>>>>> (even by
> >>>>>>> tiny amounts) means that the stability of the (gradient-based)
> >>>>>>> algorithm is
> >>>>>>> reduced.
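> >>>>>>> 
> >>>>>>> (For concreteness, the kind of truncation I mean is of this general form;
> >>>>>>> the bounds here are only illustrative:)
> >>>>>>> 
> >>>>>>>     # keep each choice probability strictly inside (0, 1)
> >>>>>>>     pr = clamp(pr, 1e-10, 1.0 - 1e-10)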
> >>>>>>> 
> >>>>>>> So for me, the problem is solved. The problem was not with Optim.jl,
> >>>>>>> but with my own code.
> >>>>>>> 
> >>>>>>> The only other thing that I discovered when trying out Julia and
> >>>>>>> Optim.jl is that the optimization is currently considerably slower
> >>>>>>> than
> >>>>>>> Matlab's fminunc. From the Gist I provided above, are there any
> >>>>>>> potential
> >>>>>>> performance improvements that I am missing out on?
> >>>>>>> 
> >>>>>>> Best wishes,
> >>>>>>> Holger
> >>>>>>> 
> >>>>>>> On Monday, 19 May 2014 14:51:16 UTC+1, John Myles White wrote:
> >>>>>>>> If you can, please do share an example of your code. Logit-style
> >>>>>>>> models are in general numerically unstable, so it would be good to
> >>>>>>>> see how
> >>>>>>>> exactly you’ve coded things up.
> >>>>>>>> 
> >>>>>>>> One thing you may be able to do is use automatic differentiation
> >>>>>>>> via the autodiff = true keyword to optimize, but that assumes that
> >>>>>>>> your
> >>>>>>>> objective function is written in completely pure Julia code (which
> >>>>>>>> means,
> >>>>>>>> for example, that your code must not call any of the functions not
> >>>>>>>> written in Julia that are provided by Distributions.jl).
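> >>>>>>>> 
> >>>>>>>> (As a sketch of the call, reusing names that appear further down in this
> >>>>>>>> thread:)
> >>>>>>>> 
> >>>>>>>>     Optim.optimize(clogit_ll, zeros(4), method = :l_bfgs, autodiff = true)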
> >>>>>>>> 
> >>>>>>>>  — John
> >>>>>>>> 
> >>>>>>>> On May 19, 2014, at 4:09 AM, Andreas Noack Jensen <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>> 
> >>>>>>>> What is the output of versioninfo() and Pkg.installed("Optim")?
> >>>>>>>> Also, would it be possible to make a gist with your code?
> >>>>>>>> 
> >>>>>>>> 2014-05-19 12:44 GMT+02:00 Holger Stichnoth <[email protected]>:
> >>>>>>>>>  Hello,
> >>>>>>>>> 
> >>>>>>>>> I installed Julia a couple of days ago and was impressed how easy
> >>>>>>>>> it was to make the switch from Matlab and to parallelize my code
> >>>>>>>>> (something I had never done before in any language; I'm an
> >>>>>>>>> economist with only limited programming experience, mainly in
> >>>>>>>>> Stata and
> >>>>>>>>> Matlab).
> >>>>>>>>> 
> >>>>>>>>> However, I ran into a problem when using Optim.jl for Maximum
> >>>>>>>>> Likelihood estimation of a conditional logit model. With the
> >>>>>>>>> default
> >>>>>>>>> Nelder-Mead algorithm, optimize from the Optim.jl package gave me
> >>>>>>>>> the same
> >>>>>>>>> result that I had obtained in Stata and Matlab.
> >>>>>>>>> 
> >>>>>>>>> With gradient-based methods such as BFGS, however, the algorithm
> >>>>>>>>> jumped from the starting values to parameter values that are
> >>>>>>>>> completely
> >>>>>>>>> different. This happened for all the starting values I tried,
> >>>>>>>>> including the
> >>>>>>>>> case in which I took a vector that is close to the optimum from
> >>>>>>>>> the
> >>>>>>>>> Nelder-Mead algorithm.
> >>>>>>>>> 
> >>>>>>>>> The problem seems to be that the algorithm tried values so large
> >>>>>>>>> (in absolute value) that this caused problems for the objective
> >>>>>>>>> function, where I call exponential functions into which these
> >>>>>>>>> parameter values enter. As a result, the optimization based on the
> >>>>>>>>> BFGS
> >>>>>>>>> algorithm did not produce the expected optimum.
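> >>>>>>>>> 
> >>>>>>>>> (A standard guard against this, which is not in my code, is the
> >>>>>>>>> log-sum-exp trick: subtract the largest utility before exponentiating.)
> >>>>>>>>> 
> >>>>>>>>>     # sketch: log of the logit denominator without overflow, v = X*beta
> >>>>>>>>>     function logsumexp(v)
> >>>>>>>>>         m = maximum(v)
> >>>>>>>>>         s = 0.0
> >>>>>>>>>         for vi in v
> >>>>>>>>>             s += exp(vi - m)
> >>>>>>>>>         end
> >>>>>>>>>         return m + log(s)
> >>>>>>>>>     end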
> >>>>>>>>> 
> >>>>>>>>> While I could try to provide the analytical gradient in this
> >>>>>>>>> simple case, I was planning to use Julia for Maximum Likelihood or
> >>>>>>>>> Simulated Maximum Likelihood estimation in cases where the
> >>>>>>>>> gradient is more
> >>>>>>>>> difficult to derive, so it would be good if I could make the
> >>>>>>>>> optimizer run
> >>>>>>>>> also with numerical gradients.
> >>>>>>>>> 
> >>>>>>>>> I suspect that my problems with optimize from Optim.jl could have
> >>>>>>>>> something to do with the gradient() function. In the example
> >>>>>>>>> below, for
> >>>>>>>>> instance, I do not understand why the output of the gradient
> >>>>>>>>> function
> >>>>>>>>> includes values such as 11470.7, given that the function values
> >>>>>>>>> differ only
> >>>>>>>>> minimally.
> >>>>>>>>> 
> >>>>>>>>> Best wishes,
> >>>>>>>>> Holger
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> julia> Optim.gradient(clogit_ll,zeros(4))
> >>>>>>>>> 60554544523933395e-22
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 
> >>>>>>>>> 14923.564009972584
> >>>>>>>>> -60554544523933395e-22
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 
> >>>>>>>>> 14923.565228435104
> >>>>>>>>> 0
> >>>>>>>>> 60554544523933395e-22
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 
> >>>>>>>>> 14923.569064311248
> >>>>>>>>> 0
> >>>>>>>>> -60554544523933395e-22
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 
> >>>>>>>>> 14923.560174904109
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 60554544523933395e-22
> >>>>>>>>> 0
> >>>>>>>>> 
> >>>>>>>>> 14923.63413848258
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> -60554544523933395e-22
> >>>>>>>>> 0
> >>>>>>>>> 
> >>>>>>>>> 14923.495218282553
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 60554544523933395e-22
> >>>>>>>>> 
> >>>>>>>>> 14923.58699717058
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> 0
> >>>>>>>>> -60554544523933395e-22
> >>>>>>>>> 
> >>>>>>>>> 14923.54224130672
> >>>>>>>>> 
> >>>>>>>>> 4-element Array{Float64,1}:
> >>>>>>>>>   -100.609
> >>>>>>>>>   
> >>>>>>>>>    734.0
> >>>>>>>>>  
> >>>>>>>>>  11470.7
> >>>>>>>>>  
> >>>>>>>>>   3695.5
> >>>>>>>>> 
> >>>>>>>>> function clogit_ll(beta::Vector)
> >>>>>>>>> 
> >>>>>>>>>     # Print the parameters and the return value to
> >>>>>>>>>     # check how gradient() and optimize() work.
> >>>>>>>>>     println(beta)
> >>>>>>>>>     println(-sum(compute_ll(beta,T,0)))
> >>>>>>>>>     
> >>>>>>>>>     # compute_ll computes the individual likelihood contributions
> >>>>>>>>>     # in the sample. T is the number of periods in the panel. The 0
> >>>>>>>>>     # is not used in this simple example. In related functions, I
> >>>>>>>>>     # pass on different values here to estimate finite mixtures of
> >>>>>>>>>     # the conditional logit model.
> >>>>>>>>>     return -sum(compute_ll(beta,T,0))
> >>>>>>>>> 
> >>>>>>>>> end
> >>>>>>>> 
> >>>>>>>> Andreas Noack Jensen
