Hi, I'm having real trouble believing that matrix copying is the major bottleneck in an optimization algorithm. Copying is O(N^2) operations. Surely, for any problem where performance matters, it is completely dwarfed by the O(N^3) complexity of actually solving the normal equation.
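[A back-of-the-envelope operation tally (my own, not from the thread) may help frame this: for a tall m x n Jacobian, copying costs on the order of m*n, forming the normal equations m*n^2, and the n x n decomposition only n^3, so for small n the copy is comparable to the solve rather than dwarfed by it. A sketch:]

```java
class OpCounts {
    // Rough floating-point operation counts for one Gauss-Newton step
    // on an m x n Jacobian (m observations, n parameters).
    static long copyOps(long m, long n)     { return m * n; }     // copy J
    static long normalEqOps(long m, long n) { return m * n * n; } // form J^T J
    static long solveOps(long n)            { return n * n * n; } // decompose n x n

    public static void main(String[] args) {
        long m = 1000, n = 2; // the CircleVectorial case discussed below
        System.out.println("copy J:     " + copyOps(m, n));     // 2000
        System.out.println("form J^T J: " + normalEqOps(m, n)); // 4000
        System.out.println("solve:      " + solveOps(n));       // 8
    }
}
```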
Also, I think testing should be done on an actual large problem where scaling issues would show up. The 1000x2 Jacobian results in a 2x2 normal equation. Surely this is not a good test case.

Konstantin

On Tue, Feb 25, 2014 at 8:35 AM, Evan Ward <evan.w...@nrl.navy.mil> wrote:
>
> On 02/24/2014 05:47 PM, Gilles wrote:
> > On Mon, 24 Feb 2014 14:41:34 -0500, Evan Ward wrote:
> >> On 02/24/2014 01:23 PM, Gilles wrote:
> >>> On Mon, 24 Feb 2014 11:49:26 -0500, Evan Ward wrote:
> >>>> One way to improve performance would be to provide pre-allocated space
> >>>> for the Jacobian and reuse it for each evaluation.
> >>>
> >>> Do you have actual data to back this statement?
> >>
> >> I did some tests with the CircleVectorial problem from the test cases.
> >> The Jacobian is 1000x2, and I ran it 1000 times until HotSpot stopped
> >> making it run faster. The first column is the current state of the code.
> >> The second column is with one less matrix allocation per problem
> >> evaluation.
> >>
> >>        trunk    -1 alloc   %change
> >> lu     0.90 s   0.74 s     -17%
> >> chol   0.90 s   0.74 s     -17%
> >> qr     0.87 s   0.70 s     -20%
> >>
> >> I also see similar reductions in runtime using 1e6 observations and 3
> >> observations. We could save 3-4 allocations per evaluation, which
> >> extrapolates to a 60%-80% reduction in run time.
> >
> > I would not have expected such a big impact.
> > How did you set up the benchmark? Actually, it would be a sane starting
> > point to create a "performance" unit test (see e.g.
> > "FastMathTestPerformance" in package "o.a.c.m.util").
>
> The test I used: https://gist.github.com/wardev/3f183dfaaf297dc2462c
>
> I also modified CircleVectorial to use the MultivariateJacobianFunction
> to avoid the translation copies.
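[Evan's actual benchmark is in the gist linked above; as a rough, self-contained illustration of the pattern he describes (warm up until HotSpot stabilizes, then time fresh allocation against buffer reuse), something like the following could be used. The class and the fill function are illustrative stand-ins, not taken from the gist:]

```java
class AllocBenchmark {
    // Model evaluation that allocates a fresh Jacobian on every call.
    static double[][] evaluateFresh(int m, int n) {
        double[][] j = new double[m][n];
        fill(j);
        return j;
    }

    // Model evaluation that writes into a caller-owned buffer.
    static void evaluateInPlace(double[][] j) {
        fill(j);
    }

    // Stand-in for computing the Jacobian entries.
    static void fill(double[][] j) {
        for (int i = 0; i < j.length; i++)
            for (int k = 0; k < j[i].length; k++)
                j[i][k] = i + k;
    }

    public static void main(String[] args) {
        int m = 1000, n = 2, iters = 10_000;
        double[][] reusable = new double[m][n];
        // Warm up so HotSpot compiles both paths before timing.
        for (int i = 0; i < iters; i++) {
            evaluateFresh(m, n);
            evaluateInPlace(reusable);
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) evaluateFresh(m, n);
        long t1 = System.nanoTime();
        for (int i = 0; i < iters; i++) evaluateInPlace(reusable);
        long t2 = System.nanoTime();
        System.out.printf("fresh: %d us, in-place: %d us%n",
                (t1 - t0) / 1000, (t2 - t1) / 1000);
    }
}
```

[A dedicated harness such as JMH would give more trustworthy numbers; this only sketches the warm-up-then-time structure.]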
>
> >
> >>
> >>>> The
> >>>> LeastSquaresProblem interface would then be:
> >>>>
> >>>> void evaluate(RealVector point, RealVector resultResiduals, RealVector
> >>>> resultJacobian);
> >>>>
> >>>> I'm interested in hearing your ideas on other approaches to solve this
> >>>> issue. Or even if this is an issue worth solving.
> >>>
> >>> Not before we can be sure that in-place modification (rather than
> >>> reallocation) always provides a performance benefit.
> >>
> >> I would like to hear other ideas for improving the performance.
> >
> > Design-wise, it is quite ugly to modify input parameters. I think that it
> > could also hurt in the long run, by preventing other ways to improve
> > performance.
>
> I agree, it smells like Fortran. ;)
>
> > Why couldn't the reuse vs reallocate be delegated to implementations of
> > the "MultivariateJacobianFunction" interface?
>
> That is the change I made to produce the performance numbers. It works
> well as long as the application has a single thread. Once it is
> multi-threaded we would have to create a new Function for each thread.
> I'm reluctant to do that because I think not copying the whole problem
> for each thread is a nice feature.
>
> Really I think we need one matrix per optimization context. The question
> is: how does the Function, which is several method calls below the
> optimize() method, access the current optimization context? If we reuse
> the matrix at the Function level then we are implicitly tying the
> optimization context to the thread context. By passing the matrices as
> parameters we make the context explicit, but we clutter up the API.
> Perhaps there is a better solution we haven't thought of yet.
>
> > Eventually, doesn't it boil down to creating "RealVector" and "RealMatrix"
> > implementations that modify and return "this" rather than create a new
> > object?
>
> Interesting idea. I think this will remove the allocation+copy when
> applying weights.
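[A minimal sketch of the "explicit context" option Evan describes: the optimizer owns the work arrays and passes them down as parameters, so nothing is implicitly tied to a thread. The interface and the circle-style residual below are hypothetical simplifications using plain arrays, not the actual Commons Math CircleVectorial or MultivariateJacobianFunction:]

```java
// Hypothetical in-place API: the optimizer owns the work arrays and passes
// them down explicitly (this is NOT the actual Commons Math interface).
interface InPlaceJacobianFunction {
    void value(double[] point, double[] outResiduals, double[][] outJacobian);
}

// Simplified circle fit: parameters are (cx, cy, radius); the residual is
// the distance from each observed point to the center, minus the radius.
class CircleResiduals implements InPlaceJacobianFunction {
    private final double[][] obs; // observed (x, y) points

    CircleResiduals(double[][] obs) { this.obs = obs; }

    @Override
    public void value(double[] p, double[] r, double[][] j) {
        for (int i = 0; i < obs.length; i++) {
            double dx = p[0] - obs[i][0];
            double dy = p[1] - obs[i][1];
            double d = Math.sqrt(dx * dx + dy * dy);
            r[i] = d - p[2];       // residual written into caller's array
            j[i][0] = dx / d;      // d(residual)/d(cx)
            j[i][1] = dy / d;      // d(residual)/d(cy)
            j[i][2] = -1.0;        // d(residual)/d(radius)
        }
    }
}
```

[Because the same function instance carries no mutable state of its own, it can be shared across threads as long as each optimization context supplies its own buffers.]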
> That just leaves the case where matrix decompositions create a copy of
> the input matrix.
>
> Best Regards,
> Evan
>
> > Best Regards,
> > Gilles
> >
> >>>> Evan
> >>
> >> On 02/24/2014 12:09 PM, Luc Maisonobe wrote:
> >>> Hi Evan,
> >>>
> >>> On 24/02/2014 17:49, Evan Ward wrote:
> >>>> I've looked into improving performance further, but it seems any
> >>>> further improvements will need big API changes for memory management.
> >>>>
> >>>> Currently using Gauss-Newton with Cholesky (or LU) requires 4 matrix
> >>>> allocations _each_ evaluation. The objective function initially
> >>>> allocates the Jacobian matrix. Then the weights are applied through
> >>>> matrix multiplication, allocating a new matrix. Computing the normal
> >>>> equations allocates a new matrix to hold the result, and finally the
> >>>> decomposition allocates its own matrix as a copy. With QR there are 3
> >>>> matrix allocations per model function evaluation, since there is no
> >>>> need to compute the normal equations, but the third allocation+copy is
> >>>> larger. Some empirical sampling data I've collected with the jvisualvm
> >>>> tool indicates that matrix allocation and copying takes 30% to 80% of
> >>>> the execution time, depending on the dimension of the Jacobian.
> >>>>
> >>>> One way to improve performance would be to provide pre-allocated space
> >>>> for the Jacobian and reuse it for each evaluation. The
> >>>> LeastSquaresProblem interface would then be:
> >>>>
> >>>> void evaluate(RealVector point, RealVector resultResiduals, RealVector
> >>>> resultJacobian);
> >>>>
> >>>> I'm interested in hearing your ideas on other approaches to solve this
> >>>> issue. Or even if this is an issue worth solving.
> >>> Yes, I think this issue is worth solving, especially since we are going
> >>> to ship 3.3 and need to fix as much as possible before the release,
> >>> thus avoiding future problems. Everything spotted now is worth fixing
> >>> now.
> >>>
> >>> Your approach seems reasonable, as long as the work arrays are really
> >>> allocated at the start of the optimization and shared only through a
> >>> few documented methods like the one you propose. This would mean we can
> >>> say in the javadoc that these areas should be used only to fulfill the
> >>> API requirements and not copied elsewhere, as they *will* be modified
> >>> as the algorithm runs, and are explicitly devoted to avoiding
> >>> reallocation. I guess this kind of problem is more important when lots
> >>> of observations are processed, which corresponds to a very frequent use
> >>> case (at least in the fields I know about).
> >>>
> >>> For the record, what you propose seems similar to what is done in the
> >>> ODE package, as the state vector and its first derivatives are also
> >>> kept in preallocated arrays which are reused throughout the integration
> >>> and are used to exchange data between the Apache Commons Math algorithm
> >>> and the user problem to be solved. So it is something we already do
> >>> elsewhere.
> >>
> >> OK. We could keep the Evaluation interface, which would just reference
> >> the pre-allocated residuals and matrix. If the result parameters are
> >> null the LSP could allocate a matrix of the correct size automatically.
> >> So then the interface would look like:
> >>
> >> Evaluation evaluate(RealVector point, RealVector resultResiduals,
> >> RealVector resultJacobian);
> >>
> >>> best regards,
> >>> Luc
> >>>
> >>>> Best Regards,
> >>>> Evan
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> >>>> For additional commands, e-mail: dev-h...@commons.apache.org
> >>
> >> On 02/24/2014 01:16 PM, Gilles wrote:
> >>> On Mon, 24 Feb 2014 18:09:26 +0100, Luc Maisonobe wrote:
> >>>> Hi Evan,
> >>>>
> >>>> On 24/02/2014 17:49, Evan Ward wrote:
> >>>>> I've looked into improving performance further, but it seems any
> >>>>> further improvements will need big API changes for memory management.
> >>>>>
> >>>>> Currently using Gauss-Newton with Cholesky (or LU) requires 4 matrix
> >>>>> allocations _each_ evaluation. The objective function initially
> >>>>> allocates the Jacobian matrix. Then the weights are applied through
> >>>>> matrix multiplication, allocating a new matrix. Computing the normal
> >>>>> equations allocates a new matrix to hold the result, and finally the
> >>>>> decomposition allocates its own matrix as a copy. With QR there are 3
> >>>>> matrix allocations per model function evaluation, since there is no
> >>>>> need to compute the normal equations, but the third allocation+copy
> >>>>> is larger. Some empirical sampling data I've collected with the
> >>>>> jvisualvm tool indicates that matrix allocation and copying takes 30%
> >>>>> to 80% of the execution time, depending on the dimension of the
> >>>>> Jacobian.
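[The nullable-parameter idea Evan proposes could look roughly like this: a hypothetical class with plain arrays standing in for RealVector/RealMatrix. When the caller passes null, the problem allocates a correctly sized buffer itself; otherwise it reuses the one supplied:]

```java
// Hypothetical sketch only; plain arrays stand in for RealVector/RealMatrix.
class LeastSquaresProblemSketch {
    private final int m; // number of observations
    private final int n; // number of parameters

    LeastSquaresProblemSketch(int m, int n) {
        this.m = m;
        this.n = n;
    }

    // If the caller supplies a Jacobian buffer, reuse it; otherwise
    // allocate one of the correct size automatically.
    double[][] evaluate(double[] point, double[][] resultJacobian) {
        double[][] j = (resultJacobian != null) ? resultJacobian
                                                : new double[m][n];
        // ... fill j from the model function here ...
        return j;
    }
}
```

[Callers that care about allocation pressure pass a preallocated buffer; casual callers pass null and get the current behavior.]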
> >>>>>
> >>>>> One way to improve performance would be to provide pre-allocated
> >>>>> space for the Jacobian and reuse it for each evaluation. The
> >>>>> LeastSquaresProblem interface would then be:
> >>>>>
> >>>>> void evaluate(RealVector point, RealVector resultResiduals,
> >>>>> RealVector resultJacobian);
> >>>>>
> >>>>> I'm interested in hearing your ideas on other approaches to solve
> >>>>> this issue. Or even if this is an issue worth solving.
> >>>>
> >>>> Yes, I think this issue is worth solving, especially since we are
> >>>> going to ship 3.3 and need to fix as much as possible before the
> >>>> release, thus avoiding future problems. Everything spotted now is
> >>>> worth fixing now.
> >>>>
> >>>> Your approach seems reasonable, as long as the work arrays are really
> >>>> allocated at the start of the optimization and shared only through a
> >>>> few documented methods like the one you propose. This would mean we
> >>>> can say in the javadoc that these areas should be used only to fulfill
> >>>> the API requirements and not copied elsewhere, as they *will* be
> >>>> modified as the algorithm runs, and are explicitly devoted to avoiding
> >>>> reallocation. I guess this kind of problem is more important when lots
> >>>> of observations are processed, which corresponds to a very frequent
> >>>> use case (at least in the fields I know about).
> >>>>
> >>>> For the record, what you propose seems similar to what is done in the
> >>>> ODE package, as the state vector and its first derivatives are also
> >>>> kept in preallocated arrays which are reused throughout the
> >>>> integration and are used to exchange data between the Apache Commons
> >>>> Math algorithm and the user problem to be solved. So it is something
> >>>> we already do elsewhere.
> >>>
> >>> If I understand correctly what is being discussed, I do not agree with
> >>> this approach.
> >>>
> >>> The optimization/fitting algorithms must use matrix abstractions.
> >>> If performance improvements can be achieved, they must happen at the
> >>> level of the appropriate matrix implementations.
> >>
> >> The matrix abstractions will still be used in the interface. As far as I
> >> can tell none of the optimizers or linear algebra classes use the matrix
> >> abstractions internally. For example LU, QR, and Cholesky all copy the
> >> matrix data to an internal double[][]. I tried computing the normal
> >> equation in GaussNewton as j.transpose().multiply(j), but the
> >> performance was bad because j.transpose() creates a copy of the matrix.
> >> That's why we have the current ugly for-loop implementation with
> >> getEntry() and setEntry(). Maybe matrix "views" could help solve the
> >> issue.
> >>
> >>> Best regards,
> >>> Gilles
> >>
> >> Regards,
> >> Evan
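[The loop Evan describes can avoid both the transpose copy and the intermediate multiply allocation. A sketch with raw double[][] (not the Commons Math RealMatrix API) that forms J^T J in a single pass over the rows of J, using symmetry to halve the work:]

```java
class NormalEquations {
    // Form N = J^T J directly from J, without materializing J^T.
    // Accumulates only the upper triangle, then mirrors it.
    static double[][] normalMatrix(double[][] j) {
        int m = j.length, n = j[0].length;
        double[][] nMat = new double[n][n];
        for (int i = 0; i < m; i++) {
            double[] row = j[i]; // one row of J contributes row^T * row
            for (int a = 0; a < n; a++)
                for (int b = a; b < n; b++)
                    nMat[a][b] += row[a] * row[b];
        }
        // Mirror the upper triangle into the lower triangle.
        for (int a = 0; a < n; a++)
            for (int b = 0; b < a; b++)
                nMat[a][b] = nMat[b][a];
        return nMat;
    }
}
```

[Iterating row-wise also matches the memory layout of a double[][], so it is cache-friendly for the tall, thin Jacobians discussed in this thread.]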