On Mon, 24 Feb 2014 14:41:34 -0500, Evan Ward wrote:
On 02/24/2014 01:23 PM, Gilles wrote:
On Mon, 24 Feb 2014 11:49:26 -0500, Evan Ward wrote:
One way to improve performance would be to provide pre-allocated space
for the Jacobian and reuse it for each evaluation.

Do you have actual data to back this statement?

I did some tests with the CircleVectorial problem from the test cases.
The Jacobian is 1000x2, and I ran it 1000 times until hotspot stopped
making it run faster. The first column is the current state of the code.
The second column is with one less matrix allocation each problem
evaluation.

        trunk     -1 alloc   %change
lu      0.90 s    0.74 s     -17%
chol    0.90 s    0.74 s     -17%
qr      0.87 s    0.70 s     -20%


I also see similar reductions in run time with 1e6 observations and with
3 observations. We could save 3-4 allocations per evaluation, which, at
roughly 17-20% each, extrapolates to a 60%-80% reduction in run time.

I would not have expected such a big impact.
How did you set up the benchmark? Actually, it would be a sane starting
point to create a "performance" unit test (see e.g. "FastMathTestPerformance"
in package "o.a.c.m.util").
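
Something along these lines, say (a minimal timing-loop sketch; the
constants and the Runnable-based setup are placeholders, not what
"FastMathTestPerformance" actually does):

    // Minimal timing-loop sketch; WARMUP/RUNS values and the
    // Runnable-based setup are placeholders.
    public final class LeastSquaresPerformanceTest {

        private static final int WARMUP = 1000; // let HotSpot finish compiling
        private static final int RUNS = 1000;   // measured iterations

        static long timeNanos(Runnable evaluation) {
            for (int i = 0; i < WARMUP; i++) {
                evaluation.run();               // warm-up pass, not measured
            }
            final long start = System.nanoTime();
            for (int i = 0; i < RUNS; i++) {
                evaluation.run();
            }
            return System.nanoTime() - start;
        }
    }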


The LeastSquaresProblem interface would then be:

void evaluate(RealVector point, RealVector resultResiduals,
    RealMatrix resultJacobian);
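
For concreteness, a sketch of how an optimizer loop could use it,
allocating the work storage once (only the Commons Math linear types are
real here; the interface and method names are hypothetical):

    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.linear.MatrixUtils;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.RealVector;

    // Hypothetical in-place variant of the problem interface.
    interface InPlaceProblem {
        void evaluate(RealVector point, RealVector resultResiduals,
                      RealMatrix resultJacobian);
    }

    class ReuseSketch {
        // Allocate the work arrays once, then fill them in place on
        // every iteration instead of reallocating.
        static void optimize(InPlaceProblem problem, RealVector point,
                             int observations, int parameters, int maxIter) {
            final RealVector residuals = new ArrayRealVector(observations);
            final RealMatrix jacobian =
                    MatrixUtils.createRealMatrix(observations, parameters);
            for (int i = 0; i < maxIter; i++) {
                problem.evaluate(point, residuals, jacobian);
                // ... solve the linearized system and update point ...
            }
        }
    }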

I'm interested in hearing your ideas on other approaches to solve this
issue. Or even if this is an issue worth solving.

Not before we can be sure that in-place modification (rather than
reallocation) always provides a performance benefit.

I would like to hear other ideas for improving the performance.

Design-wise, it is quite ugly to modify input parameters. I think that it
could also hurt in the long run, by preventing other ways to improve
performance.

Why couldn't the reuse-vs-reallocate decision be delegated to
implementations of the "MultivariateJacobianFunction" interface?
Eventually, doesn't it boil down to creating "RealVector" and "RealMatrix"
implementations that modify and return "this" rather than create a new
object?
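
Something like the following sketch (the class is illustrative, and I
picked scalarMultiply as the mutating example):

    import org.apache.commons.math3.linear.AbstractRealMatrix;
    import org.apache.commons.math3.linear.RealMatrix;

    // Sketch only: a matrix that overrides one operation to mutate its
    // own storage and return "this" instead of allocating a copy.
    class InPlaceMatrix extends AbstractRealMatrix {
        private final double[][] data;

        InPlaceMatrix(int rows, int cols) { data = new double[rows][cols]; }

        @Override public int getRowDimension() { return data.length; }
        @Override public int getColumnDimension() { return data[0].length; }
        @Override public double getEntry(int r, int c) { return data[r][c]; }
        @Override public void setEntry(int r, int c, double v) { data[r][c] = v; }
        @Override public RealMatrix createMatrix(int r, int c) {
            return new InPlaceMatrix(r, c);
        }
        @Override public RealMatrix copy() {
            InPlaceMatrix m = new InPlaceMatrix(getRowDimension(),
                                                getColumnDimension());
            for (int r = 0; r < data.length; r++) {
                System.arraycopy(data[r], 0, m.data[r], 0, data[r].length);
            }
            return m;
        }

        // Unlike the default, scale the entries in place and return this.
        @Override public RealMatrix scalarMultiply(double d) {
            for (double[] row : data) {
                for (int c = 0; c < row.length; c++) {
                    row[c] *= d;
                }
            }
            return this;
        }
    }

Of course, any caller relying on the usual copy semantics of RealMatrix
operations would have to be kept away from such an instance.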


Best Regards,
Gilles


Evan


On 02/24/2014 12:09 PM, Luc Maisonobe wrote:
Hi Evan,

On 24/02/2014 17:49, Evan Ward wrote:
I've looked into improving performance further, but it seems any further
improvements will need big API changes for memory management.

Currently, using Gauss-Newton with Cholesky (or LU) requires 4 matrix
allocations on _each_ evaluation. The objective function initially
allocates the Jacobian matrix. Then the weights are applied through
matrix multiplication, allocating a new matrix. Computing the normal
equations allocates a new matrix to hold the result, and finally the
decomposition allocates its own matrix as a copy.

With QR there are 3 matrix allocations per model function evaluation,
since there is no need to compute the normal equations, but the third
allocation+copy is larger. Some empirical sampling data I've collected
with the jvisualvm tool indicates that matrix allocation and copying
takes 30% to 80% of the execution time, depending on the dimension of
the Jacobian.
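
In code, the chain looks roughly like this (my reconstruction with
placeholder names and data, not the actual GaussNewton source):

    import org.apache.commons.math3.linear.CholeskyDecomposition;
    import org.apache.commons.math3.linear.MatrixUtils;
    import org.apache.commons.math3.linear.RealMatrix;

    class AllocationSketch {
        // Marks where the four per-evaluation allocations land on the
        // Gauss-Newton + Cholesky path; the identity Jacobian is a
        // stand-in for a real model evaluation.
        static void evaluateOnce(RealMatrix weightSqrt) {
            RealMatrix j = MatrixUtils.createRealMatrix(   // 1: objective
                new double[][] { {1, 0}, {0, 1} });        //    allocates J
            RealMatrix wj = weightSqrt.multiply(j);        // 2: weighting
                                                           //    allocates a copy
            RealMatrix normal =
                wj.transpose().multiply(wj);               // 3: normal equations
                                                           //    (transpose copies too)
            CholeskyDecomposition dec =
                new CholeskyDecomposition(normal);         // 4: decomposition's
                                                           //    internal copy
            // ... dec.getSolver().solve(...) ...
        }
    }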

One way to improve performance would be to provide pre-allocated space
for the Jacobian and reuse it for each evaluation. The
LeastSquaresProblem interface would then be:

void evaluate(RealVector point, RealVector resultResiduals,
    RealMatrix resultJacobian);

I'm interested in hearing your ideas on other approaches to solve this
issue. Or even if this is an issue worth solving.
Yes, I think this issue is worth solving, especially since we are about to ship 3.3 and need to fix as much as possible before the release to avoid future problems. Everything spotted now is worth fixing now.

Your approach seems reasonable, as long as the work arrays are really allocated at the start of the optimization and shared only through a few documented methods like the one you propose. This would mean we can say in the javadoc that these areas should be used only to fulfill the API requirements and not copied elsewhere, as they *will* be modified as the algorithm runs, and are explicitly intended to avoid reallocation. I guess this kind of problem matters most when lots of observations are involved, which is a very frequent use case (at least in the fields I know about).

For the record, what you propose is similar to what is done in the ODE package: the state vector and its first derivatives are also kept in preallocated arrays which are reused throughout the integration, and are used to exchange data between the Apache Commons Math algorithm and the user's problem. So it is something we already do elsewhere.
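
For instance, the ODE contract already works this way: the integrator
owns the arrays, and the user fills yDot in place. A trivial example
(the decay equation is just an illustration):

    import org.apache.commons.math3.ode.FirstOrderDifferentialEquations;

    // The user problem writes into the integrator's preallocated yDot
    // array instead of returning a freshly allocated one.
    class ExponentialDecay implements FirstOrderDifferentialEquations {
        public int getDimension() {
            return 1;
        }
        public void computeDerivatives(double t, double[] y, double[] yDot) {
            yDot[0] = -0.5 * y[0]; // y' = -y/2, filled in place
        }
    }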

OK. We could keep the Evaluation interface, which would just reference
the pre-allocated residuals and matrix. If the result parameters are
null, the LSP could allocate a matrix of the correct size automatically.
The interface would then look like:

Evaluation evaluate(RealVector point, RealVector resultResiduals,
    RealMatrix resultJacobian);
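
A sketch of the null-handling, with simplified stand-ins for the real
Evaluation and LeastSquaresProblem types:

    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.linear.MatrixUtils;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.RealVector;

    // Simplified stand-in for the real Evaluation interface.
    interface Evaluation {
        RealVector getResiduals();
        RealMatrix getJacobian();
    }

    abstract class ProblemSketch {
        abstract int getObservationSize();
        abstract int getParameterSize();
        abstract Evaluation doEvaluate(RealVector point,
                                       RealVector residuals,
                                       RealMatrix jacobian);

        // Null result parameters mean "allocate for me"; otherwise the
        // caller-provided storage is filled and referenced by the
        // returned Evaluation.
        Evaluation evaluate(RealVector point,
                            RealVector resultResiduals,
                            RealMatrix resultJacobian) {
            if (resultResiduals == null) {
                resultResiduals = new ArrayRealVector(getObservationSize());
            }
            if (resultJacobian == null) {
                resultJacobian = MatrixUtils.createRealMatrix(
                        getObservationSize(), getParameterSize());
            }
            return doEvaluate(point, resultResiduals, resultJacobian);
        }
    }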


best regards,
Luc

Best Regards,
Evan




On 02/24/2014 01:16 PM, Gilles wrote:
On Mon, 24 Feb 2014 18:09:26 +0100, Luc Maisonobe wrote:
Hi Evan,

On 24/02/2014 17:49, Evan Ward wrote:
I've looked into improving performance further, but it seems any
further
improvements will need big API changes for memory management.

Currently, using Gauss-Newton with Cholesky (or LU) requires 4 matrix
allocations on _each_ evaluation. The objective function initially
allocates the Jacobian matrix. Then the weights are applied through
matrix multiplication, allocating a new matrix. Computing the normal
equations allocates a new matrix to hold the result, and finally the
decomposition allocates its own matrix as a copy.

With QR there are 3 matrix allocations per model function evaluation,
since there is no need to compute the normal equations, but the third
allocation+copy is larger. Some empirical sampling data I've collected
with the jvisualvm tool indicates that matrix allocation and copying
takes 30% to 80% of the execution time, depending on the dimension of
the Jacobian.

One way to improve performance would be to provide pre-allocated space
for the Jacobian and reuse it for each evaluation. The
LeastSquaresProblem interface would then be:

void evaluate(RealVector point, RealVector resultResiduals,
    RealMatrix resultJacobian);

I'm interested in hearing your ideas on other approaches to solve this
issue. Or even if this is an issue worth solving.

Yes, I think this issue is worth solving, especially since we are about to ship 3.3 and need to fix as much as possible before the release to avoid future problems. Everything spotted now is worth fixing now.

Your approach seems reasonable, as long as the work arrays are really allocated at the start of the optimization and shared only through a few documented methods like the one you propose. This would mean we can say in the javadoc that these areas should be used only to fulfill the API requirements and not copied elsewhere, as they *will* be modified as the algorithm runs, and are explicitly intended to avoid reallocation. I guess this kind of problem matters most when lots of observations are involved, which is a very frequent use case (at least in the fields I know about).

For the record, what you propose is similar to what is done in the ODE
package: the state vector and its first derivatives are also kept in
preallocated arrays which are reused throughout the integration, and are
used to exchange data between the Apache Commons Math algorithm and the
user's problem. So it is something we already do elsewhere.

If I understand correctly what is being discussed, I do not agree with
this approach.

The optimization/fitting algorithms must use matrix abstractions.
If performance improvements can be achieved, they must happen at the
level of the appropriate matrix implementations.


The matrix abstractions will still be used in the interface. As far as I
can tell, none of the optimizers or linear algebra classes use the matrix
abstractions internally. For example, LU, QR, and Cholesky all copy the
matrix data to an internal double[][]. I tried computing the normal
equations in GaussNewton as j.transpose().multiply(j), but the
performance was bad because j.transpose() creates a copy of the matrix.
That's why we have the current ugly for-loop implementation with
getEntry() and setEntry(). Maybe matrix "views" could help solve the issue.
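
For reference, the loop idea in isolation, forming J^T J directly
without materializing the transpose (my reconstruction, not the actual
trunk code):

    import org.apache.commons.math3.linear.MatrixUtils;
    import org.apache.commons.math3.linear.RealMatrix;

    class NormalEquations {
        // Accumulate J^T J entry by entry; only the upper triangle is
        // computed and then mirrored, and no transposed copy is made.
        static RealMatrix computeJTJ(RealMatrix j) {
            final int rows = j.getRowDimension();
            final int cols = j.getColumnDimension();
            final RealMatrix jTj = MatrixUtils.createRealMatrix(cols, cols);
            for (int c1 = 0; c1 < cols; c1++) {
                for (int c2 = c1; c2 < cols; c2++) {
                    double sum = 0;
                    for (int r = 0; r < rows; r++) {
                        sum += j.getEntry(r, c1) * j.getEntry(r, c2);
                    }
                    jTj.setEntry(c1, c2, sum);
                    jTj.setEntry(c2, c1, sum); // J^T J is symmetric
                }
            }
            return jTj;
        }
    }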


Best regards,
Gilles





Regards,
Evan


