[ 
https://issues.apache.org/jira/browse/MATH-607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060782#comment-13060782
 ] 

greg sterijevski commented on MATH-607:
---------------------------------------

Sorry for duplicating part of my response, but gmail has truncated it (maybe 
google is telling me something about my ideas... ;0 )

My complete response is:

I agree on eliminating getRedundant() and isRedundant(int idx). If the 
underlying solver is QR or Gaussian this info would exist. If the underlying 
method is SVD, then we would register the rank reduction, but we would not be 
able to attribute it to a particular column in the design matrix.

I am probably in agreement with making RegressionResults concrete, but 
there were a couple of considerations which pushed me toward an interface.

Say that I begin with the following augmented matrix:
 | X'X    X'Y |
 | Y'X    Y'Y |
  where X is the design matrix (nobs x nreg) and Y is the dependent variable 
(nobs x 1).

On a copy of the cross products matrix (the thing above), I get the following 
via Gaussian elimination:

 | inv(X'X)    -beta |
 | -beta'       e'e  |

inv(X'X) is the inverse of the X'X matrix. beta is the OLS vector of slopes 
(stored negated here). e'e is the sum of squared errors.
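
To make the elimination step concrete, here is a minimal sketch of a sweep over the first nreg pivots of the cross products matrix. The class and method names are hypothetical, and the sign convention is an assumption: this particular variant leaves +beta in the last column and -beta' in the last row, whereas other sweep conventions negate different blocks.

```java
public final class SweepSketch {

    /**
     * Sweeps the (nreg+1) x (nreg+1) cross products matrix in place on
     * pivots 0..nreg-1. Afterwards the top-left block is inv(X'X), the
     * last column holds beta, the last row holds -beta', and the
     * bottom-right entry is e'e (under this sketch's sign convention).
     */
    public static void sweep(double[][] a, int nreg) {
        int n = a.length;
        for (int k = 0; k < nreg; k++) {
            double d = a[k][k];
            if (Math.abs(d) < 1e-12) {
                throw new ArithmeticException("zero pivot at column " + k);
            }
            // Update every entry outside the pivot row and pivot column.
            for (int i = 0; i < n; i++) {
                if (i == k) continue;
                for (int j = 0; j < n; j++) {
                    if (j == k) continue;
                    a[i][j] -= a[i][k] * a[k][j] / d;
                }
            }
            // Scale the pivot column (negated) and the pivot row.
            for (int i = 0; i < n; i++) {
                if (i == k) continue;
                a[i][k] = -a[i][k] / d;
                a[k][i] = a[k][i] / d;
            }
            a[k][k] = 1.0 / d;
        }
    }

    public static void main(String[] args) {
        // Cross products for x = 0, 1, 2 (plus intercept) and y = 1, 3, 6.
        double[][] a = {{3, 3, 10}, {3, 5, 15}, {10, 15, 46}};
        sweep(a, 2);
        System.out.println("intercept = " + a[0][2]);
        System.out.println("slope     = " + a[1][2]);
        System.out.println("e'e       = " + a[2][2]);
    }
}
```

For the three points in main, the fitted intercept is 5/6, the slope is 2.5, and the sum of squared errors is 1/6, which can be checked by hand from the normal equations.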

Getting most of the info (that RegressionResults surfaces) is simply a matter 
of indexing. All I need to do in this case is write a wrapper around a 
symmetric matrix which implements the interface.

I suppose that there could be a constructor which took the matrix above and did 
the indexing, but that seems too dirty. Furthermore, there are probably other 
optimized formats for OLS which have similar aspects. I wanted to keep the door 
open to other schemes, without making (potentially large) copies of variance 
matrices, standard errors and so forth a necessity.
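
As a rough illustration of the wrapper idea, the sketch below reads results straight out of the swept matrix without copying anything. ResultsView is a hypothetical stand-in for the proposed RegressionResults interface (the real method names are in the attachment and may differ), and the layout is assumed to be the one shown above, with -beta in the last column.

```java
// Hypothetical stand-in for the proposed results interface.
interface ResultsView {
    int getNumberOfParameters();
    double getParameterEstimate(int i);
    double getErrorSumSquares();
    double getStdErrorOfEstimate(int i);
}

/** Read-only view over a swept cross products matrix; no data is copied. */
public final class SweptMatrixResults implements ResultsView {
    private final double[][] m;  // assumed layout: [inv(X'X), -beta; -beta', e'e]
    private final int nreg;
    private final long nobs;

    public SweptMatrixResults(double[][] swept, long nobs) {
        this.m = swept;               // shared reference, caller must not mutate
        this.nreg = swept.length - 1;
        this.nobs = nobs;
    }

    @Override
    public int getNumberOfParameters() {
        return nreg;
    }

    @Override
    public double getParameterEstimate(int i) {
        return -m[i][nreg];           // last column stores -beta
    }

    @Override
    public double getErrorSumSquares() {
        return m[nreg][nreg];
    }

    @Override
    public double getStdErrorOfEstimate(int i) {
        // sqrt( s^2 * inv(X'X)[i][i] ), with s^2 = e'e / (nobs - nreg)
        double s2 = getErrorSumSquares() / (nobs - nreg);
        return Math.sqrt(s2 * m[i][i]);
    }
}
```

Because the view shares the array with its creator, immutability holds only if the solver hands over a matrix it will not touch again. That is the trade-off against a concrete class that copies everything on construction.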


On the name of the getter for number of observations, I am okay with whatever 
you feel is a better name.
 

> Regarding the model interface, I would again suggest that we just define 
> this as a class, UpdatingOLSRegression.  I suppose that if we end up 
> implementing a weighted or other non-OLS version, we might want to factor out a 
> common interface like what exists for MultipleLinearRegression, but in 
> retrospect, I am not sure that interface was worth much.  Note that all that we 
> could factor out is essentially what is in MultivariateRegression, which is 
> analogous to your RegressionResults.


So you are saying that UpdatingOLSRegression should be an abstract class? There 
are not that many methods in the interface. That would be okay if we were sure 
that subclasses always overrode either the regress(...) methods or the 
addObservations(...) methods. I worry that you might end up with a base class 
full of nothing but abstract functions.

> Current Multiple Regression Object does calculations with all data incore. 
> There are non incore techniques which would be useful with large datasets.
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MATH-607
>                 URL: https://issues.apache.org/jira/browse/MATH-607
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 3.0
>         Environment: Java
>            Reporter: greg sterijevski
>              Labels: Gentleman's, QR, Regression, Updating, decomposition, 
> lemma
>             Fix For: 3.0
>
>         Attachments: updating_reg_ifaces
>
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> The current multiple regression class does a QR decomposition on the complete 
> data set. This necessitates the loading incore of the complete dataset. For 
> large datasets, or large datasets and a requirement to do datamining or 
> stepwise regression this is not practical. There are techniques which form 
> the normal equations on the fly, as well as ones which form the QR 
> decomposition on an update basis. I am proposing, first, the specification of 
> an "UpdatingLinearRegression" interface which defines basic functionality all 
> such techniques must fulfill. 
> Related to this 'updating' regression, the results of running a regression on 
> some subset of the data should be encapsulated in an immutable object. This 
> is to ensure that subsequent additions of observations do not corrupt or 
> render inconsistent parameter estimates. I am calling this interface 
> "RegressionResults".  
> Once the community has reached a consensus on the interface, work on the 
> concrete implementation of these techniques will take place.
> Thanks,
> -Greg

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
