Re: [math] Improving numerics in OLSMultipleLinearRegression

Mauro Talevi Thu, 12 Jun 2008 05:16:20 -0700

Phil Steitz wrote:

Yes, and I would distinguish performance optimization from numerical accuracy. From my perspective, we can release a ".0" with room for performance improvement, but at least decent numerics are required.

I agree that decent numerics are required. I'm still rather surprised that the diagonal covariance case would yield such bad numerics wrt the GLS case - which has been tested with independent Fortran code to a level of 10^-6.

We have talked in the past about providing an implementation based on QR decomposition. Anyone up for using the QR decomposition that we now have to do this? I really think we need to do it (or something else to improve numerics) before releasing this class. I will get to it eventually, but am a little pegged at the moment.

Are you proposing doing a QR decomposition of both the X and Y matrices and working out the formulas using the decomposed ones?
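
For reference, in the textbook QR approach to least squares only X is factored (X = QR); the y vector is multiplied by Q^T and the coefficients come from back substitution on R b = Q^T y, so X'X is never formed. Whether that is what is being proposed here is not settled in this thread. A minimal sketch in plain Java follows; the class QrOlsSketch and its solve method are hypothetical names for illustration, not the commons-math QR API, and full column rank is assumed.

```java
/** Hypothetical sketch: OLS coefficients via Householder QR of X only. */
final class QrOlsSketch {

    private QrOlsSketch() { }

    /**
     * Minimizes ||y - X b||_2 for an n x p matrix X with n >= p.
     * Assumes X has full column rank; only an exactly zero column norm is detected.
     */
    static double[] solve(double[][] x, double[] y) {
        int n = x.length;
        int p = x[0].length;
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++) {
            a[i] = x[i].clone();            // work on a copy of X
        }
        double[] qty = y.clone();           // becomes Q^T y
        double[] rDiag = new double[p];     // diagonal of R

        for (int k = 0; k < p; k++) {
            // 2-norm of column k from the diagonal down, without overflow
            double norm = 0.0;
            for (int i = k; i < n; i++) {
                norm = Math.hypot(norm, a[i][k]);
            }
            if (norm == 0.0) {
                throw new IllegalArgumentException("X is rank deficient at column " + k);
            }
            // pick the sign that avoids cancellation, then form the Householder vector
            if (a[k][k] < 0) {
                norm = -norm;
            }
            for (int i = k; i < n; i++) {
                a[i][k] /= norm;
            }
            a[k][k] += 1.0;

            // apply the reflection to the remaining columns of X
            for (int j = k + 1; j < p; j++) {
                double s = 0.0;
                for (int i = k; i < n; i++) {
                    s += a[i][k] * a[i][j];
                }
                s = -s / a[k][k];
                for (int i = k; i < n; i++) {
                    a[i][j] += s * a[i][k];
                }
            }
            // apply the same reflection to y
            double s = 0.0;
            for (int i = k; i < n; i++) {
                s += a[i][k] * qty[i];
            }
            s = -s / a[k][k];
            for (int i = k; i < n; i++) {
                qty[i] += s * a[i][k];
            }
            rDiag[k] = -norm;
        }

        // back substitution on R b = (Q^T y)[0..p-1]; strict upper part of R is stored in a
        double[] b = new double[p];
        for (int k = p - 1; k >= 0; k--) {
            double sum = qty[k];
            for (int j = k + 1; j < p; j++) {
                sum -= a[k][j] * b[j];
            }
            b[k] = sum / rDiag[k];
        }
        return b;
    }
}
```

Here X is expected to already contain the column of 1's if an intercept is wanted.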

Here are some initial ideas on what should be included in the multiple regression API. Other suggestions welcome!
1. Coefficients should be accompanied by standard errors, t-statistics, two-sided t probabilities (can get these using t distribution from distributions package) and ideally confidence intervals.
2. F, R-square, adjusted R-square, F prob (again can use distributions package to estimate)
3. ANOVA table (Regression sum of squares, residual sum of squares)
4. Residuals
R, SAS, SPSS and Excel all represent (or in the case of R, can construct) these basic statistics in some way in their output. We should model them in classes representing properties of the computed model.


Perhaps we should put these on the wiki or even better in jira.
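
To make items 2 to 4 of the list concrete for a wiki or jira entry, here is a rough sketch of a result class holding residuals, the sums of squares, R-square, adjusted R-square and F. The class name FitSummarySketch and its fields are hypothetical; it assumes a model with an intercept, takes p as the number of estimated coefficients including the intercept, and leaves out the probabilities since those need the distributions package.

```java
/** Hypothetical sketch of a fit-summary holder for a model with intercept. */
final class FitSummarySketch {
    final double[] residuals;
    final double residualSumOfSquares;
    final double totalSumOfSquares;
    final double regressionSumOfSquares;
    final double rSquared;
    final double adjustedRSquared;
    final double fStatistic;

    /**
     * @param y      observed values
     * @param fitted fitted values from the estimated model
     * @param p      number of coefficients including the intercept (requires 1 < p < n)
     */
    FitSummarySketch(double[] y, double[] fitted, int p) {
        int n = y.length;
        if (fitted.length != n || p <= 1 || p >= n) {
            throw new IllegalArgumentException("need fitted.length == n and 1 < p < n");
        }
        double mean = 0.0;
        for (double v : y) {
            mean += v;
        }
        mean /= n;

        residuals = new double[n];
        double rss = 0.0;
        double tss = 0.0;
        for (int i = 0; i < n; i++) {
            residuals[i] = y[i] - fitted[i];
            rss += residuals[i] * residuals[i];
            double d = y[i] - mean;
            tss += d * d;                      // total SS about the mean
        }
        residualSumOfSquares = rss;
        totalSumOfSquares = tss;
        regressionSumOfSquares = tss - rss;
        rSquared = 1.0 - rss / tss;
        // penalize for the number of estimated coefficients
        adjustedRSquared = 1.0 - (1.0 - rSquared) * (n - 1) / (double) (n - p);
        // F = (SSR / (p - 1)) / (RSS / (n - p))
        fStatistic = (regressionSumOfSquares / (p - 1)) / (rss / (n - p));
    }
}
```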

IMO, it's best to deal with the numerics and the new data input strategies, before adding new functionality in the frame.

And finally, how do you see the no/hasIntercept model working?
As a configurable property - noIntercept means the model is estimated without an intercept. The point I was making was more how the data is supplied via the API. It is awkward to have to fill in a column of 1's to get the linear algebra to work to estimate a model with intercept (which should be the default).


ok - good point.
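
One way the noIntercept property could work is for the class to add the column of 1's itself, so the user only ever supplies the independent variables. A minimal sketch, assuming a hypothetical helper DesignMatrixSketch.build that is not part of the current code:

```java
/** Hypothetical sketch: build the design matrix so users never supply the 1's column. */
final class DesignMatrixSketch {

    private DesignMatrixSketch() { }

    /**
     * @param x           n x k array of independent variables (no constant column)
     * @param noIntercept if true the model is estimated without an intercept
     * @return a new n x (k + 1) array with a leading column of 1's,
     *         or a copy of x when noIntercept is true
     */
    static double[][] build(double[][] x, boolean noIntercept) {
        int n = x.length;
        int k = x[0].length;
        int offset = noIntercept ? 0 : 1;
        double[][] design = new double[n][k + offset];
        for (int i = 0; i < n; i++) {
            if (x[i].length != k) {
                throw new IllegalArgumentException("row " + i + " has wrong length");
            }
            if (!noIntercept) {
                design[i][0] = 1.0;        // constant term
            }
            System.arraycopy(x[i], 0, design[i], offset, k);
        }
        return design;
    }
}
```

With this, a model with intercept is the default path (noIntercept = false) and the caller's array is left untouched.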

I would recommend that we have setData or "newData" provide an n x m matrix, where n is the number of observations and m-1 is the number of independent variables. Then either a) have the constructor take another argument specifying which column holds the dependent variable b) assume it is the first column c) support column labels and some form of model specification such as what R provides (a lot of work) d) split off the y vector, so setting data requires separate x and y vectors. Probably a) is easiest for users, who will most often be starting with a rectangular array of data with the dependent variable in one of the columns.

Perhaps it would help if we had overloaded newData methods that accept different input strategies, but ultimately they will produce an n x m double array. That way we can provide users with choice.
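
A rough sketch of what the overloads could look like, covering options a), b) and d) and normalizing all of them to one n x m array. The class name RegressionDataSketch, the getters, and the convention of storing y in column 0 are assumptions for illustration only.

```java
/** Hypothetical sketch: overloaded newData methods that all produce one n x m array. */
final class RegressionDataSketch {
    private double[][] data;   // n x m, column 0 holds y, columns 1..m-1 hold x

    /** Option b): the dependent variable is assumed to be the first column. */
    void newData(double[][] table) {
        newData(table, 0);
    }

    /** Option a): the caller says which column holds the dependent variable. */
    void newData(double[][] table, int yColumn) {
        int n = table.length;
        int m = table[0].length;
        if (yColumn < 0 || yColumn >= m) {
            throw new IllegalArgumentException("yColumn out of range: " + yColumn);
        }
        double[][] out = new double[n][m];
        for (int i = 0; i < n; i++) {
            if (table[i].length != m) {
                throw new IllegalArgumentException("row " + i + " has wrong length");
            }
            out[i][0] = table[i][yColumn];
            int c = 1;
            for (int j = 0; j < m; j++) {
                if (j != yColumn) {
                    out[i][c++] = table[i][j];   // keep x columns in original order
                }
            }
        }
        data = out;
    }

    /** Option d): separate y vector and x matrix. */
    void newData(double[] y, double[][] x) {
        if (y.length != x.length) {
            throw new IllegalArgumentException("y and x must have the same number of rows");
        }
        int k = x[0].length;
        double[][] out = new double[y.length][k + 1];
        for (int i = 0; i < y.length; i++) {
            if (x[i].length != k) {
                throw new IllegalArgumentException("row " + i + " has wrong length");
            }
            out[i][0] = y[i];
            System.arraycopy(x[i], 0, out[i], 1, k);
        }
        data = out;
    }

    /** Returns a copy of the dependent variable. */
    double[] getY() {
        double[] y = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            y[i] = data[i][0];
        }
        return y;
    }

    /** Returns a copy of the independent variables (no intercept column). */
    double[][] getX() {
        double[][] x = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            x[i] = java.util.Arrays.copyOfRange(data[i], 1, data[i].length);
        }
        return x;
    }
}
```

Option c) (column labels and R-style model specification) is left out, since it is a lot more work.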


I'll get to it in the next week or so - ATM I'm a bit loaded myself.

Cheers

