Re: [math] redesigning the optimization/estimation packages (phase 2)

Phil Steitz Sat, 28 Feb 2009 06:53:25 -0800

Luc Maisonobe wrote:

I'm afraid what I propose is far more basic than what you need. I hardly
understand the concepts you use in your message (I'm not a statistician
at all, I work in space flight dynamics). I assume we use the same words
with slightly different meanings in mind (models, residuals,
observations, variables ...).


I'm clearly unable to fulfill your requirements.

What is going on here is that the special case - linear regressionanalysis - is being discussed as an example of an optimization problem.This is correct mathematically, but from the API standpoint it violatesa principle that I would like to see us stick to, which is to make itas easy as possible to find solutions to practical programming problemsusing commons-math. From this standpoint, it is clear that to estimatelinear models, one should look in the stat.regression package. The APIand functionality there is currently primitive. It is going to have tobe cleaned up and extended to be releasable, IMO. Suggestions forimprovement - ideally with patches - are most welcome. Just as theimplementations there use classes from the linear package, it may turnout to be useful (not having designed or written any code, I can't sayfor sure) to use implementations from the optimization package for somenot-yet-implemented features. But the user-facing API and the userguide should point to the stat package. This is going to happen moreand more as the overall library becomes more powerful - there will bemultiple ways to use commons-math to solve problems. We need to have asimple, minimally complex (from API and math concepts requiredstandpoint) solution path present itself to the user when browsing thejavadoc or reading the user guide looking for solutions to commonpractical problems. That means from the API standpoint we break outcommon use cases rather than trying to build a pseudo-programminglanguage or overly abstract API.

While I love R, I would hardly describe it as easy to use. The goal ofa good library API in my opinion is not just to minimize the number oflines of code required, as this can lead to inscurtable code (like manyR programs). A decent java API with good javadoc that focusses onsolving the most common practical problems is what we are after incommons-math. We certainly have much room for improvement and that iswhat we are trying to do in 2.0. Thanks in advance to all for feedbackand suggestions for improvement. Patches welcome!


Phil

Does this imply we continue with only revamping the basic setup that was
imported from mantissa for 1.2 in order to have a simplistic but
consistent framework ? Does this imply we get rid of this limited view
and take the opportunity of version 2.0 to build a complete
statistically sound estimation package ?

I can do the former (taking into account the residuals computation
issues, and probably slightly changing the meaning of the weights to
compute wi*(ri^2) rather than (wi*ri)^2 as done now). I cannot do the
latter.

Luc

Ted Dunning wrote :

Here is what I would do in R to do a linear model with observations
consisting of input variables x1 ... x3 and target variable y:

     m <- lm(y ~ x1 + x2 + x3, data)

The first argument lets me supply the form of the regression.  I could have
regressed on log(x3), for instance, or used x1*x2 + x3 to include x1, x2, x3
and the x1:x2 interaction or (x1+x2+x3)^2.

To do logistic regression, I would switch to glm (for generalized linear
model):

     m <- glm(y ~ x1 + x2 + x3, data, family=binomial())

To do Bayesian logistic regression, I would use bayesglm from the arm
library:

    # this happens once to install the package
    install.packages("arm");
    # this happens once per session
    library(arm)
    # and this gives me my model
    m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial())

To give weights to observations using the k column of the data:

    m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial(), weights=data$k)

If I want to compute the posterior distribution of outputs, I can use
sims(m)$coefs to get a bunch of samples from the coefficient posterior and
multiply them by the observation matrix in one or two (fairly simple) lines
of code.  If I just want outputs for MAP model estimates, then I do

    ytwiddle = predict(m, newdata)

This works for any of the models above.

If I could do anything like this using commons math in less than 10x as many
lines of code, I would be thrilled.

Some notes about the user experience here:

a) I don't have to care about residuals, but if I do care, they are in the
resulting model structure and I can look at them.  If I do plot(m) I get a
bunch of interesting diagnostic plots.  If I do coefplot(m), I get my
coefficients with error bars.

b) I don't care *how* my weights affect the algorithm.  It just works.  I
can dig into the guts in various ways, but I don't have to.

c) I can effectively subset my data column-wise without creating new
matrices.  I can also change the structure of my model all over the place
very easily without changing the data.  This is (was) a controversial aspect
of R's modeling structure and it makes it really hard to find logistic
regression in the index of a book because the libraries are oriented around
the generalized linear modeling view of the world (which isn't so bad after
you get going with it).

d) lots of diagnostic info gets embedded into the resulting model.




On Thu, Feb 26, 2009 at 12:14 PM, Luc Maisonobe <[email protected]>wrote:

This way, the residual would not be seen and only the
pure observation vectors should be provided by user.



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [math] redesigning the optimization/estimation packages (phase 2)

Reply via email to