Luc Maisonobe wrote:
I'm afraid what I propose is far more basic than what you need. I hardly
understand the concepts you use in your message (I'm not a statistician
at all, I work in space flight dynamics). I assume we use the same words
with slightly different meanings in mind (models, residuals,
observations, variables ...).
I'm clearly unable to fulfill your requirements.
What is going on here is that the special case - linear regression
analysis - is being discussed as an example of an optimization problem.
This is correct mathematically, but from the API standpoint it violates
a principle that I would like to see us stick to, which is to make it
as easy as possible to find solutions to practical programming problems
using commons-math. From this standpoint, it is clear that to estimate
linear models, one should look in the stat.regression package. The API
and functionality there is currently primitive. It is going to have to
be cleaned up and extended to be releasable, IMO. Suggestions for
improvement - ideally with patches - are most welcome. Just as the
implementations there use classes from the linear package, it may turn
out to be useful (not having designed or written any code, I can't say
for sure) to use implementations from the optimization package for some
not-yet-implemented features. But the user-facing API and the user
guide should point to the stat package. This is going to happen more
and more as the overall library becomes more powerful - there will be
multiple ways to use commons-math to solve problems. We need to have a
simple, minimally complex (from API and math concepts required
standpoint) solution path present itself to the user when browsing the
javadoc or reading the user guide looking for solutions to common
practical problems. That means from the API standpoint we break out
common use cases rather than trying to build a pseudo-programming
language or overly abstract API.
While I love R, I would hardly describe it as easy to use. The goal of
a good library API in my opinion is not just to minimize the number of
lines of code required, as this can lead to inscurtable code (like many
R programs). A decent java API with good javadoc that focusses on
solving the most common practical problems is what we are after in
commons-math. We certainly have much room for improvement and that is
what we are trying to do in 2.0. Thanks in advance to all for feedback
and suggestions for improvement. Patches welcome!
Phil
Does this imply we continue with only revamping the basic setup that was
imported from mantissa for 1.2 in order to have a simplistic but
consistent framework ? Does this imply we get rid of this limited view
and take the opportunity of version 2.0 to build a complete
statistically sound estimation package ?
I can do the former (taking into account the residuals computation
issues, and probably slightly changing the meaning of the weights to
compute wi*(ri^2) rather than (wi*ri)^2 as done now). I cannot do the
latter.
Luc
Ted Dunning wrote :
Here is what I would do in R to do a linear model with observations
consisting of input variables x1 ... x3 and target variable y:
m <- lm(y ~ x1 + x2 + x3, data)
The first argument lets me supply the form of the regression. I could have
regressed on log(x3), for instance, or used x1*x2 + x3 to include x1, x2, x3
and the x1:x2 interaction or (x1+x2+x3)^2.
To do logistic regression, I would switch to glm (for generalized linear
model):
m <- glm(y ~ x1 + x2 + x3, data, family=binomial())
To do Bayesian logistic regression, I would use bayesglm from the arm
library:
# this happens once to install the package
install.packages("arm");
# this happens once per session
library(arm)
# and this gives me my model
m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial())
To give weights to observations using the k column of the data:
m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial(), weights=data$k)
If I want to compute the posterior distribution of outputs, I can use
sims(m)$coefs to get a bunch of samples from the coefficient posterior and
multiply them by the observation matrix in one or two (fairly simple) lines
of code. If I just want outputs for MAP model estimates, then I do
ytwiddle = predict(m, newdata)
This works for any of the models above.
If I could do anything like this using commons math in less than 10x as many
lines of code, I would be thrilled.
Some notes about the user experience here:
a) I don't have to care about residuals, but if I do care, they are in the
resulting model structure and I can look at them. If I do plot(m) I get a
bunch of interesting diagnostic plots. If I do coefplot(m), I get my
coefficients with error bars.
b) I don't care *how* my weights affect the algorithm. It just works. I
can dig into the guts in various ways, but I don't have to.
c) I can effectively subset my data column-wise without creating new
matrices. I can also change the structure of my model all over the place
very easily without changing the data. This is (was) a controversial aspect
of R's modeling structure and it makes it really hard to find logistic
regression in the index of a book because the libraries are oriented around
the generalized linear modeling view of the world (which isn't so bad after
you get going with it).
d) lots of diagnostic info gets embedded into the resulting model.
On Thu, Feb 26, 2009 at 12:14 PM, Luc Maisonobe <[email protected]>wrote:
This way, the residual would not be seen and only the
pure observation vectors should be provided by user.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]