Hi:

The reason you can fit N linear models with the same predictors quickly
is that the model matrix X is fixed, so the multiple responses can be
stored as the columns of a matrix Y. Consequently, all N regressions
can be run in one shot with a single least squares decomposition, and
lm() implements this efficiently by allowing the response to be a
matrix.
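To illustrate (a sketch with made-up data, not code from the thread):
a matrix response on the left-hand side of the formula costs roughly
one decomposition, no matter how many columns Y has.

```r
## Sketch: lm() accepts a matrix response, so all columns of Y are fit
## against the same model matrix in a single pass.
set.seed(1)
x <- rnorm(150)                                  # one fixed predictor
Y <- matrix(rnorm(150 * 6000), nrow = 150)       # 6000 responses, one per column
fit <- lm(Y ~ -1 + x)                            # one fit for all 6000 columns
res <- residuals(fit)                            # 150 x 6000 residual matrix
```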

Your intention is to keep Y fixed and change the model matrix each
time. With 6000 potential predictors, that means 6000 separate
regression analyses, or to put it another way, 6000 iterations of the
least squares algorithm. It's unrealistic to expect the same timing
performance because the computing tasks are entirely different in
nature. As Josh showed, you can improve the timing by stripping the
computation down to its bare essentials. It might also be worthwhile
to peruse Douglas Bates's article on least squares calculations in R
(starting on p. 17) for some useful tips:
http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf
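For reference, "stripping down" here means bypassing lm()'s formula and
model-frame machinery, e.g. by calling the fitter lm.fit() directly.
And since each fit in your problem is a no-intercept regression on a
single predictor, the residuals even have a closed form that vectorizes
completely: beta_i = <x_i, y> / <x_i, x_i> and r_i = y - beta_i * x_i.
A sketch under your setup (this is not necessarily what Josh posted):

```r
set.seed(1)
regress.y  <- rnorm(150)
predictors <- matrix(rnorm(6000 * 150), nrow = 6000, ncol = 150)

## (a) skip the formula/model-frame overhead: call lm.fit() directly
res1 <- t(apply(predictors, 1,
                function(x) lm.fit(cbind(x), regress.y)$residuals))

## (b) fully vectorized closed form for regression through the origin:
##     beta[i] = sum(x_i * y) / sum(x_i^2);  residual row i = y - beta[i] * x_i
beta <- drop(predictors %*% regress.y) / rowSums(predictors^2)
res2 <- matrix(regress.y, nrow(predictors), ncol(predictors),
               byrow = TRUE) - beta * predictors

all.equal(res1, res2)   # the two approaches agree up to numerical tolerance
```

Approach (b) does no per-row fitting at all, so it should be far faster
than any loop over lm() calls.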

And for God's sake, don't run regression through the origin by
default, even if it's 'scientifically justifiable'. If you don't know
why, you need to do some reading. The topic has been hashed out on
this list several times - check the archives or do a Google search.

HTH,
Dennis

On Fri, Jul 29, 2011 at 8:30 AM,  <[email protected]> wrote:
> Hi, everyone.
> I need to run lm with the same response vector but with varying predictor
> vectors (i.e., 1 response vector regressed on each of 6,000 predictor vectors).
> After looking through the R archives, I found roughly 3 methods that have
> been suggested.
> Unfortunately, I need to run this task multiple times (~5,000 times) and
> would like to find a faster way than the existing methods.
> All three methods below, timed with system.time on my machine, take
> 13~20 seconds.
>
> The opposite direction, 6,000 response vectors and 1 predictor vector,
> is supported by lm and runs really fast, ~0.5 seconds.
> They perform pretty much the same number of lm fits, so I was
> wondering if there is a faster way, before I try to code this up in C++.
>
> thanks!!
>
> ## sample data ###
> regress.y = rnorm(150)
> predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
>
> ## method 1 ##
> data.residuals = t(apply(predictors, 1, function(x)
>     lm(regress.y ~ -1 + as.vector(x))$residuals))
>
> user  system elapsed
>  15.076   0.048  15.128
>
> ## method 2 ##
> data.residuals = matrix(0, nrow=nrow(predictors), ncol=ncol(predictors))
>
> for( i in 1:nrow(predictors)){
>    pred = as.vector(predictors[i,])
>    data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals
> }
>
>  user  system elapsed
>  13.841   0.012  13.854
>
> ## method 3 ##
> library(nlme)
>
> all.data <- data.frame(y=rep(regress.y, nrow(predictors)),
>                        x=c(t(predictors)),
>                        g=gl(nrow(predictors), ncol(predictors)))
> all.fits <- lmList(y ~ x | g, data=all.data)
> data.residuals = matrix(residuals(all.fits), nrow=nrow(predictors),
>                         ncol=ncol(predictors))
>
> user  system elapsed
>  36.407   0.484  36.892
>
>
> ## the opposite direction, supported by lm ###
> lm(t(predictors) ~ -1 + regress.y)$residuals
>
>  user  system elapsed
>  0.500   0.120   0.613
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
