I would be curious to know how sparse the model.matrix for this problem is...
Unless it is quite dense, or, as Brian implies, quite singular, I might
suggest computing a Cholesky factorization in SparseM.



url: www.econ.uiuc.edu/~roger        Roger Koenker
email: [EMAIL PROTECTED]             Department of Economics
vox: 217-333-4558                    University of Illinois
fax: 217-244-6678                    Champaign, IL 61820

On May 11, 2004, at 7:07 AM, Douglas Bates wrote:

<[EMAIL PROTECTED]> writes:

Hello,

A colleague of mine has compared the runtime of a linear model + ANOVA in SAS and S+. He got the same results, but SAS took a bit more than a minute whereas S+ took 17 minutes. I tried it in R (1.9.0) and it took 15 minutes. Neither machine ran out of memory, and I assume that all machines have similar hardware, but the S+ and SAS machines are on Windows whereas the R machine is Red Hat Linux 7.2.

My question is whether I'm doing something wrong (technically) in calling the lm routine, or (if not), how I can optimize the call to lm, or even use an alternative to lm. I'd like to run about 12,000 of these models in R (for a gene expression experiment, one model per gene), which would take far too long.

I've run the following code in R (and S+):

...


As Brian Ripley mentioned, you could save the model matrix and use it
with each of your responses.  Versions 0.8-1 and later of the Matrix
package have a vignette that provides comparative timings of various
ways of obtaining the least squares estimates.  If you use the classes
from the Matrix package and create and save the crossproduct of the
model matrix

mm = as(model.matrix(Va ~ Ba+Ti..., df), "geMatrix")
cprod = crossprod(mm)

then successive calls to

coef = solve(cprod, crossprod(mm, df$Va))

will produce the coefficient estimates much faster than will calls to
lm, which each do all the work of generating and decomposing the very
large model matrix.

Note that this method only produces the coefficient estimates, which
may be enough for your purposes.  Also, this method will not handle
missing data or rank-deficient model matrices in the elegant way that
lm does.
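The strategy above can be sketched in base R with simulated data (this sketch is not from the original thread; it assumes a complete-case, full-rank model matrix, and uses a Cholesky factorization computed once so that each of the many responses costs only two triangular solves):

```r
# Simulated stand-in for the real problem: one fixed model matrix,
# many response columns (one per gene).
set.seed(1)
n <- 100; p <- 5; ngenes <- 50                        # small illustrative sizes
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # shared model matrix
Y <- matrix(rnorm(n * ngenes), n, ngenes)             # one column per gene

XtX <- crossprod(X)        # X'X, computed once
R   <- chol(XtX)           # upper-triangular Cholesky factor, computed once

# Solve the normal equations X'X b = X'y for every response column at once:
# first R'z = X'Y (forward solve), then R b = z (back solve).
coefs <- backsolve(R, forwardsolve(t(R), crossprod(X, Y)))
```

Each additional response then costs only the crossprod with X and the two triangular solves, rather than a full decomposition of the n-by-p model matrix as lm would repeat per gene.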

If you are doing this 12,000 times it may be worthwhile checking if
the sparse matrix formulation

mmS = as(mm, "cscMatrix")
cprodS = crossprod(mmS)

is faster.
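A sketch of that comparison (not from the original thread; the 2004-era class names such as "cscMatrix" have since been renamed in the Matrix package, so this uses Matrix(..., sparse = TRUE) and simulated mostly-zero data to illustrate the timing check):

```r
library(Matrix)  # recommended package, ships with R
set.seed(1)
n <- 2000; p <- 50
# A mostly-zero model matrix, as might arise from many dummy variables
X  <- cbind(1, matrix(rbinom(n * (p - 1), 1, 0.05), n, p - 1))
Xs <- Matrix(X, sparse = TRUE)   # sparse (column-compressed) storage

# Compare the cost of forming the crossproduct in each representation
system.time(for (i in 1:100) crossprod(X))   # dense
system.time(for (i in 1:100) crossprod(Xs))  # sparse
```

Whether the sparse version wins depends on the actual fill-in of the model matrix, which is why it is worth timing both on your own data before committing to one formulation.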

The dense matrix formulation (but not the sparse) can benefit from
installation of optimized BLAS routines such as Atlas or Goto's BLAS.

--
Douglas Bates [EMAIL PROTECTED]
Statistics Department 608/262-2598
University of Wisconsin - Madison http://www.stat.wisc.edu/~bates/


______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

