> It's actually only a few lines of code to do this from first principles.
> The coefficients depend only on the cross products X'X and X'y, and you
> can build them up easily by extending this example to read files or
> a database holding x and y instead of getting them from the args.
> Here we process incr rows of the builtin matrix state.x77 at a time,
> building up the two cross products, xtx and xty, regressing
> Income (variable 2) on the other variables:
> mylm <- function(x, y, incr = 25) {
>   start <- xtx <- xty <- 0
>   while (start < nrow(x)) {
>     idx <- seq(start + 1, min(start + incr, nrow(x)))  # next block of rows
>     x1 <- cbind(1, x[idx, ])            # prepend an intercept column
>     xtx <- xtx + crossprod(x1)          # accumulate X'X
>     xty <- xty + crossprod(x1, y[idx])  # accumulate X'y
>     start <- start + incr
>   }
>   solve(xtx, xty)  # coefficients from the normal equations
> }
> mylm(state.x77[,-2], state.x77[,2])
>
> On 8/16/07, Alp ATICI <alpatici at gmail.com> wrote:
> > I'd like to fit linear models on very large datasets. My data frames
> > are about 2000000 rows x 200 columns of doubles, and I am using a
> > 64-bit build of R. I've googled about this extensively and went over
> > the "R Data Import/Export" guide. My primary issue is that although my
> > data represented in ascii form is 4Gb in size (therefore much smaller
> > in binary), R consumes about 12Gb of virtual memory.
> >
> > What exactly are my options to improve this? I looked into the biglm
> > package, but the problem with it is that it uses the update() function
> > and is therefore not transparent (I am using a sophisticated script
> > which is hard to modify). I really liked the concept behind the LM
> > package here: http://www.econ.uiuc.edu/~roger/research/rq/RMySQL.html
> > But it is no longer available. How could one fit linear models to very
> > large datasets without loading the entire set into memory, but from a
> > file/database (possibly through a connection), using a relatively
> > simple modification of standard lm()? Alternatively, how could one
> > improve the memory usage of R given a large dataset (by changing some
> > default parameters of R or even using on-the-fly compression)? I don't
> > mind much higher levels of CPU time required.
> >
> > Thank you in advance for your help.

If your design matrix X is very well behaved, this approach may work for
you. Often just doing solve(X'X, X'y) will fail for numerical reasons.
The right way to do it is to factor the matrix X as

  X = A * B

where B is 200 x 200 in your case and A is 2000000 x 200 with A'*A = I
(that is, the columns of A are orthonormal), so that X'*X = B'*B and the
normal equations B'*B * beta = X'*y reduce to the triangular system
B * beta = A'*y. To find A and B you can use modified Gram-Schmidt, which
is very easy to program and works well when you wish to store the columns
of X on a hard disk and just read in a bit at a time. Some people claim
that modified Gram-Schmidt is unstable, but it has always worked well for
me. In any event you can check to ensure that A'*A = I and X = A*B.

Cheers,

 Dave

--
David A. Fournier
P.O. Box 2040,
Sidney, B.C. V8l 3S3
Canada
Phone/FAX 250-655-3364
http://otter-rsch.com
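To make the factorization concrete, here is a minimal in-memory sketch of
modified Gram-Schmidt in R, run on the same state.x77 example as above.
The function name mgs and the all-in-memory layout are illustrative
choices, not code from the thread; in the out-of-core setting David
describes, each column of X would be read from disk in blocks rather than
held in a matrix.

mgs <- function(X) {
  p <- ncol(X)
  A <- X                         # overwritten with orthonormal columns
  B <- matrix(0, p, p)           # upper-triangular factor
  for (k in seq_len(p)) {
    B[k, k] <- sqrt(sum(A[, k]^2))
    A[, k] <- A[, k] / B[k, k]   # normalize column k
    if (k < p) {
      for (j in (k + 1):p) {     # orthogonalize remaining columns against it
        B[k, j] <- sum(A[, k] * A[, j])
        A[, j] <- A[, j] - B[k, j] * A[, k]
      }
    }
  }
  list(A = A, B = B)
}

X <- cbind(1, state.x77[, -2])
y <- state.x77[, 2]
f <- mgs(X)

## B'B beta = X'y reduces to the triangular system B beta = A'y
beta <- backsolve(f$B, crossprod(f$A, y))

## the checks suggested above, both ~ 0 up to rounding
max(abs(crossprod(f$A) - diag(ncol(X))))   # A'A = I
max(abs(X - f$A %*% f$B))                  # X = A*B

beta should agree with the mylm() result above up to rounding error.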
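Since the original question mentions biglm, note that its update() method
maps directly onto the same chunked loop. A hedged sketch, assuming the
CRAN biglm package, again on two 25-row chunks of state.x77 so it matches
the running example (the chunking and the spelled-out formula are
illustrative, not from the thread):

library(biglm)

df <- as.data.frame(state.x77)
names(df) <- make.names(names(df))   # "Life Exp" -> "Life.Exp", etc.
form <- Income ~ Population + Illiteracy + Life.Exp + Murder +
  HS.Grad + Frost + Area

chunk <- split(df, ceiling(seq_len(nrow(df)) / 25))  # two 25-row pieces
fit <- biglm(form, data = chunk[[1]])                # fit on the first chunk
for (i in seq_along(chunk)[-1])
  fit <- update(fit, chunk[[i]])                     # fold in each later chunk
coef(fit)  # should agree with mylm(state.x77[,-2], state.x77[,2])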