Re: [R] biglm: how it handles large data set?

2010-11-01 Thread Mike Marchywka


 Date: Sun, 31 Oct 2010 00:22:12 -0700
 From: tim@netzero.net
 To: r-help@r-project.org
 Subject: [R] biglm: how it handles large data set?



 I am trying to figure out why 'biglm' can handle large data sets...

 According to the R documentation: biglm creates a linear model object that uses
 only p^2 memory for p variables. It can be updated with more data using
 update. This allows linear regression on data sets larger than memory.

I'm not sure anyone has answered the question, but let me make a few
comments, having done something similar in non-R code before, and use
them to motivate my earlier remarks about streaming data into a stats
widget. Most likely the algorithm maintains a matrix of accumulated
sums of products of the data - what the stats books call, IIRC,
"computing formulas." Each new data point simply adds to the matrix
elements and needn't be stored itself. In the simple case of computing
an average, for example, each data point just adds to N and to a
running sum, and you divide the two when finished. So, up to the limits
of the floating-point representation (the point where each new y^n is
too small to add a non-zero delta to the running sum), you can keep
updating the matrix elements over arbitrarily large data sets, and the
memory requirement is set by the number of matrix elements, not the
number of data points. Finally, you invert the matrix to get your
answer. The order you quote (p^2 for p variables) seems about right;
IIRC that is what I saw when fitting some image-related data to a
polynomial. You can write out the normal equations yourself, rearrange
the terms as sums over past data, and see that the coefficients come
from a matrix inverse.
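
To make that concrete, here is a minimal sketch of the
accumulate-then-solve idea in R. This illustrates the
sufficient-statistic trick, not biglm's actual code: per its help
page, biglm updates an incremental QR decomposition (Miller's AS274
algorithm), which is numerically more stable than forming X'X
directly, but the memory story is the same - the persistent state is
O(p^2) no matter how many rows stream past. The helper name
update_stats and the simulated chunks are made up for the example:

    ## Streaming least squares via accumulated sufficient statistics.
    ## Only the p x p cross-product matrix and the p-vector persist;
    ## each chunk of rows can be discarded once it is folded in.
    p   <- 3                 # number of coefficients, incl. intercept
    xtx <- matrix(0, p, p)   # running X'X
    xty <- numeric(p)        # running X'y

    ## hypothetical helper: fold one chunk into the running sums
    update_stats <- function(xtx, xty, X, y)
        list(xtx = xtx + crossprod(X), xty = xty + crossprod(X, y))

    set.seed(1)
    for (chunk in 1:100) {   # pretend each chunk arrives from disk
        X <- cbind(1, matrix(rnorm(50 * (p - 1)), 50))
        y <- X %*% c(2, -1, 0.5) + rnorm(50)
        s <- update_stats(xtx, xty, X, y)
        xtx <- s$xtx
        xty <- s$xty
    }

    ## "invert the matrix to get your answer": the normal equations
    solve(xtx, xty)          # close to c(2, -1, 0.5)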



 After reading the source code below, I still could not figure out how
 'update' implements the algorithm...

 Thanks for any light shed upon this ...

 > biglm::biglm

 function (formula, data, weights = NULL, sandwich = FALSE)
 {
     tt <- terms(formula)
     if (!is.null(weights)) {
         if (!inherits(weights, "formula"))
             stop("`weights' must be a formula")
         w <- model.frame(weights, data)[[1]]
     }
     else w <- NULL
     mf <- model.frame(tt, data)
     mm <- model.matrix(tt, mf)
     qr <- bigqr.init(NCOL(mm))
     qr <- update(qr, mm, model.response(mf), w)
     rval <- list(call = sys.call(), qr = qr, assign = attr(mm,
         "assign"), terms = tt, n = NROW(mm), names = colnames(mm),
         weights = weights)
     if (sandwich) {
         p <- ncol(mm)
         n <- nrow(mm)
         xyqr <- bigqr.init(p * (p + 1))
         xx <- matrix(nrow = n, ncol = p * (p + 1))
         xx[, 1:p] <- mm * model.response(mf)
         for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
         xyqr <- update(xyqr, xx, rep(0, n), w * w)
         rval$sandwich <- list(xy = xyqr)
     }
     rval$df.resid <- rval$n - length(qr$D)
     class(rval) <- "biglm"
     rval
 }
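
From the user side, the streaming shows up in the update() call in the
source above: biglm() builds the initial QR state from the first
chunk, and each subsequent update() folds in more rows while only the
decomposition persists. A minimal usage sketch - the four-way split of
mtcars is an arbitrary stand-in for reading chunks from disk:

    ## Fit in chunks; only one chunk of rows is in memory at a time.
    library(biglm)

    chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

    fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])  # first chunk
    for (ch in chunks[-1])
        fit <- update(fit, ch)                       # stream the rest in

    coef(fit)  # should agree with lm(mpg ~ wt + hp, data = mtcars)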
 
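A quick way to check the p^2 claim in the documentation: the only
state the fitted object carries between updates is the qr component
built in the source above (the same one the df.resid line measures via
qr$D), and its size depends on p alone:

    ## The persistent state is sized by p, not by the number of rows.
    library(biglm)

    fit <- biglm(mpg ~ wt + hp, data = mtcars)
    length(fit$qr$D)     # p: one entry per coefficient
    str(fit$qr)          # components scale with p (and p^2), not nrow
    object.size(fit$qr)  # same size however many rows streamed through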


[R] biglm: how it handles large data set?

2010-10-31 Thread noclue_


I am trying to figure out why 'biglm' can handle large data sets...

According to the R documentation: biglm creates a linear model object that uses
only p^2 memory for p variables. It can be updated with more data using
update. This allows linear regression on data sets larger than memory.

After reading the source code below, I still could not figure out how
'update'  implements the algorithm...

Thanks for any light shed upon this ... 

> biglm::biglm

function (formula, data, weights = NULL, sandwich = FALSE)
{
    tt <- terms(formula)
    if (!is.null(weights)) {
        if (!inherits(weights, "formula"))
            stop("`weights' must be a formula")
        w <- model.frame(weights, data)[[1]]
    }
    else w <- NULL
    mf <- model.frame(tt, data)
    mm <- model.matrix(tt, mf)
    qr <- bigqr.init(NCOL(mm))
    qr <- update(qr, mm, model.response(mf), w)
    rval <- list(call = sys.call(), qr = qr, assign = attr(mm,
        "assign"), terms = tt, n = NROW(mm), names = colnames(mm),
        weights = weights)
    if (sandwich) {
        p <- ncol(mm)
        n <- nrow(mm)
        xyqr <- bigqr.init(p * (p + 1))
        xx <- matrix(nrow = n, ncol = p * (p + 1))
        xx[, 1:p] <- mm * model.response(mf)
        for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
        xyqr <- update(xyqr, xx, rep(0, n), w * w)
        rval$sandwich <- list(xy = xyqr)
    }
    rval$df.resid <- rval$n - length(qr$D)
    class(rval) <- "biglm"
    rval
}
<environment: namespace:biglm>
-- 
View this message in context: 
http://r.789695.n4.nabble.com/biglm-how-it-handles-large-data-set-tp3020890p3020890.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.