Re: [R] Questions about biglm

2009-02-19 Thread Greg Snow
The idea of the biglm function is to have only part of the data in memory at a
time.  You read in one chunk of the data and run biglm on it, delete that chunk
from memory, load the next chunk and use update to fold it into the analysis,
delete that, read in the next chunk, run update again, and repeat until you
have processed all the data.  The result is then the same as if you had run lm
on the entire dataset (up to slight rounding differences).  The bigglm function,
or code from other packages (SQLiteDF for one), can automate this a bit more.
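
For concreteness, here is a minimal sketch of that loop.  The file name
(bigdata.csv), the column names (y, x1, x2), the formula, and the chunk size
are all made up for illustration; adjust them to your own data.

library(biglm)

file <- "bigdata.csv"     # hypothetical file
form <- y ~ x1 + x2       # hypothetical model
size <- 100000            # rows per chunk

## first chunk initialises the fit (and supplies the column names)
chunk <- read.csv(file, nrows = size)
fit   <- biglm(form, data = chunk)
cols  <- names(chunk)
skip  <- size + 1         # +1 to also skip the header line

## fold the remaining chunks in with update(), overwriting each chunk as we go
repeat {
  chunk <- tryCatch(read.csv(file, header = FALSE, col.names = cols,
                             skip = skip, nrows = size),
                    error = function(e) NULL)  # read.csv errors past end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit  <- update(fit, chunk)
  skip <- skip + nrow(chunk)
}

summary(fit)   # same coefficients as lm() on the full data, up to rounding

A quick cross-check of that last claim on a small built-in dataset:

f <- mpg ~ disp + hp + wt
b <- update(biglm(f, data = mtcars[1:16, ]), mtcars[17:32, ])
all.equal(coef(b), coef(lm(f, data = mtcars)))   # TRUE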

The code for VIF below uses the model.matrix command, which returns the x
matrix of the analysis when applied to an lm object.  Since biglm is built
around the idea of never having all the data in memory at once, I would be
very surprised if model.matrix worked with biglm objects, so that code is
unlikely to work as is.
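
For instance, on an ordinary in-memory lm fit:

fit <- lm(mpg ~ disp + wt, data = mtcars)
head(model.matrix(fit))   # the design (x) matrix the fit was based on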

One approach is to compute VIFs and other diagnostics on a subset of the data
(a random sample, or a stratified random sample) that fits easily in memory,
then, after making decisions about the model based on those diagnostics, run
the final model with biglm to get the precise results from the full data set.
You can repeat the diagnostics on a couple of different random subsets to
confirm the decisions made.
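
As a sketch of that workflow (here 'sub' stands for a random subset of the
full data that already fits in memory, and the formula is made up):

fit.sub <- lm(y ~ x1 + x2, data = sub)   # ordinary lm on the manageable subset

plot(fit.sub)      # the usual residual diagnostics
vif.lm(fit.sub)    # VIFs, using the vif.lm() function quoted below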

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111


 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-
 project.org] On Behalf Of dobomode
 Sent: Wednesday, February 18, 2009 9:34 PM
 To: r-help@r-project.org
 Subject: [R] Questions about biglm
 
 Hello folks,
 
 I am very excited to have discovered R and have been exploring its
 capabilities. R's regression models are of great interest to me as my
 company is in the business of running thousands of linear regressions
 on large datasets.
 
 I am using biglm to run linear regressions on datasets as large as
 several GB. I have been pleasantly surprised at how fast biglm runs the
 regressions (one regression may take minutes in SPSS vs. seconds in R).
 
 I have been trying to wrap my head around biglm and have a couple of
 questions.
 
 1. How can I get VIFs (variance inflation factors) using biglm? I was
 able to get VIFs from the regular lm function using this piece of
 code I found through Google, but have not been able to adapt it to
 work with biglm. Has anyone been successful with this?
 
 vif.lm <- function(object, ...) {
   ## unscaled covariance matrix of the coefficient estimates
   V <- summary(object)$cov.unscaled
   ## cross-product of the design (x) matrix
   Vi <- crossprod(model.matrix(object))
   nam <- names(coef(object))
   ## drop the intercept, if the model has one
   if (k <- match("(Intercept)", nam, nomatch = FALSE)) {
     v1 <- diag(V)[-k]
     v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
     nam <- nam[-k]
   } else {
     v1 <- diag(V)
     v2 <- diag(Vi)
     warning("No intercept term detected.  Results may surprise.")
   }
   structure(v1 * v2, names = nam)
 }
 
 2. How reliable / stable is biglm's update() function? I was
 experimenting with running regressions on individual chunks of my
 large dataset, but the coefficients I got were different from those
 obtained by running biglm on the whole dataset. Am I mistaken in
 thinking that update() is intended to run regressions in chunks
 (when memory becomes an issue with datasets that are too large) and
 to produce identical results to running a single regression on the
 dataset as a whole?
 
 Thanks!
 
 Dobo
 


Re: [R] Questions about biglm

2009-02-19 Thread John Fox
Dear Greg and Dobo,

The vif() function in the car package computes VIFs (and generalized VIFs)
from the covariance matrix of the coefficients; I'm not sure whether it will
work directly on objects produced by biglm(), but if not it should be easy to
adapt it to do so.
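
For example, on an ordinary lm fit (whether vif() accepts a biglm object
directly is untested here):

library(car)

fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
vif(fit)   # VIFs computed from the coefficient covariance matrix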

I hope this helps,
 John

--
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox



[R] Questions about biglm

2009-02-18 Thread dobomode
Hello folks,

I am very excited to have discovered R and have been exploring its
capabilities. R's regression models are of great interest to me as my
company is in the business of running thousands of linear regressions
on large datasets.

I am using biglm to run linear regressions on datasets as large as
several GB. I have been pleasantly surprised at how fast biglm runs the
regressions (one regression may take minutes in SPSS vs. seconds in R).

I have been trying to wrap my head around biglm and have a couple of
questions.

1. How can I get VIFs (variance inflation factors) using biglm? I was
able to get VIFs from the regular lm function using this piece of
code I found through Google, but have not been able to adapt it to
work with biglm. Has anyone been successful with this?

vif.lm <- function(object, ...) {
  ## unscaled covariance matrix of the coefficient estimates
  V <- summary(object)$cov.unscaled
  ## cross-product of the design (x) matrix
  Vi <- crossprod(model.matrix(object))
  nam <- names(coef(object))
  ## drop the intercept, if the model has one
  if (k <- match("(Intercept)", nam, nomatch = FALSE)) {
    v1 <- diag(V)[-k]
    v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
    nam <- nam[-k]
  } else {
    v1 <- diag(V)
    v2 <- diag(Vi)
    warning("No intercept term detected.  Results may surprise.")
  }
  structure(v1 * v2, names = nam)
}

2. How reliable / stable is biglm's update() function? I was
experimenting with running regressions on individual chunks of my
large dataset, but the coefficients I got were different from those
obtained by running biglm on the whole dataset. Am I mistaken in
thinking that update() is intended to run regressions in chunks
(when memory becomes an issue with datasets that are too large) and
to produce identical results to running a single regression on the
dataset as a whole?

Thanks!

Dobo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.