Re: [R] Questions about biglm
The idea of the biglm function is to keep only part of the data in memory at a time. You read in a chunk of the data and run biglm on it, delete the chunk from memory, load the next chunk and use update to fold it into the analysis, delete that, read in the next chunk, run update again, and repeat until you have processed all the data. The result is then the same as if you had run lm on the entire dataset (with possible slight differences due to rounding). The bigglm function, or code from other packages (SQLiteDF, for one), can automate this somewhat.

The VIF code below uses model.matrix, which returns the x matrix for the analysis when used with an lm object. Since biglm is built on the idea of never having all the data in memory at once, I would be very surprised if model.matrix worked with biglm objects, so that code is unlikely to work as is.

One approach is to run the VIFs and other diagnostics on a subset of the data (a random sample, or a stratified random sample) that fits easily into memory, then, after making decisions about the model based on those diagnostics, run the final model with biglm to get the precise results from the full data set. You can repeat the diagnostics on a couple of different random subsets to confirm the decisions made.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of dobomode
Sent: Wednesday, February 18, 2009 9:34 PM
To: r-help@r-project.org
Subject: [R] Questions about biglm

[original message quoted in full below]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
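Greg's chunk-by-chunk workflow can be sketched as follows. This is a minimal sketch, not a definitive recipe: the file name, column names, formula, and chunk size are all hypothetical, and the CSV file is assumed to have a header row.

```r
# Sketch of the chunked workflow described above: fit biglm on the
# first chunk, then fold each later chunk in with update().
# "mydata.csv", the variable names, and the chunk size are made up.
library(biglm)

chunk.size <- 100000
con <- file("mydata.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]

# The first chunk initializes the fit
chunk <- read.csv(con, nrows = chunk.size, header = FALSE,
                  col.names = header)
fit <- biglm(y ~ x1 + x2, data = chunk)

# Remaining chunks are folded in with update(), then discarded
repeat {
  chunk <- try(read.csv(con, nrows = chunk.size, header = FALSE,
                        col.names = header), silent = TRUE)
  if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
  fit <- update(fit, chunk)
}
close(con)

summary(fit)  # same answer as lm() on the full file, up to rounding
```

Only the current chunk and the (small) biglm object are ever in memory at once, which is what makes this work on files larger than RAM.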
Re: [R] Questions about biglm
Dear Greg and Dobo,

The vif() function in the car package computes VIFs (and generalized VIFs) from the covariance matrix of the coefficients. I'm not sure whether it will work directly on objects produced by biglm(), but if not, it should be easy to adapt.

I hope this helps,
John

--
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Greg Snow
Sent: February-19-09 11:35 AM
To: dobomode; r-help@r-project.org
Subject: Re: [R] Questions about biglm

[quoted text trimmed; see Greg's reply above and the original message below]
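Following John's description, one way to adapt the VIF computation is to work directly from the coefficient covariance matrix rather than from model.matrix. This is a hedged sketch: it assumes the fitted object answers vcov() (check whether your version of biglm provides a vcov method) and that the intercept is the first coefficient.

```r
# Sketch: VIFs from the coefficient covariance matrix, in the spirit
# of what John describes for car::vif().  Assumes vcov() works on the
# fitted object and that the intercept is coefficient 1.  For a
# one-df term, VIF_j = [R^{-1}]_jj, where R is the correlation matrix
# of the coefficient covariances with the intercept removed.
vif.from.vcov <- function(object) {
  V <- vcov(object)[-1, -1, drop = FALSE]  # drop intercept row/column
  diag(solve(cov2cor(V)))
}

# Sanity check on an ordinary lm fit:
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif.from.vcov(fit)  # should agree with car::vif(fit) for 1-df terms
```

The advantage for biglm is that this needs only the small p-by-p covariance matrix, never the full data.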
[R] Questions about biglm
Hello folks,

I am very excited to have discovered R and have been exploring its capabilities. R's regression models are of great interest to me, as my company is in the business of running thousands of linear regressions on large datasets. I am using biglm to run linear regressions on datasets that are as large as several GB. I have been pleasantly surprised that biglm runs the regressions extremely fast (one regression may take minutes in SPSS vs. seconds in R). I have been trying to wrap my head around biglm and have a couple of questions.

1. How can I get VIFs (Variance Inflation Factors) using biglm? I was able to get VIFs from the regular lm function using this piece of code I found through Google, but have not been able to adapt it to work with biglm. Has anyone been successful in this?

vif.lm <- function(object, ...) {
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
  nam <- names(coef(object))
  if (k <- match("(Intercept)", nam, nomatch = FALSE)) {
    v1 <- diag(V)[-k]
    v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
    nam <- nam[-k]
  } else {
    v1 <- diag(V)
    v2 <- diag(Vi)
    warning("No intercept term detected. Results may surprise.")
  }
  structure(v1 * v2, names = nam)
}

2. How reliable / stable is biglm's update() function? I was experimenting with running regressions on individual chunks of my large dataset, but the coefficients I got were different from those obtained by running biglm on the whole dataset. Am I mistaken in thinking that update() is intended to run regressions in chunks (when memory becomes an issue with datasets that are too large) and produce identical results to running a single regression on the dataset as a whole?

Thanks!
Dobo
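Regarding question 2: the chunked fit should reproduce lm()'s coefficients up to rounding, but only if every chunk passes through update() of one running biglm fit, rather than each chunk being fit separately. A small self-contained check, using simulated data:

```r
# Check that biglm() + update() over row chunks matches lm() on the
# full data.  The data here are simulated; any data frame split into
# row chunks should behave the same way.
library(biglm)
set.seed(1)
dat <- data.frame(y = rnorm(300), x1 = rnorm(300), x2 = rnorm(300))

# One running fit, updated chunk by chunk
fit.big <- biglm(y ~ x1 + x2, data = dat[1:100, ])
fit.big <- update(fit.big, dat[101:200, ])
fit.big <- update(fit.big, dat[201:300, ])

# Single fit on all rows at once
fit.lm <- lm(y ~ x1 + x2, data = dat)

all.equal(coef(fit.big), coef(fit.lm))  # should be TRUE, up to rounding
```

Fitting three separate biglm models on the three chunks would instead give three different sets of coefficients, which may explain the discrepancy Dobo observed.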