Re: [R] the large dataset problem

Eric Doviak Tue, 31 Jul 2007 05:27:20 -0700

Just a note of thanks for all the help I have received. I haven't gotten a 
chance to implement any of your suggestions because I'm still trying to catalog 
all of them! Thank you so much!


Just to recap (for my own benefit and to create a summary for others):

Bruce Bernzweig suggested using the  R.huge  package.

Ben Bolker pointed out that my original message wasn't clear and asked what I 
want to do with the data. At this point, just getting a dataset loaded would be 
wonderful, so I'm trying to trim variables (and if possible, I would also like 
to trim observations). He also provided an example of "vectorizing."

Ted Harding suggested that I use AWK to process the data and provided the 
necessary code. He also tested his code on older hardware running GNU/Linux (or 
Unix?) and showed that AWK can process the data even when the computer has very 
little memory and processing power. Jim Holtman had similar success when he 
used Cygwin's UNIX utilities on a machine running MS Windows. They both used 
the following code:

     gawk 'BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," $(5678)}' \
       < tempxx.txt > newdata.csv

Fortunately, there is a version of GAWK for MS Windows. ... Not that I like MS 
Windows. It's just that I'm forced to use that 19th century operating system on 
the job. (After using Debian at home and happily running RKWard for my 
dissertation, returning to Windows World is downright depressing). 
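
For anyone who cannot install GAWK, the same column trimming can be sketched in 
base R by reading the file a block of lines at a time. This is only a rough 
illustration, not anything posted in the thread: the function name 
trim_columns, the chunk size and the tiny demonstration file are all made up, 
and (like the one-line AWK command) it assumes no field contains a quoted comma.

```r
# Hypothetical helper (not from the thread): copy only the selected
# comma-separated columns of a large file to a new file, holding at most
# `chunk_lines` lines in memory at any time.
trim_columns <- function(infile, outfile, keep, chunk_lines = 10000L) {
  input  <- file(infile, open = "r")
  output <- file(outfile, open = "w")
  on.exit({ close(input); close(output) })
  repeat {
    lines <- readLines(input, n = chunk_lines)
    if (length(lines) == 0L) break            # end of file reached
    fields  <- strsplit(lines, ",", fixed = TRUE)
    trimmed <- vapply(fields,
                      function(f) paste(f[keep], collapse = ","),
                      character(1))
    writeLines(trimmed, output)
  }
  invisible(outfile)
}

# Tiny made-up file standing in for the real data
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,d", "1,2,3,4", "5,6,7,8"), src)

trim_columns(src, dst, keep = c(1, 4))
result <- readLines(dst)
print(result)
```

With the real file, keep = c(1, 1000, 1275, 5678) would mirror the columns 
picked out by the AWK command. Observations could be trimmed in the same loop 
by subsetting the lines before they are written.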

Roland Rau suggested that I use a database with RSQLite and pointed out that 
RODBC can work with MS Access. He also pointed me to a sub-chapter in Venables 
and Ripley's _S Programming_ and "The Whole-Object View" pages in John 
Chambers's _Programming with Data_. 
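
To make the database idea concrete, here is a minimal sketch of what it might 
look like. It assumes the DBI and RSQLite packages are installed; the table 
name "survey", its columns and the temporary database file are invented for 
illustration and are not Roland's code.

```r
# Assumes the DBI and RSQLite packages are installed.
library(DBI)

db_file <- tempfile(fileext = ".sqlite")   # hypothetical database location
con <- dbConnect(RSQLite::SQLite(), db_file)

# Made-up stand-in for the big dataset; in practice the table would be
# filled once, possibly in several pieces.
survey <- data.frame(id     = 1:5,
                     income = c(10, 20, 30, 40, 50),
                     age    = c(25, 35, 45, 55, 65))
dbWriteTable(con, "survey", survey)

# Only the requested columns and rows are brought into R's memory
subset_df <- dbGetQuery(con, "SELECT id, income FROM survey WHERE age > 40")
dbDisconnect(con)

print(subset_df)
```

The point of the design is that the full table lives on disk and the SQL query 
does the trimming of both variables and observations before anything reaches R.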

Greg Snow recommended  biglm  for regression analysis with data that is too 
large to fit into memory.
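
As a rough sketch of how that might work (assuming the biglm package is 
installed, and using the built-in mtcars data split into four pieces as a 
stand-in for chunks read from a large file), the model is fitted on the first 
chunk and then updated with each remaining one:

```r
# Assumes the biglm package is installed.
library(biglm)

# Pretend each element of `chunks` is a block of rows read from disk;
# mtcars is used here only as a small stand-in.
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

# Fit on the first chunk, then fold in the rest one at a time
fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}

# For comparison only: the ordinary fit on the full data
full_fit <- lm(mpg ~ wt + hp, data = mtcars)

print(coef(fit))
```

Only one chunk needs to be in memory at a time, and the coefficients should 
agree with those from lm() on the complete data.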

Last, but not least, Peter Dalgaard pointed out that there are options within 
R. He suggests using the colClasses= argument of read.table() when "reading" 
data and the what= argument of scan() when "scanning" data, so that you don't 
load more columns than necessary. He also provided the following script: 

     dict <- readLines("ftp://www.sipp.census.gov/pub/sipp/2004/l04puw1d.txt")
     D.lines <- grep("^D ", dict)
     vdict <- read.table(con <- textConnection(dict[D.lines])); close(con)
     head(vdict) 
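
To illustrate the colClasses= and what= suggestions, here is a small sketch on 
a made-up three-column file (the file and its column names are invented for the 
example). Setting a column's class to "NULL" in colClasses=, or giving NULL as 
the corresponding element of the what= list, tells R to skip that field.

```r
# Made-up demonstration file with a column we do not want
src <- tempfile(fileext = ".csv")
writeLines(c("id,name,income",
             "1,alice,10.5",
             "2,bob,20.25"), src)

# read.csv(): "NULL" (as a string) drops the second column while reading
df <- read.csv(src, colClasses = c("integer", "NULL", "numeric"))

# scan(): a NULL list element skips the second field of every record;
# skip = 1 jumps over the header line
cols <- scan(src,
             what  = list(id = integer(), NULL, income = numeric()),
             sep   = ",",
             skip  = 1,
             quiet = TRUE)

print(df)
print(cols$income)
```

For a file with thousands of columns, the colClasses vector could be built 
with rep("NULL", n) and then filled in only at the positions to keep.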

I'll try these solutions and report back on my success.

Thanks again!
- Eric

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
