Take a look at the package filehash.  It allows you to work with large
objects in R (bigger than your RAM) by storing them on disk.  The
objects are represented by lightweight references in R and have a small
footprint in memory.  You can load all of them into an environment and access
them with the $ operator. I think filehash is more general than R.huge, which
works well only with numerical 2D data.
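
For example, a minimal sketch of the workflow (the database name, key, and
input file below are made up, so adjust them to your data):

     library(filehash)
     dbCreate("bigdata_db")                   # one-time: create the on-disk database
     db <- dbInit("bigdata_db")               # lightweight handle, tiny memory footprint
     dbInsert(db, "wave1", read.csv("newdata.csv"))   # the object is written to disk
     e <- db2env(db)                          # expose the database as an environment
     summary(e$wave1)                         # pulled from disk only when accessed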

Adrian Dragulescu


On 7/31/07, Eric Doviak <[EMAIL PROTECTED]> wrote:
>
>
> Just a note of thanks for all the help I have received. I haven't gotten a
> chance to implement any of your suggestions because I'm still trying to
> catalog all of them! Thank you so much!
>
> Just to recap (for my own benefit and to create a summary for others):
>
> Bruce Bernzweig suggested using the  R.huge  package.
>
> Ben Bolker pointed out that my original message wasn't clear and asked
> what I want to do with the data. At this point, just getting a dataset
> loaded would be wonderful, so I'm trying to trim variables (and if possible,
> I would also like to trim observations). He also provided an example of
> "vectorizing."
>
> Ted Harding suggested that I use AWK to process the data and provided the
> necessary code. He also tested his code on older hardware running GNU-Linux
> (or Unix?) and showed that AWK can process the data even when the computer
> has very little memory and processing power. Jim Holtman had similar success
> when he used Cygwin's UNIX utilities on a machine running MS Windows. They
> both used the following code:
>
>      gawk 'BEGIN{FS=","} {print $1 "," $1000 "," $1275 "," $5678}' \
>          < tempxx.txt > newdata.csv
>
> Fortunately, there is a version of GAWK for MS Windows. ... Not that I
> like MS Windows. It's just that I'm forced to use that 19th century
> operating system on the job. (After using Debian at home and happily running
> RKWard for my dissertation, returning to Windows World is downright
> depressing).
>
> Roland Rau suggested that I use a database with RSQLite and pointed out
> that RODBC can work with MS Access. He also pointed me to a sub-chapter in
> Venables and Ripley's _S Programming_ and "The Whole-Object View" pages in
> John Chambers's _Programming with Data_.
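>
> For what it's worth, a minimal RSQLite sketch might look like this (the
> database file, table name, and query are invented for illustration):
>
>      library(RSQLite)
>      con <- dbConnect(SQLite(), dbname = "sipp.sqlite")      # file-backed database
>      dbWriteTable(con, "wave1", read.csv("newdata.csv"))     # push the extract into SQLite
>      few <- dbGetQuery(con, "SELECT * FROM wave1 LIMIT 10")  # pull back only what's needed
>      dbDisconnect(con)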
>
> Greg Snow recommended  biglm  for regression analysis with data that is
> too large to fit into memory.
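>
> If I understand correctly, biglm builds the fit incrementally via update(),
> so something along these lines should work (the variable names, file, and
> chunk size are made up):
>
>      library(biglm)
>      con <- file("newdata.csv", open = "r")
>      chunk <- read.csv(con, nrows = 10000)           # first chunk, reads the header
>      fit <- biglm(y ~ x1 + x2, data = chunk)
>      repeat {
>        chunk <- try(read.csv(con, header = FALSE, nrows = 10000,
>                              col.names = names(chunk)), silent = TRUE)
>        if (inherits(chunk, "try-error")) break       # no rows left
>        fit <- update(fit, chunk)                     # fold the new chunk into the fit
>      }
>      close(con)
>      summary(fit)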
>
> Last, but not least, Peter Dalgaard pointed out that there are options
> within R. He suggests using the colClasses= argument when reading data with
> read.table() and the what= argument to scan(), so that you don't load more
> columns than necessary. He also provided the following script:
>
>      dict <- readLines("ftp://www.sipp.census.gov/pub/sipp/2004/l04puw1d.txt")
>      D.lines <- grep("^D ", dict)
>      vdict <- read.table(con <- textConnection(dict[D.lines])); close(con)
>      head(vdict)
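>
> For the colClasses= trick, the idea seems to be something like this (assuming
> the raw file has 5678 comma-separated columns, as in the gawk command above;
> adjust the count and positions to the real file):
>
>      cls <- rep("NULL", 5678)             # "NULL" tells read.csv() to skip a column
>      cls[c(1, 1000, 1275, 5678)] <- NA    # NA lets R guess the class of kept columns
>      dat <- read.csv("tempxx.txt", header = FALSE, colClasses = cls)  # header = TRUE if the file has one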
>
> I'll try these solutions and report back on my success.
>
> Thanks again!
> - Eric
>

