> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of
> Bi-Info (http://members.home.nl/bi-info)
> Sent: Monday, April 09, 2007 4:23 PM
> To: Gabor Grothendieck
> Cc: Lorenzo Isella; [email protected]
> Subject: Re: [R] Reasons to Use R
[snip]

> So what's the big deal about S using files instead of memory
> like R. I don't get the point. Isn't there enough swap space
> for S? (Who cares anyway: it works, isn't it?) Or are there
> any problems with S and large datasets? I don't get it. You
> use them, Greg. So you might discuss that issue.
>
> Wilfred

This is my understanding of the issue (not anything official).

If you use up all the memory while in R, the OS will start swapping memory to disk, but the OS does not know which parts of memory correspond to which objects, so it is entirely possible that a chunk swapped to disk contains parts of several different data objects. When you need one of those objects again, everything has to be swapped back in. This is very inefficient.

S-PLUS occasionally runs into the same problem, but since it does some of its own swapping to disk, it can be more efficient by swapping single data objects (data frames, etc.) at a time. Also, since S-PLUS already saves everything to disk, it does not need to do a full swap: it can notice that a particular data frame has not been used for a while, know that it is already saved on disk, and unload it from memory without having to write it out first. The g.data package for R provides some of this functionality, keeping data on disk until it is needed (see the first sketch at the end of this message).

The better approach for large data sets is to hold only some of the data in memory at a time and automatically read in just the parts that you need. So for big datasets it is recommended to store the actual data in a database and use one of the database connection packages to read in only the subset you need (second sketch below). The SQLiteDF package for R is working on automating this process. The bigdata module for S-PLUS and the biglm package for R also have ways of doing some of the common analyses using chunks of data at a time (third sketch below).

This idea is not new. There was a program in the late 1970s and 80s called Rummage, by Del Scott (I guess technically it still exists; I have a copy on a 5.25" floppy somewhere), that took the approach of specifying the model you wanted to fit first, then specifying the data file. Rummage would then figure out which sufficient statistics were needed, read the data in chunks, compute the sufficient statistics on the fly, and never keep more than a couple of lines of the data in memory at once (fourth sketch below). Unfortunately it did not have much of a user interface, so when memory became cheap and datasets were only medium sized it did not compete well. I guess it was just a bit ahead of its time.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111
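
First sketch: the keep-on-disk-until-needed idea in miniature, using only base R (the g.data package wraps this kind of mechanism more conveniently; the file and object names here are made up for the example):

    big <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
    save(big, file = "big.rda")   # the object now lives on disk
    rm(big)                       # ...and is dropped from memory

    ## delayedAssign() defers the load until 'big' is next touched
    delayedAssign("big", local({ load("big.rda"); big }))
    mean(big$x)                   # first use pulls it back from disk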
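
Second sketch: reading just the needed subset from a database rather than loading the whole dataset. This assumes an SQLite file big.db containing a table called patients (all names invented for the example):

    library(RSQLite)
    con <- dbConnect(SQLite(), dbname = "big.db")
    ## pull in only the rows and columns the analysis needs
    piece <- dbGetQuery(con,
        "SELECT age, weight FROM patients WHERE clinic = 5")
    summary(lm(weight ~ age, data = piece))
    dbDisconnect(con)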
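
Third sketch: a chunk-at-a-time regression with biglm. The file name, chunk size, and variable names are assumptions, but the pattern of biglm() on the first chunk followed by update() on later chunks is how the package is used:

    library(biglm)
    con <- file("big.csv", open = "r")
    first <- read.csv(con, nrows = 1000)          # header plus first chunk
    fit <- biglm(y ~ x1 + x2, data = first)
    repeat {
        chunk <- try(read.csv(con, header = FALSE, nrows = 1000,
                              col.names = names(first)), silent = TRUE)
        if (inherits(chunk, "try-error")) break   # no lines left to read
        fit <- update(fit, chunk)                 # fold in the next chunk
    }
    close(con)
    summary(fit)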
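
Fourth sketch: the Rummage idea in miniature. For a linear model the only sufficient statistics needed are the cross-products X'X and X'y, so the data can be streamed through in chunks and each chunk discarded once its contribution is accumulated (again, the file and the column names x and y are made up):

    con <- file("big.csv", open = "r")
    chunk <- read.csv(con, nrows = 1000)          # header plus first chunk
    hdr <- names(chunk)
    xtx <- 0
    xty <- 0
    repeat {
        X <- cbind(1, chunk$x)                    # intercept plus one predictor
        xtx <- xtx + crossprod(X)                 # accumulate X'X
        xty <- xty + crossprod(X, chunk$y)        # accumulate X'y
        chunk <- try(read.csv(con, header = FALSE, nrows = 1000,
                              col.names = hdr), silent = TRUE)
        if (inherits(chunk, "try-error")) break   # end of file
    }
    close(con)
    solve(xtx, xty)                               # the least-squares coefficients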
