I think SAS was developed at a time when computer memory was much smaller than it is now, and the legacy of that is its more efficient use of computer resources.

On 4/10/07, Wensui Liu <[EMAIL PROTECTED]> wrote:
> Greg,
> As far as I understand, SAS is more efficient handling large data
> probably than S+/R. Do you have any idea why?
>
> On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote:
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On Behalf Of
> > > Bi-Info (http://members.home.nl/bi-info)
> > > Sent: Monday, April 09, 2007 4:23 PM
> > > To: Gabor Grothendieck
> > > Cc: Lorenzo Isella; [email protected]
> > > Subject: Re: [R] Reasons to Use R
> > >
> > [snip]
> >
> > > So what's the big deal about S using files instead of memory
> > > like R. I don't get the point. Isn't there enough swap space
> > > for S? (Who cares
> > > anyway: it works, isn't it?) Or are there any problems with S
> > > and large datasets? I don't get it. You use them, Greg. So
> > > you might discuss that issue.
> > >
> > > Wilfred
> > >
> > >
> >
> > This is my understanding of the issue (not anything official).
> >
> > If you use up all the memory while in R, then the OS will start swapping
> > memory to disk, but the OS does not know what parts of memory correspond
> > to which objects, so it is entirely possible that the chunk swapped to
> > disk contains parts of different data objects, so when you need one of
> > those objects again, everything needs to be swapped back in. This is
> > very inefficient.
> >
> > S-PLUS occasionally runs into the same problem, but since it does some
> > of its own swapping to disk it can be more efficient by swapping single
> > data objects (data frames, etc.). Also, since S-PLUS is already saving
> > everything to disk, it does not actually need to do a full swap, it can
> > just look and see that a particular data frame has not been used for a
> > while, know that it is already saved on the disk, and unload it from
> > memory without having to write it to disk first.
> >
> > The g.data package for R has some of this functionality of keeping data
> > on the disk until needed.
> >
> > The better approach for large data sets is to only have some of the data
> > in memory at a time and to automatically read just the parts that you
> > need. So for big datasets it is recommended to have the actual data
> > stored in a database and use one of the database connection packages to
> > only read in the subset that you need. The SQLiteDF package for R is
> > working on automating this process for R. There are also the bigdata
> > module for S-PLUS and the biglm package for R, which have ways of doing
> > some of the common analyses using chunks of data at a time. This idea is
> > not new. There was a program in the late 1970s and 80s called Rummage by
> > Del Scott (I guess technically it still exists, I have a copy on a 5.25"
> > floppy somewhere) that used the approach of specifying the model you
> > wanted to fit first, then specifying the data file. Rummage would then
> > figure out which sufficient statistics were needed and read the data in
> > chunks, compute the sufficient statistics on the fly, and not keep more
> > than a couple of lines of the data in memory at once. Unfortunately it
> > did not have much of a user interface, so when memory was cheap and
> > datasets only medium sized it did not compete well, I guess it was just
> > a bit too ahead of its time.
> >
> > Hope this helps,
> >
> >
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > [EMAIL PROTECTED]
> > (801) 408-8111
> >
> > ______________________________________________
> > [email protected] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
> --
> WenSui Liu
> A lousy statistician who happens to know a little programming
> (http://spaces.msn.com/statcompute/blog)
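
For what it's worth, here is a minimal sketch of the chunked idea Greg describes above for Rummage: read the data a few rows at a time, accumulate only the sufficient statistics, and fit the model from those at the end. It is written in Python purely for illustration and covers only simple linear regression with one predictor. The function names, the chunk size, and the small in-memory CSV are all assumptions of mine, not how Rummage, biglm, or the S-PLUS bigdata module actually work.

```python
import csv
import io
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield lists of (x, y) float pairs, at most chunk_size rows at a time."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    while True:
        chunk = [(float(x), float(y)) for x, y in islice(reader, chunk_size)]
        if not chunk:
            return
        yield chunk


def fit_line_in_chunks(lines, chunk_size=2):
    """Fit y = a + b*x while keeping only five running sums in memory."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in read_chunks(lines, chunk_size):
        for x, y in chunk:
            # These five sums are the sufficient statistics for the fit.
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data standing in for a file too large to load at once.
DATA = "x,y\n1,3\n2,5\n3,7\n4,9\n5,11\n"
intercept, slope = fit_line_in_chunks(io.StringIO(DATA), chunk_size=2)
print(intercept, slope)
```

The five made-up points lie exactly on y = 1 + 2x, so the fit recovers an intercept of 1 and a slope of 2 without ever holding more than two rows at once. With a real file you would pass an open file handle instead of the io.StringIO object.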
