On 4/10/07, Wensui Liu <[EMAIL PROTECTED]> wrote:
> Greg,
> As far as I understand, SAS is more efficient handling large data
> probably than S+/R. Do you have any idea why?
SAS originated at a time when large data sets were stored on magnetic tape and the only reasonable way to process them was sequentially. Thus most statistics procedures in SAS act as filters, processing one record at a time and accumulating summary information. In the past SAS performed a least squares fit by accumulating the crossproduct of [X:y] and then using the sweep operator to reduce that matrix. For such an approach the number of observations does not affect the amount of storage required. Adding observations just requires more time.

This works fine (although there are numerical disadvantages to this approach - try mentioning the sweep operator to an expert in numerical linear algebra - you get a blank stare) as long as the operations that you wish to perform fit into this model. Making the desired operations fit into the model is the primary reason for the awkwardness in many SAS analyses.

The emphasis in R is on flexibility and the use of good numerical techniques - not on processing large data sets sequentially. The algorithms used in R for most least squares fits generate and analyze the complete model matrix instead of summary quantities. (The algorithms in the biglm package are a compromise that works on horizontal sections of the model matrix.)

If your only criterion for comparison is the ability to work with very large data sets, performing operations that can fit into the filter model used by SAS, then SAS will be a better choice. However, you do lock yourself into a certain set of operations, and you are doing it to save memory, which is a commodity that decreases in price very rapidly.

As mentioned in other replies, for many years the majority of SAS use has been for data manipulation rather than for statistical analysis, so the filter model has been modified in later versions.
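To make the filter model described above concrete, here is a minimal sketch in plain Python (not SAS or R code, and not how either system actually implements it). It reads records of the form [X:y] one at a time, keeps only the crossproduct matrix, and then applies one common textbook form of the sweep operator to obtain the coefficients and the residual sum of squares. The function names `accumulate`, `sweep` and `least_squares` are made up for this illustration.

```python
def accumulate(records, width):
    """Pass over the records once, keeping only the width x width
    crossproduct matrix of [X:y]; storage does not grow with the
    number of observations."""
    cp = [[0.0] * width for _ in range(width)]
    for row in records:
        for i in range(width):
            for j in range(width):
                cp[i][j] += row[i] * row[j]
    return cp


def sweep(a, k):
    """Sweep the square matrix a in place on pivot k."""
    d = a[k][k]
    a[k] = [v / d for v in a[k]]           # scale the pivot row
    for i in range(len(a)):
        if i == k:
            continue
        b = a[i][k]
        # eliminate column k from row i, then store -b/d in its place
        a[i] = [v - b * w for v, w in zip(a[i], a[k])]
        a[i][k] = -b / d
    a[k][k] = 1.0 / d


def least_squares(records, p):
    """Fit y on p predictor columns; each record is (x_1, ..., x_p, y)."""
    cp = accumulate(records, p + 1)
    for k in range(p):                      # sweep on the X'X pivots only
        sweep(cp, k)
    beta = [cp[i][p] for i in range(p)]     # last column holds the coefficients
    rss = cp[p][p]                          # corner holds the residual sum of squares
    return beta, rss


# Records arrive from a generator, so they are never all in memory at once.
# The data satisfy y = 1 + 2x exactly; the first column is the intercept.
stream = ((1.0, float(x), 1.0 + 2.0 * x) for x in range(4))
beta, rss = least_squares(stream, p=2)
print(beta, rss)
```

On this toy data the fit recovers an intercept close to 1 and a slope close to 2. The sketch also shows the trade-off mentioned above: only the 3 x 3 summary matrix is retained, so anything that needs the individual rows again (residual plots, for example) requires another pass over the data.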

> On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote:
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On Behalf Of
> > > Bi-Info (http://members.home.nl/bi-info)
> > > Sent: Monday, April 09, 2007 4:23 PM
> > > To: Gabor Grothendieck
> > > Cc: Lorenzo Isella; [email protected]
> > > Subject: Re: [R] Reasons to Use R
> >
> > [snip]
> >
> > > So what's the big deal about S using files instead of memory
> > > like R. I don't get the point. Isn't there enough swap space
> > > for S? (Who cares
> > > anyway: it works, isn't it?) Or are there any problems with S
> > > and large datasets? I don't get it. You use them, Greg. So
> > > you might discuss that issue.
> > >
> > > Wilfred
> > >
> >
> > This is my understanding of the issue (not anything official).
> >
> > If you use up all the memory while in R, then the OS will start swapping
> > memory to disk, but the OS does not know what parts of memory correspond
> > to which objects, so it is entirely possible that the chunk swapped to
> > disk contains parts of different data objects, so when you need one of
> > those objects again, everything needs to be swapped back in. This is
> > very inefficient.
> >
> > S-PLUS occasionally runs into the same problem, but since it does some
> > of its own swapping to disk it can be more efficient by swapping single
> > data objects (data frames, etc.). Also, since S-PLUS is already saving
> > everything to disk, it does not actually need to do a full swap, it can
> > just look and see that a particular data frame has not been used for a
> > while, know that it is already saved on the disk, and unload it from
> > memory without having to write it to disk first.
> >
> > The g.data package for R has some of this functionality of keeping data
> > on the disk until needed.
> >
> > The better approach for large data sets is to only have some of the data
> > in memory at a time and to automatically read just the parts that you
> > need. So for big datasets it is recommended to have the actual data
> > stored in a database and use one of the database connection packages to
> > only read in the subset that you need. The SQLiteDF package for R is
> > working on automating this process for R. There are also the bigdata
> > module for S-PLUS and the biglm package for R have ways of doing some of
> > the common analyses using chunks of data at a time. This idea is not
> > new. There was a program in the late 1970s and 80s called Rummage by
> > Del Scott (I guess technically it still exists, I have a copy on a 5.25"
> > floppy somewhere) that used the approach of specify the model you wanted
> > to fit first, then specify the data file. Rummage would then figure out
> > which sufficient statistics were needed and read the data in chunks,
> > compute the sufficient statistics on the fly, and not keep more than a
> > couple of lines of the data in memory at once. Unfortunately it did not
> > have much of a user interface, so when memory was cheap and datasets
> > only medium sized it did not compete well, I guess it was just a bit too
> > ahead of its time.
> >
> > Hope this helps,
> >
> >
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > [EMAIL PROTECTED]
> > (801) 408-8111
> >
> > ______________________________________________
> > [email protected] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
> --
> WenSui Liu
> A lousy statistician who happens to know a little programming
> (http://spaces.msn.com/statcompute/blog)
