On Wed, 4 Mar 2009, Vadlamani, Satish {FLNA} wrote:
Hi:
Sorry if this is a double post. I posted the same thing this morning and did
not see it.
I just started using R and am asking the following questions so that I can plan
for the future when I may have to analyze volume data.
1) What are the limitations of R when it comes to handling large datasets?
Say, for example, a data frame with 200M rows and 15 columns (between 1.5 and 2 GB in size)? Will the limitation be based on the specifications of
the hardware or R itself?
It depends a lot on what you want to do. The default situation in R is that
all the data are loaded into memory, in which case the rule of thumb is that
you want data sets no larger than 1/3 of memory. If you have, say, a system
with 8 GB of memory and a 64-bit version of R, you should be OK.
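As a rough illustration of that rule of thumb, here is a minimal sketch in R. The function names `fits_in_memory` and `min_ram_gb` are made up for this example, and the factor of 3 is only the heuristic stated above, not a hard limit imposed by R.

```r
# Rule of thumb: an in-memory data set should be at most about 1/3 of RAM.
# Both helper names are hypothetical, introduced only for illustration.
fits_in_memory <- function(data_gb, ram_gb) {
  data_gb <= ram_gb / 3
}

# Smallest amount of RAM (in GB) the heuristic suggests for a given data size.
min_ram_gb <- function(data_gb) {
  3 * data_gb
}

fits_in_memory(2, 8)   # TRUE: a 2 GB data set on an 8 GB machine
fits_in_memory(2, 4)   # FALSE: the same data on a 4 GB machine
min_ram_gb(2)          # 6
```

The actual footprint depends on the column types and on how many temporary copies a given analysis makes, so the result is best treated as a planning estimate.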
It is often possible to work with much larger data sets than this; you just
need to arrange for the whole thing not to be loaded simultaneously. The right
strategy depends on the problem.
For example, linear and generalized linear models on large data sets can be
fitted with the biglm package. The various database interface packages and the
packages for netCDF and HDF5 allow subsets of a data set to be loaded easily.
Packages such as bigmemory and ff allow at least some operations to be carried
out on file-backed data objects.
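One way to avoid loading everything at once is to read a file in chunks and keep only a running summary. The sketch below uses base R only. The file, the column names, and `chunk_size` are invented for the example, and it computes a simple mean rather than fitting a model.

```r
# Write a small example file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
set.seed(1)
write.csv(data.frame(x = rnorm(1000), y = rnorm(1000)), path, row.names = FALSE)

chunk_size <- 250  # rows per chunk (hypothetical; tune to available memory)

con <- file(path, open = "r")
# Read the header line once; later reads continue from the current position.
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

n <- 0      # rows seen so far
total <- 0  # running sum of column x

repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, col.names = header),
    error = function(e) NULL  # read.csv signals an error when no lines remain
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  n <- n + nrow(chunk)
  total <- total + sum(chunk$x)
  if (nrow(chunk) < chunk_size) break  # a short chunk means end of file
}
close(con)

running_mean <- total / n
```

Only one chunk is held in memory at a time. The same loop shape can feed a model that accepts data incrementally; with biglm, for instance, the first chunk would go to `biglm()` and later chunks to `update()`.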
2) Is R compiled as 32-bit or 64-bit (on, say, Windows and AIX)?
On AIX, 64-bit. On Windows, currently only 32-bit, although there is work
towards a 64-bit version.
4) Should I be looking at SAS also only for this reason (we do have SAS
in-house but the problem is that I am still not sure what we have license for,
etc.)
I would guess that it would be cheaper to buy hardware on which the problem can
be solved in R than to buy a SAS license (last time I looked, suitable
rack-mount Linux boxes were under USD3000). If you already have SAS available
it would be worth looking at it. For some large-data problems it will be faster
or easier to use, but not for all.
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
[email protected] University of Washington, Seattle