Berton Gunter wrote:
> Thanks, Duncan.
>
> I would say that your clarification defines what I mean by "incapable of
> dealing with large data sets." To wit: one must handcraft solutions
> working a chunk at a time versus having some sort of built-in virtual
> memory procedure handle it automatically.
The general point of view in R is that that's the job of the operating
system. R is currently artificially limited to 4 billion element
vectors, but the elements could each be large. Presumably some future
release of R will switch to 64 bit indexing, and then R will be able
to handle petabytes of data if the operating system can provide the
virtual memory.

If you want to handle really big datasets transparently now, I think
S-PLUS has something along the lines you are talking about, but I
haven't tried it.

Duncan Murdoch

> But as Andy Liaw suggested to me off list, maybe I fantasize the
> existence of any software that could deal with, say, terabytes or
> petabytes of data without such handcrafting. My son, the computer
> scientist, tells me that the astronomers and physicists he works with
> routinely produce such massive data sets, as do imaging folks of all
> stripes, I would imagine.
>
> I wonder if our 20th century statistical modeling paradigms are
> increasingly out of step with such 21st century massive data
> realities... But that is a much more vexing issue that does not
> belong here.
>
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
>
>
>> -----Original Message-----
>> From: Duncan Murdoch [mailto:[EMAIL PROTECTED]
>> Sent: Friday, March 03, 2006 12:31 PM
>> To: Berton Gunter
>> Cc: [EMAIL PROTECTED]; 'R-Help'
>> Subject: Re: [R] memory once again
>>
>> On 3/3/2006 2:42 PM, Berton Gunter wrote:
>>
>>> What you propose is not really a solution, as even if your data set
>>> didn't break the modified precision, another would. And of course,
>>> there is a price to be paid for reduced numerical precision.
>>>
>>> The real issue is that R's current design is incapable of dealing
>>> with data sets larger than what can fit in physical memory (expert
>>> comment/correction?).
>>
>> It can deal with big data sets, just not nearly as conveniently as
>> it deals with ones that fit in memory. The most straightforward way
>> is probably to put them in a database, and use RODBC or one of the
>> database-specific packages to read the data in blocks. (You could
>> also leave the data in a flat file and read it a block at a time
>> from there, but the database is probably worth the trouble: other
>> people have done the work involved in sorting, selecting, etc.)
>>
>> The main problem you'll run into is that almost none of the R
>> functions know about databases, so you'll end up doing a lot of work
>> to rewrite the algorithms to work one block at a time, or on a
>> random sample of data, or whatever.
>>
>> The original poster didn't say what he wanted to do with his data,
>> but if he only needs to work with a few variables at a time, he can
>> easily fit an 820,000 x N dataframe in memory, for small values of
>> N. Reading it in from a database would be easy.
>>
>> Duncan Murdoch
>>
>>> My understanding is that there is no way to change this without a
>>> fundamental redesign of R. This means that you must either live
>>> with R's limitations or use other software for "large" data sets.
>>>
>>> -- Bert Gunter
>>> Genentech Non-Clinical Statistics
>>> South San Francisco, CA
>>>
>>> "The business of the statistician is to catalyze the scientific
>>> learning process." - George E. P. Box
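[For concreteness, a minimal sketch of the block-at-a-time approach
Duncan describes above, using RODBC's sqlFetch()/sqlFetchMore(). The
DSN "bigdata", the table name "survey", and the block size of 10,000
rows are made-up placeholders, and the per-block processing is left as
a stub:

  library(RODBC)

  ch <- odbcConnect("bigdata")                  # hypothetical ODBC data source

  block <- sqlFetch(ch, "survey", max = 10000)  # first block of 10,000 rows
  repeat {
      ## process 'block' here: accumulate sums, fit on a subsample, etc.

      block <- sqlFetchMore(ch, max = 10000)    # fetch the next block
      if (!is.data.frame(block)) break          # no rows left
  }

  odbcClose(ch)

The same pattern works on a flat file by opening a connection with
file(..., open = "r") and calling read.table() on it repeatedly with
nrows set to the block size; each call continues where the previous
one stopped.]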
>>>
>>>
>>>> -----Original Message-----
>>>> From: [EMAIL PROTECTED]
>>>> [mailto:[EMAIL PROTECTED] On Behalf Of Dimitri Joe
>>>> Sent: Friday, March 03, 2006 11:28 AM
>>>> To: R-Help
>>>> Subject: [R] memory once again
>>>>
>>>> Dear all,
>>>>
>>>> A few weeks ago, I asked this list why small Stata files became
>>>> huge R files. Thomas Lumley said it was because "Stata uses
>>>> single-precision floating point by default and can use 1-byte and
>>>> 2-byte integers. R uses double precision floating point and
>>>> four-byte integers." And it seemed I couldn't do anything about it.
>>>>
>>>> Is it true? I mean, isn't there a (more or less simple) way to
>>>> change how R stores data (maybe by changing the source code and
>>>> compiling it)?
>>>>
>>>> The reason why I insist on this point is that I am trying to work
>>>> with a data frame with more than 820,000 observations and 80
>>>> variables. The Stata file is 150 MB. With my Pentium IV 2 GHz,
>>>> 1 GB of RAM and Windows XP, I couldn't do the import using the
>>>> read.dta() function from package foreign. With Stat Transfer I
>>>> managed to convert the Stata file to a 350 MB S file, but my
>>>> machine still didn't manage to import it using read.S().
>>>>
>>>> I even tried to "increase" my memory by memory.limit(4000), but it
>>>> still didn't work.
>>>>
>>>> Regardless of the answer to my question, I'd appreciate hearing
>>>> about your experience/suggestions for working with big files in R.
>>>>
>>>> Thank you for youR-Help,
>>>>
>>>> Dimitri Szerman

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
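[As a rough back-of-the-envelope check on the numbers in Dimitri's
post, and assuming (per Thomas Lumley's explanation) that every column
ends up stored as an 8-byte double in R, the data frame alone needs
about half a gigabyte before any copies are made:

  rows  <- 820000
  cols  <- 80
  bytes <- rows * cols * 8      # 8 bytes per double precision value
  bytes / 2^20                  # about 500 megabytes

Since importing and manipulating a data frame usually makes at least
one temporary copy, this easily exhausts 1 GB of RAM, and
memory.limit(4000) cannot help beyond what a 32-bit Windows process
can actually address (roughly 2 to 3 GB). Reading only the columns
that are needed, or pulling blocks from a database as sketched above,
sidesteps the problem.]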
