For that amount of data, I would suggest loading it into MySQL and connecting from R through RODBC to pull it in directly (or, preferably, pulling in only the subset you want to do statistical analysis on). You can move data back and forth between SQL and R: use SQL for data cleaning and standardization, and use R for the exploratory data analysis and statistical analysis.
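[Editorial sketch, not part of the original message.] The "subset in SQL, analyze in R" workflow described above might look roughly like this. The DSN name "bigdata", the table "measurements", and the columns "site", "year", and "value" are all hypothetical, and the helper functions are illustrative only. The RODBC calls used are odbcConnect(), sqlQuery(), and odbcClose(); the actual pull is left commented out because it needs a configured ODBC data source.

```r
# Illustrative helper (hypothetical): build a SELECT that returns only the
# columns and rows needed, so the filtering happens in the database.
# Intended for trusted, hand-written strings only.
build_subset_query <- function(table, columns, where) {
  sprintf("SELECT %s FROM %s WHERE %s",
          paste(columns, collapse = ", "), table, where)
}

# Illustrative helper (hypothetical): open a connection, run the query,
# and close the connection again when the function exits.
pull_subset <- function(dsn, query) {
  channel <- RODBC::odbcConnect(dsn)
  on.exit(RODBC::odbcClose(channel))
  RODBC::sqlQuery(channel, query, stringsAsFactors = FALSE)
}

query <- build_subset_query("measurements",
                            c("site", "year", "value"),
                            "year >= 2010")

# Requires the RODBC package and a DSN named "bigdata" (an assumption):
# dat <- pull_subset("bigdata", query)
# summary(dat$value)
```

Only the rows and columns named in the query cross into R, which is the point of doing the subsetting on the SQL side.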
Regards,
Amul Badjatya
www(dot)amulbadjatya(dot)com
Sent from my Windows Phone
________________________________
From: Randall Pruim
Sent: 18-01-2013 20:43
To: gaurav singh
Cc: [email protected]
Subject: Re: [R-sig-teaching] importing and processing large datasets in R

read.table is working harder than necessary because it is parsing the file and checking for the "least common denominator" data type to convert each column to. You can help it a bit by telling it what type to use for each column, using the colClasses argument. From the read.table help page:

  colClasses
    character. A vector of classes to be assumed for the columns. Recycled as necessary, or if the character vector is named, unspecified values are taken to be NA. Possible values are NA (the default, when type.convert is used), "NULL" (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or "factor", "Date" or "POSIXct". Otherwise there needs to be an as method (from package methods) for conversion from "character" to the specified formal class.

Assuming that is sufficient to read the data without trying your patience, the first thing you should do is save() the data. That will store it in an R format that will subsequently load much more quickly (and that also keeps track of more information than a flat file can). I've used this trick successfully with large genetics data sets that didn't change frequently but were accessed frequently. (Sorry, it's been a while, so I don't recall their size offhand.)

I don't know whether there are external tools that can convert csv files to Rdata files without requiring that initial read in R; there may be. I also haven't done the experiment to see whether csv files read more quickly than fixed-width files that use spaces to pad fields. I suspect there must be at least some speedup, but perhaps not enough to matter.

There are also packages that allow you to work with large data sets without reading them into memory. I've not used them, but they are worth exploring.

Finally, http://www.revolutionanalytics.com/ is a company that advertises support for large data. They may offer some of their things free for academics (I don't recall). I've seen them give demos at meetings, and at least in that context, it looked like they were doing some clever things. You've now exhausted my knowledge of their products...

---rjp

PS. I'm not sure that sig-teaching is really the correct place for this question -- unless, I suppose, you are teaching a course on big data. You might get more responses from a different list.

On Jan 18, 2013, at 9:53 AM, gaurav singh wrote:

> Hi Everyone,
>
> I am a little new to R, and the first problem I am facing is the dilemma of
> whether R is suitable for files of size 2 GB and slightly more than 2
> million rows. When I try importing the data using read.table, it seems to
> take forever and I have to cancel the command. Are there any special
> techniques or methods which I can use, or some tricks of the game that I
> should keep in mind, in order to be able to do data analysis on such large
> files using R?
>
> Cheers :-)
>
> --
> Regards
> Gaurav Singh
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
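[Editorial sketch, not part of the original thread.] The two suggestions in Randall's reply, declaring column types with colClasses and then caching the result with save(), could be tried out along these lines. The tiny three-column file written here is only a stand-in for the real 2 GB file, and its column names and types are made up for illustration.

```r
# Create a small stand-in CSV so the sketch runs on its own.
# (Hypothetical columns; the real file's layout is not given in the thread.)
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")
write.csv(data.frame(id    = 1:3,
                     label = c("a", "b", "c"),
                     value = c(1.5, 2.5, 3.5)),
          csv_path, row.names = FALSE)

# Declare each column's type up front so read.table does not have to
# inspect the data to work out what to convert each column to.
dat <- read.table(csv_path, header = TRUE, sep = ",",
                  colClasses = c("integer", "character", "numeric"))

# Store the data frame in R's own format after the first (slow) read.
save(dat, file = rdata_path)

# In a later session, load() restores the object under its saved name.
rm(dat)
load(rdata_path)
str(dat)
```

On a large file, the read.table call is the step to time with and without colClasses; the save()/load() pair only pays off from the second session onward.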
