Hi Gaurav,

For that amount of data, consider loading it into MySQL and connecting through 
RODBC to pull it into R directly (or, better, pull in only the subset you want 
to do statistical analysis on).
You can move data back and forth between SQL and R.
Use SQL for data cleaning and standardization, and use R for the exploratory 
data analysis and statistical analysis.
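
A minimal sketch of that workflow, in case it helps. It assumes the RODBC 
package is installed and that an ODBC data source for the MySQL database has 
already been configured; the helper name pull_subset, the DSN "my_dsn", and 
the table and column names are hypothetical placeholders, not anything from 
your setup.

```r
# Sketch only: pull_subset is a hypothetical helper, and the DSN,
# credentials, and query passed to it are placeholders.
pull_subset <- function(dsn, query, uid = "", pwd = "") {
  channel <- RODBC::odbcConnect(dsn, uid = uid, pwd = pwd)
  # odbcConnect() returns -1 (with a warning) when the connection fails.
  if (!inherits(channel, "RODBC")) stop("could not connect to ", dsn)
  on.exit(RODBC::odbcClose(channel))
  # Let the database do the filtering; only the result comes into R.
  RODBC::sqlQuery(channel, query, stringsAsFactors = FALSE)
}

# Example call (not run here, since it needs a configured data source):
# dat <- pull_subset("my_dsn",
#                    "SELECT id, score FROM measurements WHERE year = 2012")
```

The point of the design is that the WHERE clause runs in the database, so R 
only ever holds the rows and columns you actually want to analyze.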

Regards,
Amul Badjatya
www(dot)amulbadjatya(dot)com

________________________________
From: Randall Pruim
Sent: 18-01-2013 20:43
To: gaurav singh
Cc: [email protected]<mailto:[email protected]>
Subject: Re: [R-sig-teaching] importing and processing large datasets in R
read.table is working harder than necessary because it is parsing the file and 
checking for the "least common denominator" data type to convert each column 
to.  You can help it a bit by telling it what type to use for each column, 
using the colClasses argument:

colClasses
character. A vector of classes to be assumed for the columns. Recycled as 
necessary, or if the character vector is named, unspecified values are taken to 
be NA.

Possible values are NA (the default, when type.convert is used), "NULL" (when 
the column is skipped), one of the atomic vector classes (logical, integer, 
numeric, complex, character, raw), or "factor", "Date" or "POSIXct". Otherwise 
there needs to be an as method (from package methods) for conversion from 
"character" to the specified formal class.
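
As a small illustration of the argument described above, the sketch below 
writes a tiny stand-in file and reads it back with an explicit class for each 
column. The file contents and column names are made up for the example; with 
the real file you would list one class per column in file order.

```r
# Write a tiny stand-in file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,score,when,note",
             "1,alpha,3.5,2013-01-18,x",
             "2,beta,4.25,2013-01-19,y"), tmp)

# One class per column, in file order. "NULL" (as a string) skips a column.
dat <- read.table(tmp, header = TRUE, sep = ",",
                  colClasses = c("integer", "character", "numeric",
                                 "Date", "NULL"))
str(dat)
```

Because every type is stated up front, read.table does not have to inspect the 
values to guess them, and skipped columns are never stored at all.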


Assuming that is sufficient to read the data without trying your patience, the 
first thing you should do is save() the data.  That will store it in an R 
format that will subsequently load much more quickly (and also keeps track of 
more information than a flat file can).  I've used this trick successfully with 
large genetics data sets that didn't change frequently but were accessed 
frequently.  (Sorry, it's been a while, so I don't recall their size off hand.)
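
A minimal sketch of that save-once, load-many pattern. The data frame here is 
a small stand-in for the result of the slow read.table call, and the file name 
is arbitrary.

```r
# Stand-in for the data frame produced by the slow initial import.
dat <- data.frame(id = 1:3, score = c(3.5, 4.25, 5))

# Save it once in R's own binary format.
rdata_file <- file.path(tempdir(), "dat.RData")
save(dat, file = rdata_file)

# In a later session, load() recreates the object under its original name.
original <- dat
rm(dat)
load(rdata_file)
identical(dat, original)
```

Column types, factor levels, and attributes come back exactly as they were 
saved, which is the extra information a flat file cannot carry.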

I don't know if there are external tools that can convert csv files to Rdata 
files without requiring that initial read in R, but there may be.  I also 
haven't done the experiment to see whether csv files read more quickly than 
fixed-width files that use spaces to pad fields.  I suspect there must be at 
least some speed up, but perhaps not enough to matter.
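
For anyone who wants to try that experiment, here is one way it could be set 
up. This is only a sketch on small synthetic data (the sizes, field widths, 
and column names are arbitrary choices), so the timings it prints say nothing 
definite about a 2 GB file.

```r
set.seed(1)
n <- 10000
x <- sample.int(99999, n, replace = TRUE)
y <- round(runif(n, 0, 1000), 4)

csv_file <- tempfile(fileext = ".csv")
fwf_file <- tempfile(fileext = ".txt")

# Same data twice: comma-separated, and space-padded fixed-width fields.
write.table(data.frame(x = x, y = y), csv_file, sep = ",",
            row.names = FALSE, col.names = FALSE)
writeLines(paste0(formatC(x, width = 8),
                  formatC(y, width = 12, format = "f", digits = 4)),
           fwf_file)

csv_time <- system.time(
  from_csv <- read.table(csv_file, sep = ",", col.names = c("x", "y"),
                         colClasses = c("integer", "numeric")))
fwf_time <- system.time(
  from_fwf <- read.fwf(fwf_file, widths = c(8, 12), col.names = c("x", "y"),
                       colClasses = c("integer", "numeric")))

# Elapsed seconds for each format.
print(c(csv = csv_time[["elapsed"]], fwf = fwf_time[["elapsed"]]))
```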

There are also packages that allow you to work with large data sets without 
reading them into memory.  I've not used them, but they are worth exploring.
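
This is not one of those packages, but the same general idea can be sketched 
in base R by reading a file through a connection a block of lines at a time 
and keeping only running summaries. The file, chunk size, and column names 
below are made up for the example.

```r
# Small stand-in for a large file.
big_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2500, value = rep(c(1, 2, 3, 4, 5), 500)),
          big_file, row.names = FALSE)

chunk_size <- 1000
con <- file(big_file, open = "r")
invisible(readLines(con, n = 1))  # skip the header line

n_rows <- 0
total <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break   # end of file
  chunk <- read.table(text = lines, sep = ",",
                      col.names = c("id", "value"),
                      colClasses = c("integer", "numeric"))
  # Keep only the running totals; the chunk itself is discarded.
  n_rows <- n_rows + nrow(chunk)
  total <- total + sum(chunk$value)
}
close(con)

mean_value <- total / n_rows
```

Only one chunk is in memory at any time, so this works for summaries that can 
be accumulated piece by piece, though not for analyses that need all rows at 
once.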

Finally, http://www.revolutionanalytics.com/ is a company that advertises 
support for large data.  They may offer some of their things free for academics 
(I don't recall).  I've seen them give demos at meetings, and at least in that 
context, it looked like they were doing some clever things.  You've now 
exhausted my knowledge of their products...

---rjp

PS.  I'm not sure that sig-teaching is really the correct place for this 
question -- unless, I suppose, you are teaching a course on big data.  You 
might get more responses from a different list.

On Jan 18, 2013, at 9:53 AM, gaurav singh wrote:

> Hi Everyone,
>
> I am a little new to R and the first problem I am facing is the dilemma
> whether R is suitable for files of size 2 GB and slightly more than 2
> million rows. When I try importing the data using read.table, it seems to
> take forever and I have to cancel the command. Are there any special
> techniques or methods which I can use or some tricks of the game that I
> should keep in mind in order to be able to do data analysis on such large
> files using R?
>
> Cheers :-)
>
>
> --
> Regards
> Gaurav Singh
>
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> [email protected]<mailto:[email protected]> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching

