On 06-Dec-05 John McHenry wrote:
> I have very large csv files (up to 1GB each of ASCII text). I'd like
> to be able to read them directly in to R. The problem I am having is
> with the variable length of the data in each record.
>
> Here's a (simplified) example:
>
> $ cat foo.csv
> Name,Start Month,Data
> Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
> Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
>
> The records consist of rows with some set comma-separated fields
> (e.g. the "Name" & "Start Month" fields in the above) and then the
> data follow as a variable-length list of comma-separated values until
> a new line is encountered.
While you may well get a good R solution from the experts, in such a
situation (as in so many) I would be tempted to pre-process the file
with 'awk' (installed by default on Unix/Linux systems, available also
for Windows). The following will give you a CSV file with a constant
number of fields per line. While this does not eliminate the NAs which
you apparently find unsightly, it should be a fast and clean way of
doing the basic job, since it is a line-by-line operation in two
passes, so there should be no question of choking the system (unless
you run out of HD space as a result of creating the second file).

Two passes, on the lines of

Pass 1:

  cat foo.csv | awk '
    BEGIN{FS=","; n=0}
    {m=NF; if(m>n){n=m}}
    END{print n}
  '

which gives you the maximum number of fields in any line. Suppose (for
example) that this number is 37. Then

Pass 2:

  cat foo.csv | awk -v maxF=37 '
    BEGIN{FS=","; OFS=","}
    {if(NF<maxF){$maxF=""}}
    {print $0}
  ' > newfoo.csv

Tiny example:

1) See foo.csv:

  cat foo.csv
  1
  1,2
  1,2,3
  1,2,3,4
  1,2

2) Pass 1:

  cat foo.csv | awk '
    BEGIN{FS=","; n=0}
    {m=NF; if(m>n){n=m}}
    END{print n}
  '
  4

3) So we need 4 fields per line. With maxF=4, Pass 2:

  cat foo.csv | awk -v maxF=4 '
    BEGIN{FS=","; OFS=","}
    {if(NF<maxF){$maxF=""}}
    {print $0}
  ' > newfoo.csv

4) See newfoo.csv:

  cat newfoo.csv
  1,,,
  1,2,,
  1,2,3,
  1,2,3,4
  1,2,,

So you now have a CSV file with a constant number of fields per line.
This doesn't make it into lists, though.

Hoping this helps,
Ted.

> Now I can use e.g.
>
>   fileName="foo.csv"
>   ta <- read.csv(fileName, header=F, skip=1, sep=",", dec=".", fill=T)
>
> which does the job nicely:
>
>      V1 V2      V3     V4     V5      V6      V7     V8    V9    V10    V11    V12    V13     V14     V15    V16     V17
>   1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA     NA     NA     NA     NA      NA      NA     NA      NA
>   2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
>
> but the problem is with files on the order of 1GB this either
> crunches for ever or runs out of memory trying ...
> plus having all those NAs isn't too pretty to look at.
>
> (I have a MATLAB version that can read this stuff into an array of
> cells in about 3 minutes).
>
> I really want a fast way to read the data part into a list; that way
> I can access data in the array of lists containing the records by
> doing something like ta[[i]]$data.
>
> Ideas?
>
> Thanks,
>
> Jack.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Dec-05  Time: 18:08:54
------------------------------ XFMail ------------------------------

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
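[Editorial note: Ted's two awk passes can also be chained in a single shell snippet, so the maximum field count found in Pass 1 is fed straight into Pass 2 without editing maxF by hand. This is just a sketch of the same approach, assuming a POSIX shell and awk; foo.csv and newfoo.csv are the filenames from the example in the thread.]

```shell
#!/bin/sh
# Recreate the tiny example file from the thread.
printf '1\n1,2\n1,2,3\n1,2,3,4\n1,2\n' > foo.csv

# Pass 1: find the maximum number of comma-separated fields in any line.
maxF=$(awk 'BEGIN{FS=","} NF>n{n=NF} END{print n}' foo.csv)

# Pass 2: pad every shorter line out to maxF fields.
# Assigning to $maxF makes awk extend the record, inserting empty
# fields (separated by OFS) up to that position.
awk -v maxF="$maxF" 'BEGIN{FS=","; OFS=","}
                     NF<maxF{$maxF=""}
                     {print}' foo.csv > newfoo.csv

cat newfoo.csv
```

On the tiny example this produces the same padded newfoo.csv shown in step 4 above; on the real file it avoids the manual step of reading off the Pass 1 result.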