If you can show me equivalent Python code, in as few lines, that performs much faster, I'd very much appreciate it. I have been trying to find an "excuse" to learn Python, but so far have found what I can do in R quite adequate. Also, it's much easier to keep track of the workflow when everything is done in one place (R, in my case).
Andy

From: Steve Miller
>
> Why torture yourself and probably get bad performance in the process?
> You should handle the data consolidation in Python or Ruby, which are
> much more suited to this type of task, piping the results to R.
>
> Steve Miller
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy
> Sent: Thursday, May 11, 2006 5:50 AM
> To: [EMAIL PROTECTED]; r-help
> Subject: Re: [R] data input strategy - lots of csv files
>
> This is what I would try:
>
> csvlist <- list.files(pattern="csv$")
> bigblob <- lapply(csvlist, read.csv, ...)
> ## Get all dates that appear in any one of them.
> all.dates <- unique(unlist(lapply(bigblob, "[[", 1)))
> bigdata <- matrix(NA, length(all.dates), length(bigblob))
> dimnames(bigdata) <- list(all.dates, whatevercolnamesyouwant)
> ## Loop through bigblob and populate the corresponding column
> ## of bigdata with the matching dates.
> for (i in seq(along=bigblob)) {
>     bigdata[as.character(bigblob[[i]][, 1]), i] <- bigblob[[i]][, columnwithdata]
> }
>
> This is obviously untested, so I hope it's of some help.
>
> Andy
>
> From: Sean O'Riordain
> >
> > Good morning,
> > I currently have 63 .csv files, most of which have lines that look like
> > 01/06/05,23445
> > though some files have two numbers beside each date. There are missing
> > values, and currently the longest file has 318 rows.
> >
> > (merge() is losing the head and doing runaway memory allocation - but
> > that's another question - I'm still trying to pin that issue down and
> > make a small repeatable example.)
> >
> > Currently I'm reading in these files with lines like
> > a1 <- read.csv("daft_file_name_1.csv", header=F)
> > ...
> > a63 <- read.csv("another_silly_filename_63.csv", header=F)
> >
> > and then I'm naming the columns in these like...
> > names(a1)[2] <- "silly column name"
> > ...
> > names(a63)[2] <- "daft column name"
> >
> > then trying to merge()...
> > atot <- merge(a1, a2, all=T)
> > and then using language manipulation to loop
> > atot <- merge(atot, a3, all=T)
> > ...
> > atot <- merge(atot, a63, all=T)
> > etc...
> >
> > followed by more language manipulation
> > for () {
> >     rm(a1)
> > } etc...
> >
> > i.e.
> > for (i in 2:63) {
> >     atot <- merge(atot, eval(parse(text=paste("a", i, sep=""))), all=T)
> >     # eval(parse(text=paste("a", i, "[1] <- NULL", sep="")))
> >
> >     cat("i is ", i, gc(), "\n")
> >
> >     # now delete these 63 temporary objects...
> >     # e.g. should look like rm(a33)
> >     eval(parse(text=paste("rm(a", i, ")", sep="")))
> > }
> >
> > eventually getting a data frame with the first column being the date
> > and the subsequent 63 columns being the data, with missing values
> > coded as NA...
> >
> > So my question is... is there a better strategy for reading in lots of
> > small files (only a few kbytes each) like these, which are time series
> > with missing data - one that doesn't go through the above awkwardness
> > (and language manipulation) but still ends up with a nice data.frame
> > with NA values correctly coded, etc.?
> >
> > Many thanks,
> > Sean O'Riordain
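
For what it's worth, here is a sketch of one way to do the whole consolidation in R without creating 63 separate objects or resorting to eval(parse()). It is untested in the same spirit as the code quoted above, and it assumes each file has the date in column 1 and the value of interest in column 2; for the files with two numbers per date it keeps only the first, which may not be what is wanted, and the derived column names are just placeholders.

csvlist <- list.files(pattern = "\\.csv$")

## Read every file into one list; as.is = TRUE keeps the dates as
## character strings rather than factors.
alldat <- lapply(csvlist, read.csv, header = FALSE, as.is = TRUE)
names(alldat) <- sub("\\.csv$", "", csvlist)

## Keep the date and the first data column of each file, and name the
## data column after the file so the merged columns stay distinct.
for (nm in names(alldat)) {
    alldat[[nm]] <- alldat[[nm]][, 1:2]
    names(alldat[[nm]]) <- c("date", nm)
}

## Merge successively on the date column; all = TRUE keeps every date
## that appears in any file and codes the gaps as NA.
atot <- alldat[[1]]
for (i in seq(along = alldat)[-1]) {
    atot <- merge(atot, alldat[[i]], by = "date", all = TRUE)
}

With 63 files of only a few kilobytes each, the repeated merge() should be quick, and no language manipulation is needed because the files live in a named list rather than in variables a1 ... a63.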
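Alternatively, the matrix-filling idea quoted above can be made concrete along the following lines. This is again only a sketch: the choice of column 2 as the data column and the use of file names as column names stand in for whatever the real data require.

csvlist <- list.files(pattern = "\\.csv$")
bigblob <- lapply(csvlist, read.csv, header = FALSE, as.is = TRUE)

## Every date that appears in any of the files, as character strings.
all.dates <- unique(unlist(lapply(bigblob, function(d) as.character(d[[1]]))))

## One row per date, one column per file, columns named after the files.
bigdata <- matrix(NA, nrow = length(all.dates), ncol = length(bigblob),
                  dimnames = list(all.dates, sub("\\.csv$", "", csvlist)))

## Fill column i with file i's values, matched by date; dates that a
## given file does not contain are left as NA.
for (i in seq(along = bigblob)) {
    bigdata[as.character(bigblob[[i]][[1]]), i] <- bigblob[[i]][[2]]
}

If a data frame is preferred, something like data.frame(date = rownames(bigdata), bigdata, check.names = FALSE) gets back to the shape the original question asks for.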
