dear R wizards---

I presume this is a common problem, so I thought I would ask whether a solution already exists and, if not, suggest one. say a user has a data set of x GB, where x is very big---say, bigger than RAM. fortunately, the data often come sequentially in groups, and the task is to process contiguous subsets of them and write the results to a new file. read.csv and write.csv only work on FULL data sets. read.csv can skip n lines and read only m lines, but such a window can cut across the subsets. the useful solution here would be a "filter" function that understands chunks:

filter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
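to make this concrete, here is a rough sketch of how the serial version might work. it is only a sketch: the block size, the hold-back logic, and the helper names are mine, and it assumes the chunk column is already grouped contiguously in the file, that each chunk fits in RAM, and that FUNprocess returns something write.table can digest.

filter.csv <- function(in.csv, out.csv, chunk, FUNprocess, block = 1e5) {
  ## open the input/output if we were handed file names or unopened connections
  inc <- if (is.character(in.csv)) file(in.csv) else in.csv
  if (!isOpen(inc)) open(inc, "r")
  outc <- if (is.character(out.csv)) file(out.csv) else out.csv
  if (!isOpen(outc)) open(outc, "w")
  on.exit({ close(inc); close(outc) })

  header <- readLines(inc, n = 1)   # re-used for every block
  carry  <- NULL                    # rows of the last, possibly incomplete chunk
  first  <- TRUE                    # write column names only once

  flush.chunks <- function(d, final) {
    d   <- rbind(carry, d)
    ids <- d[[chunk]]
    ## process every complete chunk; hold the last one back unless this is the end
    done <- if (final) unique(ids) else head(unique(ids), -1)
    for (id in done) {
      res <- FUNprocess(d[ids == id, , drop = FALSE])
      write.table(res, outc, sep = ",", row.names = FALSE, col.names = first)
      first <<- FALSE
    }
    carry <<- if (final) NULL else d[ids == tail(unique(ids), 1), , drop = FALSE]
  }

  repeat {
    lines <- readLines(inc, n = block)
    if (length(lines) == 0L) break
    tc <- textConnection(c(header, lines))
    d  <- read.csv(tc, stringsAsFactors = FALSE)
    close(tc)
    flush.chunks(d, final = FALSE)
  }
  if (!is.null(carry)) flush.chunks(NULL, final = TRUE)
  invisible(NULL)
}

the only state carried across blocks is the held-back tail chunk, so memory use stays at roughly one block plus one chunk.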
a chunk would not exactly be a factor, because ordinary R factors can be non-sequential in the data frame. filter.csv would make it very simple to work on large data sets...almost SAS simple:

filter.csv( pipe('bzcat infile.csv.bz2'), "results.csv", "date",
            function(d) colMeans(d) )

or

filter.csv( pipe('bzcat infile.csv.bz2'), pipe("bzip2 -c > results.csv.bz2"), "date",
            function(d) d[ !duplicated(d$date), ] )  ## drop observations that repeat a date seen earlier

or some reasonable variant of this.

now that the data come in many small chunks, it would be nice if this were thread-safe, so that

mcfilter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...

with 'library(parallel)' could feed FUNprocess to multiple cores and make sure the processes don't step on one another. (why did R not use a dot after "mc" in the name of the parallel lapply?) presumably, to keep it simple, mcfilter.csv would keep a counter of the chunks it has read and hold back each write until the next chunk in sequence arrives.

just a suggestion...

/iaw

----
Ivo Welch (ivo.we...@gmail.com)
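p.s. for the ordered writes in mcfilter.csv, maybe no explicit counter is even needed: parallel::mclapply already returns its results in input order, so a writer that walks the collected list emits the chunks in sequence. a rough sketch only, with made-up names: 'chunks' is a list of already-read per-chunk data frames, 'outc' an open writable connection.

library(parallel)

## process a window of chunks in parallel, then write the results in their original order
write.in.order <- function(chunks, outc, FUNprocess, cores = 2L, window = 4L * cores) {
  if (length(chunks) == 0L) return(invisible(NULL))
  first <- TRUE
  for (i in seq(1L, length(chunks), by = window)) {
    batch <- chunks[i:min(i + window - 1L, length(chunks))]
    res <- mclapply(batch, FUNprocess, mc.cores = cores)   # parallel, order preserved
    for (r in res) {                                        # serial, in-order writes
      write.table(r, outc, sep = ",", row.names = FALSE, col.names = first)
      first <- FALSE
    }
  }
  invisible(NULL)
}

only FUNprocess runs in parallel; the writes stay serial, so the worker processes cannot step on one another.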