I just tested read.csv with file connections...it still works. It can read one line at a time (without col.names and with nrows). Nice. It loses its type memory across reinvocations, but this is usually not a problem if one reads a few thousand lines inside a buffer function. This sort of function is useful only for big files anyway.
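For example, a sketch of the pattern (the chunk size of 1000 is arbitrary; col.names and colClasses are optional assumptions here, added to pin down the names and types that are otherwise re-guessed on every call):

    con <- file("infile.csv", open = "r")            # open once, keep open
    first <- read.csv(con, nrows = 1000, header = TRUE)
    types <- sapply(first, class)                    # remember the guessed types
    ## ... process 'first' ...
    repeat {
      chunk <- tryCatch(
        read.csv(con, nrows = 1000, header = FALSE,
                 col.names = names(first), colClasses = types),
        error = function(e) NULL)                    # no lines left
      if (is.null(chunk)) break
      ## ... process 'chunk' ...
    }
    close(con)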
Is it possible to block write.csv across multiple threads in mclapply? Or to hook a single-threaded function into the mclapply collector?

/iaw
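One way to get that effect without any cross-process locking (a sketch; 'chunks' and FUNprocess are placeholders, not existing objects): let the workers only compute, and let the parent process do all the writing. mclapply returns its results in input order, so the writes stay sequential.

    library(parallel)

    results <- mclapply(chunks, FUNprocess, mc.cores = 4)  # ordered like 'chunks'

    out <- file("results.csv", open = "w")
    for (i in seq_along(results))
      write.table(results[[i]], out, sep = ",", row.names = FALSE,
                  col.names = (i == 1))              # header only once
    close(out)

The price is that all results sit in memory before the first write; that is fine for aggregated output like colMeans, less so for row-for-row filtering of a file bigger than RAM.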
On Wed, Jun 5, 2013 at 5:06 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

> On 13-06-05 12:08 AM, ivo welch wrote:
>
>> thx, greg.
>>
>> chunk boundaries have meanings. the reader needs to stop, and buffer one
>> line when it has crossed to the first line beyond the boundary. it is
>> also a problem that read.csv no longer works with files---readLines then
>> has to do the processing. (starting read.csv over and over again with
>> different skip.lines is probably not a good idea for big files.) it
>> needs a lot of smarts to intelligently append to a data frame. (if the
>> input is a data matrix, this is much simpler, of course.)
>
> As Greg said, you don't need to use skip.lines: just don't close the
> file, and continue reading from where you stopped on the previous run.
>
> If you don't know the size of blocks in advance this is harder, but it's
> not really all that hard. The logic would be something like this:
>
>   open the file
>   read the first block, including the header
>   while not done:
>     if you have a complete block with some extra lines at the end,
>       extract them and save them, then process the complete block.
>       Initialize the next block with the extra lines.
>
>     if the block is incomplete, read some more and append it
>       to what you saved.
>   end while
>   close the file
>
> Duncan Murdoch
>
>> exporting large input files to sqlite databases makes sense when the
>> same file is used again and again, but probably not when it is a staged
>> one-time processor. the disk consumption is too big.
>>
>> the writer could become quasi-threaded by writing to multiple temp files
>> and then concatenating at the end, but this would be a nasty
>> solution...nothing like the parsimonious elegance and generality that a
>> built-in R filter function could provide.
>>
>> ----
>> Ivo Welch (ivo.we...@gmail.com)
>>
>> On Tue, Jun 4, 2013 at 2:56 PM, Greg Snow <538...@gmail.com> wrote:
>>
>>> Some possibilities using existing tools.
>>>
>>> If you create a file connection and open it before reading from it (or
>>> writing to it), then functions like read.table and read.csv (and
>>> write.table for a writable connection) will read from the connection,
>>> but not close and reset it. This means that you could open 2 files, one
>>> for reading and one for writing, then read in a chunk, process it,
>>> write it out, then read in the next chunk, etc.
>>>
>>> Another option would be to read the data into an ff object (ff package)
>>> or into a database (SQLite for one) which could have the data accessed
>>> in chunks, possibly even in parallel.
>>>
>>> On Mon, Jun 3, 2013 at 4:59 PM, ivo welch <ivo.we...@anderson.ucla.edu> wrote:
>>>
>>>> dear R wizards---
>>>>
>>>> I presume this is a common problem, so I thought I would ask whether
>>>> this solution already exists and if not, suggest it. say a user has a
>>>> data set of x GB, where x is very big---say, greater than RAM.
>>>> fortunately, data often come sequentially in groups, and there is a
>>>> need to process contiguous subsets of them and write the results to a
>>>> new file. read.csv and write.csv only work on FULL data sets. read.csv
>>>> has the ability to skip n lines and read only m lines, but this can
>>>> cross the subsets. the useful solution here would be a "filter"
>>>> function that understands about chunks:
>>>>
>>>>   filter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>>>>
>>>> a chunk would not exactly be a factor, because normal R factors can be
>>>> non-sequential in the data frame. filter.csv would make it very simple
>>>> to work on large data sets...almost SAS simple:
>>>>
>>>>   filter.csv( pipe('bzcat infile.csv.bz2'), "results.csv", "date",
>>>>               function(d) colMeans(d) )
>>>> or
>>>>   filter.csv( pipe('bzcat infile.csv.bz2'),
>>>>               pipe('bzip2 -c > results.csv.bz2'), "date",
>>>>               function(d) d[ unique(d$date), ] )  ## filter out
>>>>                 ## observations that have the same date again later
>>>>
>>>> or some reasonable variant of this.
>>>>
>>>> now that I can have many small chunks, it would be nice if this were
>>>> thread-safe, so
>>>>
>>>>   mcfilter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>>>>
>>>> with library(parallel), could feed FUNprocess to multiple cores and
>>>> make sure that the processes don't step on one another. (why did R not
>>>> use a dot after "mc" for parallel lapply?) presumably, to keep it
>>>> simple, mcfilter.csv would keep a counter of read chunks and block
>>>> write chunks until the next sequential chunk in order arrives.
>>>>
>>>> just a suggestion...
>>>>
>>>> /iaw
>>>>
>>>> ----
>>>> Ivo Welch (ivo.we...@gmail.com)
>>>
>>> --
>>> Gregory (Greg) L. Snow Ph.D.
>>> 538...@gmail.com
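To make the proposal concrete, here is a minimal single-threaded sketch of what filter.csv could look like, following Duncan's buffering loop above. Everything in it is hypothetical: the interface is just the one proposed in the thread, chunks are assumed to be contiguous runs of the values in the 'chunk' column, and the read size n is arbitrary.

    ## hypothetical sketch of the proposed filter.csv, not an existing function
    filter.csv <- function(in.con, out.con, chunk, FUNprocess, n = 10000) {
      if (!isOpen(in.con))  open(in.con,  "r")
      if (!isOpen(out.con)) open(out.con, "w")
      header <- readLines(in.con, n = 1)
      carry  <- character(0)    # lines of a possibly incomplete trailing group
      first  <- TRUE
      repeat {
        fresh <- readLines(in.con, n = n)
        eof   <- length(fresh) < n
        lines <- c(carry, fresh)
        if (length(lines) == 0) break
        d <- read.csv(text = c(header, lines))
        if (!eof) {             # the last group may continue in the next chunk
          hold  <- d[[chunk]] == d[[chunk]][nrow(d)]
          carry <- lines[hold]
          d     <- d[!hold, ]
        }
        if (nrow(d) > 0) {
          ## split in order of appearance, apply FUNprocess per group
          groups <- split(d, factor(d[[chunk]], levels = unique(d[[chunk]])))
          res <- do.call(rbind, lapply(groups, FUNprocess))
          write.table(res, out.con, sep = ",", row.names = FALSE,
                      col.names = first)
          first <- FALSE
        }
        if (eof) break
      }
      close(in.con); close(out.con)
    }

Note that the sketch writes through write.table with sep = "," rather than write.csv, because write.csv resists overriding col.names and the header must be written only once across appended blocks. mcfilter.csv would wrap the same loop, with the parent collecting results from mclapply and writing blocks in sequence, as in the sketch earlier in the thread.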