On 13-06-05 12:08 AM, ivo welch wrote:
thx, greg.

chunk boundaries have meaning.  the reader needs to stop and buffer one
line when it has crossed to the first line beyond the boundary.  it is also
a problem that read.csv no longer works on the file directly---readLines
then has to do the processing.  (restarting read.csv over and over again
with different skip values is probably not a good idea for big files.)  it
needs a lot of smarts to intelligently append to a data frame.  (if the
input is a data matrix, this is much simpler, of course.)

As Greg said, you don't need to use skip: just don't close the file, and continue reading from where you stopped on the previous pass.

If you don't know the size of the blocks in advance, this is harder, but it's not really all that hard. The logic would be something like this:

open the file
read the first block including the header
while not done:
   if you have a complete block with some extra lines at the end,
   extract them and save them, then process the complete block.
   Initialize the next block with the extra lines.

   if the block is incomplete, read some more and append it
   to what you saved.
end while
close the file
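
In R, the loop above might look roughly like the following sketch (the chunk size of 10000 lines, the grouping column "date", and the user-supplied process() function are assumptions for illustration, not part of any package):

con <- file("infile.csv", open = "r")
header   <- readLines(con, n = 1)        # keep the header to re-use for every chunk
leftover <- character(0)

repeat {
  fresh <- readLines(con, n = 10000)     # read the next batch of raw lines
  lines <- c(leftover, fresh)
  if (length(lines) == 0) break

  d <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)

  if (length(fresh) < 10000) {
    ## end of file: everything that remains is the final complete block
    process(d)                           # user-supplied per-block function (assumed)
    leftover <- character(0)
  } else {
    ## hold back rows of the last, possibly incomplete, group and
    ## prepend them to the next read
    last <- d$date == d$date[nrow(d)]
    process(d[!last, , drop = FALSE])
    leftover <- lines[last]
  }
}
close(con)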

Duncan Murdoch


exporting large input files to sqlite databases makes sense when the same
file is used again and again, but probably not for a staged, one-time pass
over the data.  the disk consumption is too big.

the writer could become quasi-threaded by writing to multiple temp files
and then concatenating at the end, but this would be a nasty
solution...nothing like the parsimonious elegance and generality that a
built-in R filter function could provide.

----
Ivo Welch (ivo.we...@gmail.com)



On Tue, Jun 4, 2013 at 2:56 PM, Greg Snow <538...@gmail.com> wrote:

Some possibilities using existing tools.

If you create a file connection and open it before reading from it (or
writing to it), then functions like read.table and read.csv (and
write.table for a writable connection) will read from the connection, but
not close and reset it.  This means that you could open 2 files, one for
reading and one for writing, then read in a chunk, process it, write it
out, then read in the next chunk, etc.
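
A minimal sketch of that two-connection pattern (the file names, the 1000-row chunk size, and the pass-through "processing" step are placeholders):

infile  <- file("big_in.csv", open = "r")
outfile <- file("big_out.csv", open = "w")

hdr <- strsplit(readLines(infile, n = 1), ",")[[1]]   # read the column names once

first <- TRUE
repeat {
  chunk <- tryCatch(
    read.csv(infile, header = FALSE, nrows = 1000,
             col.names = hdr, stringsAsFactors = FALSE),
    error = function(e) NULL)            # read.csv errors once the input is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break

  res <- chunk                           # replace with the real per-chunk processing
  write.table(res, outfile, sep = ",", row.names = FALSE, col.names = first)
  first <- FALSE                         # write the header only for the first chunk
}
close(infile)
close(outfile)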

Another option would be to read the data into an ff object (ff package) or
into a database (SQLite for one) which could have the data accessed in
chunks, possibly even in parallel.
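
For the database route, a rough sketch using the DBI and RSQLite packages (the table name, the "date" and "x" columns, and the example query are made up for illustration):

library(DBI)
db  <- dbConnect(RSQLite::SQLite(), "big.sqlite")
inp <- file("big_in.csv", open = "r")
hdr <- strsplit(readLines(inp, n = 1), ",")[[1]]

repeat {
  chunk <- tryCatch(
    read.csv(inp, header = FALSE, nrows = 10000,
             col.names = hdr, stringsAsFactors = FALSE),
    error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbWriteTable(db, "big", chunk, append = TRUE)   # accumulate the chunks in one table
}
close(inp)

## later, pull one group at a time or aggregate inside the database
means <- dbGetQuery(db, "SELECT date, AVG(x) AS mean_x FROM big GROUP BY date")
dbDisconnect(db)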


On Mon, Jun 3, 2013 at 4:59 PM, ivo welch <ivo.we...@anderson.ucla.edu> wrote:

dear R wizards---

I presume this is a common problem, so I thought I would ask whether
this solution already exists and if not, suggest it.  say, a user has
a data set of x GB, where x is very big---say, greater than RAM.
fortunately, data often come sequentially in groups, and there is a
need to process contiguous subsets of them and write the results to a
new file.  read.csv and write.csv only work on FULL data sets.
read.csv has the ability to skip n lines and read only m lines, but
this can cross the subsets.  the useful solution here would be a
"filter" function that understands chunks:

    filter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...

a chunk would not exactly be a factor, because normal R factors can be
non-sequential in the data frame.  filter.csv would make it very simple
to work on large data sets...almost SAS simple:

    filter.csv( pipe('bzcat infile.csv.bz2'), "results.csv", "date",
                function(d) colMeans(d) )
or
    filter.csv( pipe('bzcat infile.csv.bz2'), pipe("bzip2 -c > results.csv.bz2"),
                "date", function(d) d[ !duplicated(d$date), ] )  ## keep only the
                ## first observation for each date

or some reasonable variant of this.
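
for concreteness, here is one way such a filter.csv could be written in user code today (a sketch only; it assumes the chunk column takes contiguous runs of values, and the 10000-line read size is arbitrary):

filter.csv <- function(in.con, out.con, chunk, FUNprocess, nlines = 10000) {
  if (is.character(in.con))  in.con  <- file(in.con)   # accept filenames or connections
  if (is.character(out.con)) out.con <- file(out.con)
  if (!isOpen(in.con))  open(in.con,  "r")
  if (!isOpen(out.con)) open(out.con, "w")
  on.exit({ close(in.con); close(out.con) })

  header   <- readLines(in.con, n = 1)
  leftover <- character(0)
  first    <- TRUE
  repeat {
    fresh <- readLines(in.con, n = nlines)
    lines <- c(leftover, fresh)
    if (length(lines) == 0) break
    d <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    if (length(fresh) < nlines) {         # end of input: the last block is complete
      keep <- rep(TRUE, nrow(d)); leftover <- character(0)
    } else {                              # hold back the last, possibly split, chunk
      keep <- d[[chunk]] != d[[chunk]][nrow(d)]
      leftover <- lines[!keep]
    }
    if (any(keep)) {
      out <- FUNprocess(d[keep, , drop = FALSE])
      write.table(out, out.con, sep = ",", row.names = FALSE, col.names = first)
      first <- FALSE
    }
  }
  invisible(NULL)
}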

now that I can have many small chunks, it would be nice if this were
threadsafe, so

    mcfilter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...

which, with 'library(parallel)', could feed FUNprocess to multiple cores and
make sure that the processes don't step on one another.  (why did R
not use a dot after "mc" for parallel lapply?)  presumably, to keep it
simple, mcfilter.csv would keep a counter of read chunks and block the
writing of chunks until the next sequential chunk arrives.

just a suggestion...

/iaw

----
Ivo Welch (ivo.we...@gmail.com)





--
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com




