I just tested read.csv with open file connections...it still works.  it can
read one line at a time (with nrows= and without col.names).  nice.  it loses
its type memory across reinvocations (column types are re-guessed on every
call), but this is usually not a problem if one reads a few thousand lines at
a time inside a buffering function.  this sort of function is useful only for
big files anyway.
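
for the record, a minimal sketch of what I tested (the file name and chunk
size are invented, and it assumes an unquoted header line; locking in
colClasses after the first chunk fixes the type-memory problem):

    con <- file("big.csv", open = "r")                 ## hypothetical input file
    nms <- strsplit(readLines(con, n = 1), ",")[[1]]   ## header line
    types <- NA                                        ## NA lets read.csv guess
    repeat {
        chunk <- tryCatch(read.csv(con, header = FALSE, nrows = 5000,
                                   col.names = nms, colClasses = types),
                          error = function(e) NULL)    ## read.csv errors at EOF
        if (is.null(chunk)) break
        if (identical(types, NA)) types <- sapply(chunk, class)  ## lock in types
        ## ... process chunk here ...
    }
    close(con)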

is it possible to serialize write.csv across the multiple worker processes
in mclapply?  or to hook a single-threaded writer function into the mclapply
collector?
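
concretely, something like this sketch (chunks and process are invented
names): let mclapply do the computing, and let the master process do all the
writing, in input order:

    library(parallel)
    con.out <- file("results.csv", open = "w")           ## hypothetical output
    results <- mclapply(chunks, process, mc.cores = 4L)  ## workers compute in parallel
    for (r in results)                                   ## master writes serially,
        write.table(r, con.out, sep = ",",               ##  in the original order
                    col.names = FALSE, row.names = FALSE)
    close(con.out)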

/iaw



On Wed, Jun 5, 2013 at 5:06 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

> On 13-06-05 12:08 AM, ivo welch wrote:
>
>> thx, greg.
>>
>> chunk boundaries have meanings.  the reader needs to stop, and buffer one
>> line, when it has crossed to the first line beyond the boundary.  it is
>> also a problem that read.csv no longer works with files---readLines then
>> has to do the processing.  (restarting read.csv over and over again with
>> different skip= values is probably not a good idea for big files.)  it
>> needs a lot of smarts to intelligently append to a data frame.  (if the
>> input is a data matrix, this is much simpler, of course.)
>>
>
> As Greg said, you don't need to use skip=:  just don't close the
> file, and continue reading from where you stopped on the previous run.
>
> If you don't know the size of blocks in advance this is harder, but it's
> not really all that hard.  The logic would be something like this:
>
> open the file
> read the first block including the header
> while not done:
>    if you have a complete block with some extra lines at the end,
>    extract them and save them, then process the complete block.
>    Initialize the next block with the extra lines.
>
>    if the block is incomplete, read some more and append it
>    to what you saved.
> end while
> close the file
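>
> In R, a rough sketch of that loop, purely illustrative (it assumes
> comma-separated input with the grouping key in the first, unquoted
> column, and reads 1000 lines at a time):
>
>     con <- file("big.csv", open = "r")
>     header <- readLines(con, n = 1)
>     saved <- character(0)                   # extra lines for the next block
>     done <- FALSE
>     while (!done) {
>         fresh <- readLines(con, n = 1000)
>         done <- length(fresh) < 1000        # short read means end of file
>         lines <- c(saved, fresh)
>         if (length(lines) == 0) break
>         key <- sub(",.*", "", lines)        # grouping key = first column
>         last <- max(which(key != key[length(key)]), 0)  # end of last complete group
>         if (done) last <- length(lines)     # at EOF the final group is complete
>         if (last > 0) {
>             tc <- textConnection(c(header, lines[seq_len(last)]))
>             block <- read.csv(tc); close(tc)
>             ## ... process the complete block here ...
>         }
>         saved <- if (last < length(lines)) lines[(last + 1):length(lines)] else character(0)
>     }
>     close(con)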
>
> Duncan Murdoch
>
>
>> exporting large input files to SQLite databases makes sense when the same
>> file is used again and again, but probably not for a staged one-time
>> processing pass.  the extra disk consumption is too big.
>>
>> the writer could become quasi-threaded by writing to multiple temp files
>> and then concatenating at the end, but this would be a nasty
>> solution...nothing like the parsimonious elegance and generality that a
>> built-in R filter function could provide.
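>>
>> (a sketch of that temp-file workaround, with invented names, chunks
>> being a list of data pieces and FUNprocess the per-chunk function,
>> just to fix ideas:)
>>
>>     library(parallel)
>>     write.one <- function(i) {             ## each worker gets its own temp file
>>         tmp <- tempfile(fileext = ".csv")
>>         write.table(FUNprocess(chunks[[i]]), tmp, sep = ",",
>>                     col.names = (i == 1), row.names = FALSE)
>>         tmp
>>     }
>>     tmps <- unlist(mclapply(seq_along(chunks), write.one, mc.cores = 4L))
>>     file.create("results.csv")             ## then concatenate in chunk order
>>     file.append("results.csv", tmps)
>>     unlink(tmps)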
>>
>> ----
>> Ivo Welch (ivo.we...@gmail.com)
>>
>>
>>
>> On Tue, Jun 4, 2013 at 2:56 PM, Greg Snow <538...@gmail.com> wrote:
>>
>>  Some possibilities using existing tools.
>>>
>>> If you create a file connection and open it before reading from it (or
>>> writing to it), then functions like read.table and read.csv ( and
>>> write.table for a writable connection) will read from the connection, but
>>> not close and reset it.  This means that you could open 2 files, one for
>>> reading and one for writing, then read in a chunk, process it, write it
>>> out, then read in the next chunk, etc.
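>>>
>>> A minimal sketch of that read-process-write pattern (the file names,
>>> chunk size, and process() function are invented):
>>>
>>>     con.in  <- file("infile.csv",  open = "r")
>>>     con.out <- file("outfile.csv", open = "w")
>>>     nms <- strsplit(readLines(con.in, n = 1), ",")[[1]]   ## header line
>>>     repeat {
>>>         chunk <- tryCatch(read.csv(con.in, header = FALSE, nrows = 10000,
>>>                                    col.names = nms),
>>>                           error = function(e) NULL)       ## errors at EOF
>>>         if (is.null(chunk)) break
>>>         write.table(process(chunk), con.out, sep = ",",
>>>                     col.names = FALSE, row.names = FALSE)
>>>     }
>>>     close(con.in); close(con.out)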
>>>
>>> Another option would be to read the data into an ff object (ff package)
>>> or
>>> into a database (SQLite for one) which could have the data accessed in
>>> chunks, possibly even in parallel.
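>>>
>>> For the database route, RSQLite can hand the rows back in chunks,
>>> roughly like this (database, table, and column names are invented,
>>> and it assumes the data were already loaded with dbWriteTable):
>>>
>>>     library(RSQLite)
>>>     db  <- dbConnect(SQLite(), "big.db")
>>>     res <- dbSendQuery(db, "SELECT * FROM mydata ORDER BY date")
>>>     while (!dbHasCompleted(res)) {
>>>         chunk <- fetch(res, n = 10000)   ## one chunk at a time
>>>         ## ... process chunk ...
>>>     }
>>>     dbClearResult(res)
>>>     dbDisconnect(db)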
>>>
>>>
>>> On Mon, Jun 3, 2013 at 4:59 PM, ivo welch <ivo.we...@anderson.ucla.edu> wrote:
>>>
>>>  dear R wizards---
>>>>
>>>> I presume this is a common problem, so I thought I would ask whether
>>>> this solution already exists and if not, suggest it.  say, a user has
>>>> a data set of x GB, where x is very big---say, greater than RAM.
>>>> fortunately, data often come sequentially in groups, and there is a
>>>> need to process contiguous subsets of them and write the results to a
>>>> new file.  read.csv and write.csv only work on FULL data sets.
>>>> read.csv has the ability to skip n lines and read only m lines, but
>>>> this can cross the subsets.  the useful solution here would be a
>>>> "filter" function that understands about chunks:
>>>>
>>>>     filter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>>>>
>>>> a chunk would not exactly be a factor, because normal R factors can be
>>>> non-sequential in the data frame.  the filter.csv makes it very simple
>>>> to work on large data sets...almost SAS simple:
>>>>
>>>>     filter.csv( pipe('bzcat infile.csv.bz2'), "results.csv", "date",
>>>>                 function(d) colMeans(d))
>>>> or
>>>>     filter.csv( pipe('bzcat infile.csv.bz2'), pipe("bzip2 -c > results.csv.bz2"),
>>>>                 "date", function(d) d[ !duplicated(d$date), ] )  ## drop
>>>> observations whose date already appeared earlier
>>>>
>>>> or some reasonable variant of this.
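>>>>
>>>> (a bare-bones sketch of what I mean, assuming in.csv and out.csv are
>>>> unopened connections, the chunk column holds contiguous unquoted
>>>> values, and the input has a header line:)
>>>>
>>>>     filter.csv <- function(in.csv, out.csv, chunk, FUNprocess, n = 10000) {
>>>>         open(in.csv, "r"); open(out.csv, "w")
>>>>         hdr <- readLines(in.csv, n = 1)
>>>>         pending <- NULL                  ## rows of the unfinished last chunk
>>>>         first <- TRUE
>>>>         repeat {
>>>>             fresh <- readLines(in.csv, n = n)
>>>>             done <- length(fresh) < n    ## short read: end of input
>>>>             d <- NULL
>>>>             if (length(fresh) > 0) {
>>>>                 tc <- textConnection(c(hdr, fresh))
>>>>                 d <- read.csv(tc); close(tc)
>>>>             }
>>>>             d <- rbind(pending, d)
>>>>             if (is.null(d) || nrow(d) == 0) break
>>>>             last.val <- d[[chunk]][nrow(d)]
>>>>             keep <- if (done) rep(TRUE, nrow(d)) else d[[chunk]] != last.val
>>>>             vals <- d[[chunk]][keep]     ## process complete chunks, in file order
>>>>             for (g in split(d[keep, , drop = FALSE],
>>>>                             factor(vals, levels = unique(vals)))) {
>>>>                 write.table(FUNprocess(g), out.csv, sep = ",",
>>>>                             col.names = first, row.names = FALSE)
>>>>                 first <- FALSE
>>>>             }
>>>>             pending <- d[!keep, , drop = FALSE]
>>>>             if (done) break
>>>>         }
>>>>         close(in.csv); close(out.csv)
>>>>     }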
>>>>
>>>> now that I can have many small chunks, it would be nice if this were
>>>> threadsafe, so
>>>>
>>>>     mcfilter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>>>>
>>>> with 'library(parallel)' could feed FUNprocess to multiple cores, and
>>>> make sure that the processes don't step on one another.  (why did R
>>>> not use a dot after "mc" for parallel lapply?)  presumably, to keep it
>>>> simple, mcfilter.csv would keep a counter of read chunks and block
>>>> writing chunks until the next sequential chunk arrives in order.
>>>>
>>>> just a suggestion...
>>>>
>>>> /iaw
>>>>
>>>> ----
>>>> Ivo Welch (ivo.we...@gmail.com)
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Gregory (Greg) L. Snow Ph.D.
>>> 538...@gmail.com
>>>
>>>
>>
>>
>>
>>
>


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
