On 05/06/2013 10:32 AM, ivo welch wrote:
I just tested read.csv with open file connections... it still works. It can read one line at a time (without col.names and with nrows). Nice. It loses its type memory across reinvocations, but this is usually not a problem if one reads a few thousand lines inside a buffer function. This sort of function is useful only for big files anyway.
Surely you know the types of the columns? If you specify them in advance, read.table and relatives will be much faster.
Duncan Murdoch
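
For instance (the file and its three column types are invented here), passing colClasses skips the type-guessing pass over the data:

    ## hypothetical file with a character, an integer, and a numeric column
    df <- read.csv("big.csv",
                   colClasses = c("character", "integer", "numeric"),
                   nrows = 500000)   # a mild over-estimate of nrows also helps memory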
Is it possible to block write.csv across multiple threads in mclapply? Or hook a single-threaded function into the mclapply collector?
/iaw
On Wed, Jun 5, 2013 at 5:06 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
On 13-06-05 12:08 AM, ivo welch wrote:
Thx, Greg. Chunk boundaries have meanings: the reader needs to stop and buffer one line when it has crossed to the first line beyond the boundary. It is also a problem that read.csv no longer works with files; readLines then has to do the processing. (Starting read.csv over and over again with a different skip is probably not a good idea for big files.) It needs a lot of smarts to intelligently append to a data frame. (If the input is a data matrix, this is much simpler, of course.)
As Greg said, you don't need skip: just don't close the file, and continue reading from where you stopped on the previous run. If you don't know the size of blocks in advance, this is harder, but it's not really all that hard. The logic would be something like this:
open the file
read the first block including the header
while not done:
    if you have a complete block with some extra lines at the end,
        extract them and save them, then process the complete block.
        Initialize the next block with the extra lines.
    if the block is incomplete, read some more and append it
        to what you saved.
end while
close the file
Duncan Murdoch
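
In R, that loop might look roughly like the sketch below (the file name, the 10,000-row slab size, the grouping column "date", and the process() function are all placeholders):

    con  <- file("big.csv", open = "r")
    slab <- read.csv(con, nrows = 10000)            # first slab, with header
    cols <- names(slab)
    carry <- slab[0, ]                              # empty frame, same columns
    repeat {
      block <- rbind(carry, slab)
      last  <- block$date[nrow(block)]              # group that may continue
      done  <- block$date != last                   # rows in complete groups
      if (any(done)) process(block[done, ])         # placeholder processing
      carry <- block[!done, ]                       # save the partial group
      slab  <- tryCatch(read.csv(con, nrows = 10000, header = FALSE,
                                 col.names = cols),
                        error = function(e) NULL)   # read.csv errors at EOF
      if (is.null(slab) || nrow(slab) == 0) break
    }
    if (nrow(carry) > 0) process(carry)             # the final group
    close(con)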
Exporting large input files to SQLite databases makes sense when the same file is used again and again, but probably not when it is a staged one-time process; the disk consumption is too big. The writer could become quasi-threaded by writing to multiple temp files and then concatenating them at the end, but this would be a nasty solution... nothing like the parsimonious elegance and generality that a built-in R filter function could provide.
----
Ivo Welch (ivo.we...@gmail.com)
On Tue, Jun 4, 2013 at 2:56 PM, Greg Snow <538...@gmail.com> wrote:
Some possibilities using existing tools:

If you create a file connection and open it before reading from it (or writing to it), then functions like read.table and read.csv (and write.table for a writable connection) will read from the connection, but not close and reset it. This means that you could open two files, one for reading and one for writing, then read in a chunk, process it, write it out, then read in the next chunk, etc.
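
A sketch of that read-process-write loop (the file names, the chunk size, and the toy per-row transformation are invented; because both connections stay open, each call picks up exactly where the previous one stopped):

    inp <- file("big.csv",     open = "r")
    out <- file("results.csv", open = "w")
    first <- TRUE
    repeat {
      chunk <- tryCatch(
        if (first) read.csv(inp, nrows = 10000)     # header on first chunk
        else       read.csv(inp, nrows = 10000, header = FALSE,
                            col.names = cols),
        error = function(e) NULL)                   # read.csv errors at EOF
      if (is.null(chunk) || nrow(chunk) == 0) break
      if (first) cols <- names(chunk)
      res <- transform(chunk, value2 = value^2)     # stand-in: assumes a "value" column
      write.table(res, out, sep = ",", row.names = FALSE,
                  col.names = first)                # write header only once
      first <- FALSE
    }
    close(inp); close(out)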
Another option would be to read the data into an ff object (ff package) or into a database (SQLite for one), which could have the data accessed in chunks, possibly even in parallel.
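
For the database route, DBI can hand the rows back in chunks. A sketch, assuming the csv has already been imported into a table named "big" with a "date" column (RSQLite's dbWriteTable with append = TRUE can do that import piecewise):

    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), "big.db")
    res <- dbSendQuery(con, "SELECT * FROM big ORDER BY date")
    while (!dbHasCompleted(res)) {
      chunk <- dbFetch(res, n = 10000)   # next 10000 rows as a data frame
      ## ... process chunk ...
    }
    dbClearResult(res)
    dbDisconnect(con)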
On Mon, Jun 3, 2013 at 4:59 PM, ivo welch <ivo.we...@anderson.ucla.edu> wrote:
Dear R wizards,

I presume this is a common problem, so I thought I would ask whether this solution already exists and, if not, suggest it. Say a user has a data set of x GB, where x is very big, say greater than RAM. Fortunately, data often come sequentially in groups, and there is a need to process contiguous subsets of them and write the results to a new file. read.csv and write.csv only work on FULL data sets. read.csv has the ability to skip n lines and read only m lines, but this can cross the subsets. The useful solution here would be a "filter" function that understands about chunks:
    filter.csv <- function(in.csv, out.csv, chunk, FUNprocess) ...
A chunk would not exactly be a factor, because normal R factors can be non-sequential in the data frame. filter.csv makes it very simple to work on large data sets... almost SAS simple:
    filter.csv(pipe("bzcat infile.csv.bz2"), "results.csv", "date",
               function(d) colMeans(d))
or
    filter.csv(pipe("bzcat infile.csv.bz2"),
               pipe("bzip2 -c > results.csv.bz2"), "date",
               function(d) d[!duplicated(d$date), ])
    ## filter out observations that repeat a date seen earlier
or some reasonable variant of this.
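
A minimal sketch of what such a filter could look like (the slab size is an arbitrary default, and it relies on the chunk column arriving in contiguous runs, as described above):

    filter.csv <- function(in.csv, out.csv, chunk, FUNprocess, nrows = 10000) {
      inp <- if (inherits(in.csv, "connection")) in.csv else file(in.csv)
      out <- if (inherits(out.csv, "connection")) out.csv else file(out.csv)
      open(inp, "r"); open(out, "w")
      slab  <- read.csv(inp, nrows = nrows)         # first slab, with header
      cols  <- names(slab)
      carry <- slab[0, ]                            # trailing incomplete group
      first <- TRUE
      put <- function(d) {                          # process and write one group
        res <- as.data.frame(as.list(FUNprocess(d)))
        write.table(res, out, sep = ",", row.names = FALSE, col.names = first)
        first <<- FALSE
      }
      repeat {
        slab <- rbind(carry, slab)
        done <- slab[[chunk]] != slab[[chunk]][nrow(slab)]
        grp  <- slab[done, ][[chunk]]
        for (g in split(slab[done, ], factor(grp, levels = unique(grp))))
          put(g)                                    # complete groups, in order
        carry <- slab[!done, ]
        slab  <- tryCatch(read.csv(inp, nrows = nrows, header = FALSE,
                                   col.names = cols),
                          error = function(e) NULL) # read.csv errors at EOF
        if (is.null(slab) || nrow(slab) == 0) break
      }
      if (nrow(carry) > 0) put(carry)               # the last group
      close(inp); close(out)
    }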
Now that I can have many small chunks, it would be nice if this were threadsafe, so

    mcfilter.csv <- function(in.csv, out.csv, chunk, FUNprocess) ...

with library(parallel) could feed multiple cores the FUNprocess and make sure that the processes don't step on one another. (Why did R not use a dot after "mc" for parallel lapply?) Presumably, to keep it simple, mcfilter.csv would keep a counter of read chunks and block write chunks until the next sequential chunk in order arrives.
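
That ordering can come almost for free: mclapply is documented to return its results in input order, so a plain sequential write after the parallel step already emits chunks in order, with no explicit blocking. A toy sketch (the built-in airquality data stands in for the chunked reader's output):

    library(parallel)
    chunks     <- split(airquality, airquality$Month)  # one data frame per group
    FUNprocess <- function(d) colMeans(d, na.rm = TRUE)
    ## mclapply forks, so mc.cores > 1 works on unix-alikes only;
    ## on Windows use mc.cores = 1 or parLapply instead.
    results <- mclapply(chunks, FUNprocess, mc.cores = 2)
    out <- file("results.csv", open = "w")
    first <- TRUE
    for (res in results) {                             # already in input order
      write.table(as.data.frame(as.list(res)), out, sep = ",",
                  row.names = FALSE, col.names = first)
      first <- FALSE
    }
    close(out)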
Just a suggestion...
/iaw
----
Ivo Welch (ivo.we...@gmail.com)
--
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.