On 05/06/2013 10:32 AM, ivo welch wrote:
I just tested read.csv with open file connections... it still works. It can read one line at a time (without col.names and with nrows). Nice. It loses its type memory across reinvocations, but this is usually not a problem if one reads a few thousand lines inside a buffer function. This sort of function is useful only for big files anyway.
Surely you know the types of the columns? If you specify them in advance, read.table and relatives will be much faster.
Duncan Murdoch
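
For instance (the file and its three column types are invented here), passing colClasses skips the type-guessing pass over the data:

    ## hypothetical file with a character, an integer, and a numeric column
    df <- read.csv("big.csv",
                   colClasses = c("character", "integer", "numeric"),
                   nrows = 500000)   # a mild over-estimate of nrows also helps memory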
Is it possible to block write.csv across multiple threads in mclapply? Or hook a single-threaded function into the mclapply collector?
/iaw
On Wed, Jun 5, 2013 at 5:06 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
On 13-06-05 12:08 AM, ivo welch wrote:
Thx, Greg. Chunk boundaries have meanings: the reader needs to stop and buffer one line when it has crossed to the first line beyond the boundary. It is also a problem that read.csv no longer works with files; readLines then has to do the processing. (Starting read.csv over and over again with a different skip is probably not a good idea for big files.) It needs a lot of smarts to intelligently append to a data frame. (If the input is a data matrix, this is much simpler, of course.)
As Greg said, you don't need skip: just don't close the file, and continue reading from where you stopped on the previous run. If you don't know the size of blocks in advance, this is harder, but it's not really all that hard. The logic would be something like this:
open the file
read the first block including the header
while not done:
    if you have a complete block with some extra lines at the end,
        extract them and save them, then process the complete block.
        Initialize the next block with the extra lines.
    if the block is incomplete, read some more and append it
        to what you saved.
end while
close the file
Duncan Murdoch
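
In R, that loop might look roughly like the sketch below (the file name, the 10,000-row slab size, the grouping column "date", and the process() function are all placeholders):

    con  <- file("big.csv", open = "r")
    slab <- read.csv(con, nrows = 10000)            # first slab, with header
    cols <- names(slab)
    carry <- slab[0, ]                              # empty frame, same columns
    repeat {
      block <- rbind(carry, slab)
      last  <- block$date[nrow(block)]              # group that may continue
      done  <- block$date != last                   # rows in complete groups
      if (any(done)) process(block[done, ])         # placeholder processing
      carry <- block[!done, ]                       # save the partial group
      slab  <- tryCatch(read.csv(con, nrows = 10000, header = FALSE,
                                 col.names = cols),
                        error = function(e) NULL)   # read.csv errors at EOF
      if (is.null(slab) || nrow(slab) == 0) break
    }
    if (nrow(carry) > 0) process(carry)             # the final group
    close(con)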
Exporting large input files to SQLite databases makes sense when the same file is used again and again, but probably not when it is a staged one-time process; the disk consumption is too big. The writer could become quasi-threaded by writing to multiple temp files and then concatenating them at the end, but this would be a nasty solution... nothing like the parsimonious elegance and generality that a built-in R filter function could provide.
----
Ivo Welch (ivo.we...@gmail.com)
On Tue, Jun 4, 2013 at 2:56 PM, Greg Snow <538...@gmail.com> wrote:
Some possibilities using existing tools:

If you create a file connection and open it before reading from it (or writing to it), then functions like read.table and read.csv (and write.table for a writable connection) will read from the connection, but not close and reset it. This means that you could open two files, one for reading and one for writing, then read in a chunk, process it, write it out, then read in the next chunk, etc.
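
A sketch of that read-process-write loop (the file names, the chunk size, and the toy per-row transformation are invented; because both connections stay open, each call picks up exactly where the previous one stopped):

    inp <- file("big.csv",     open = "r")
    out <- file("results.csv", open = "w")
    first <- TRUE
    repeat {
      chunk <- tryCatch(
        if (first) read.csv(inp, nrows = 10000)     # header on first chunk
        else       read.csv(inp, nrows = 10000, header = FALSE,
                            col.names = cols),
        error = function(e) NULL)                   # read.csv errors at EOF
      if (is.null(chunk) || nrow(chunk) == 0) break
      if (first) cols <- names(chunk)
      res <- transform(chunk, value2 = value^2)     # stand-in: assumes a "value" column
      write.table(res, out, sep = ",", row.names = FALSE,
                  col.names = first)                # write header only once
      first <- FALSE
    }
    close(inp); close(out)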
Another option would be to read the data into an ff object (ff package) or into a database (SQLite for one), which could have the data accessed in chunks, possibly even in parallel.
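
For the database route, DBI can hand the rows back in chunks. A sketch, assuming the csv has already been imported into a table named "big" with a "date" column (RSQLite's dbWriteTable with append = TRUE can do that import piecewise):

    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), "big.db")
    res <- dbSendQuery(con, "SELECT * FROM big ORDER BY date")
    while (!dbHasCompleted(res)) {
      chunk <- dbFetch(res, n = 10000)   # next 10000 rows as a data frame
      ## ... process chunk ...
    }
    dbClearResult(res)
    dbDisconnect(con)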
On Mon, Jun 3, 2013 at 4:59 PM, ivo welch <ivo.we...@anderson.ucla.edu> wrote:
Dear R wizards,

I presume this is a common problem, so I thought I would ask whether this solution already exists and, if not, suggest it. Say a user has a data set of x GB, where x is very big, say greater than RAM. Fortunately, data often come sequentially in groups, and there is a need to process contiguous subsets of them and write the results to a new file. read.csv and write.csv only work on FULL data sets. read.csv has the ability to skip n lines and read only m lines, but this can cross the subsets. The useful solution here would be a "filter" function that understands about chunks:
    filter.csv <- function(in.csv, out.csv, chunk, FUNprocess) ...
A chunk would not exactly be a factor, because normal R factors can be non-sequential in the data frame. filter.csv makes it very simple to work on large data sets... almost SAS simple:
    filter.csv(pipe("bzcat infile.csv.bz2"), "results.csv", "date",
               function(d) colMeans(d))
or
    filter.csv(pipe("bzcat infile.csv.bz2"),
               pipe("bzip2 -c > results.csv.bz2"), "date",
               function(d) d[!duplicated(d$date), ])
    ## filter out observations that repeat a date seen earlier
or some reasonable variant of this.
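
A minimal sketch of what such a filter could look like (the slab size is an arbitrary default, and it relies on the chunk column arriving in contiguous runs, as described above):

    filter.csv <- function(in.csv, out.csv, chunk, FUNprocess, nrows = 10000) {
      inp <- if (inherits(in.csv, "connection")) in.csv else file(in.csv)
      out <- if (inherits(out.csv, "connection")) out.csv else file(out.csv)
      open(inp, "r"); open(out, "w")
      slab  <- read.csv(inp, nrows = nrows)         # first slab, with header
      cols  <- names(slab)
      carry <- slab[0, ]                            # trailing incomplete group
      first <- TRUE
      put <- function(d) {                          # process and write one group
        res <- as.data.frame(as.list(FUNprocess(d)))
        write.table(res, out, sep = ",", row.names = FALSE, col.names = first)
        first <<- FALSE
      }
      repeat {
        slab <- rbind(carry, slab)
        done <- slab[[chunk]] != slab[[chunk]][nrow(slab)]
        grp  <- slab[done, ][[chunk]]
        for (g in split(slab[done, ], factor(grp, levels = unique(grp))))
          put(g)                                    # complete groups, in order
        carry <- slab[!done, ]
        slab  <- tryCatch(read.csv(inp, nrows = nrows, header = FALSE,
                                   col.names = cols),
                          error = function(e) NULL) # read.csv errors at EOF
        if (is.null(slab) || nrow(slab) == 0) break
      }
      if (nrow(carry) > 0) put(carry)               # the last group
      close(inp); close(out)
    }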
Now that I can have many small chunks, it would be nice if this were threadsafe, so

    mcfilter.csv <- function(in.csv, out.csv, chunk, FUNprocess) ...

with library(parallel) could feed multiple cores the FUNprocess and make sure that the processes don't step on one another. (Why did R not use a dot after "mc" for parallel lapply?) Presumably, to keep it simple, mcfilter.csv would keep a counter of read chunks and block write chunks until the next sequential chunk in order arrives.
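
That ordering can come almost for free: mclapply is documented to return its results in input order, so a plain sequential write after the parallel step already emits chunks in order, with no explicit blocking. A toy sketch (the built-in airquality data stands in for the chunked reader's output):

    library(parallel)
    chunks     <- split(airquality, airquality$Month)  # one data frame per group
    FUNprocess <- function(d) colMeans(d, na.rm = TRUE)
    ## mclapply forks, so mc.cores > 1 works on unix-alikes only;
    ## on Windows use mc.cores = 1 or parLapply instead.
    results <- mclapply(chunks, FUNprocess, mc.cores = 2)
    out <- file("results.csv", open = "w")
    first <- TRUE
    for (res in results) {                             # already in input order
      write.table(as.data.frame(as.list(res)), out, sep = ",",
                  row.names = FALSE, col.names = first)
      first <- FALSE
    }
    close(out)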
Just a suggestion...
/iaw
----
Ivo Welch (ivo.we...@gmail.com)
--
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.