Murray Jorgensen <[EMAIL PROTECTED]> wrote:

> "Large" for my purposes means "more than I really want to read into
> memory" which in turn means "takes more than 30s". I'm at home now and
> the file isn't, so I'm not sure if the file is large or not.

I repeat my earlier observation: the AMOUNT OF DATA is easily handled by a typical desktop machine these days. The problem is not the amount of data. The problem is HOW LONG IT TAKES TO READ. I made several attempts to read the test file I created yesterday, and each time gave up impatiently after 5+ minutes elapsed time. I tried again today (see below) and went away to have a cup of tea &c; it took nearly 10 minutes that time and still hadn't finished. 'mawk' read _and processed_ the same file happily in under 30 seconds.
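For reference, the test file in question can be recreated with a small routine along the lines of the "unoptimised C program" timed below. This is only a sketch: write_test_file is a made-up name, and the 250000-row, 41-column shape is taken from the transcript further down.

```c
/* Sketch of a generator for a 250000-line, 41-column test file,
   in the spirit of the "unoptimised C program" timed below.
   write_test_file is a hypothetical name; it streams the integers
   0, 1, 2, ... row by row, ncols numbers per line. */
#include <stdio.h>

void write_test_file(FILE *fp, long nrows, int ncols)
{
    long n = 0;
    for (long r = 0; r < nrows; r++) {
        for (int c = 0; c < ncols; c++)
            fprintf(fp, c == 0 ? "%ld" : " %ld", n++);
        fputc('\n', fp);
    }
}

/* A driver would simply call write_test_file(stdout, 250000L, 41)
   and be run as:  a.out > m.txt  */
```

The resulting m.txt is the roughly 41-million-byte file that mawk handles in under 30 seconds.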
One quite serious alternative would be to write a little C function to read the file into an array, and call that from R.

> system.time(m <- matrix(1:(41*250000), nrow=250000, ncol=41))
[1] 3.28 0.79 4.28 0.00 0.00
> system.time(save(m, file="m.bin"))
[1] 8.44 0.54 9.08 0.00 0.00
> m <- NULL
> system.time(load("m.bin"))
[1] 11.25 0.19 11.51 0.00 0.00
> length(m)
[1] 10250000

The binary file m.bin is 41 million bytes. This little transcript shows that a data set of this size can be comfortably read from disc in under 12 seconds, on the same machine where scan() took about 50 times as long before I killed it. So yet another alternative is to write a little program that converts the data file to R binary format, and then just read the whole thing in. I think readers will agree that 12 seconds on a 500MHz machine counts as "takes less than 30s".

> It's just that R is so good in reading in initial segments of a file
> that I can't believe that it can't be effective in reading more general
> (pre-specified) subsets.

R is *good* at it, it's just not *quick*. Trying to select a subset in scan() or read.table() wouldn't help all that much, because it would still have to *scan* the data to determine what to skip.

Two more times. An unoptimised C program writing 0:(41*250000-1) as a file of 41-number lines:

f% time a.out >m.txt
13.0u 1.0s 0:14 94% 0+0k 0+0io 0pf+0w

> system.time(m <- read.table("m.txt", header=FALSE))
^C
Timing stopped at: 552.01 15.48 584.51 0 0

To my eyes, src/main/scan.c shows no signs of having been tuned for speed. The goals appear to have been power (the R scan() function has LOTS of options) and correctness, which are perfectly good goals, and the speed of scan() and read.table() with modest data sizes is quite good enough. The huge ratio (>552)/(<30) for R/mawk does suggest that there may be room for some serious improvement in scan(), possibly by means of some extra hints about total size, possibly by creating a fast path through the code.
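The "little C function" alternative above can be sketched as follows. The pointer-only signature is chosen so that it could be called from R via the .C() interface; read_mat is a hypothetical name, and the R wrapper shown in the trailing comment is an untested sketch, not working code.

```c
/* Sketch of a "little C function" that reads a whitespace-separated
   numeric file into a preallocated array.  Every argument is a pointer,
   matching R's .C() calling convention; read_mat is a made-up name. */
#include <stdio.h>

void read_mat(char **fname, int *n, double *x)
{
    FILE *fp = fopen(fname[0], "r");
    if (fp == NULL) {       /* signal failure to the caller */
        *n = -1;
        return;
    }
    int i = 0;
    while (i < *n && fscanf(fp, "%lf", &x[i]) == 1)
        i++;
    *n = i;                 /* number of values actually read */
    fclose(fp);
}

/* From R, something like this (hypothetical, untested) would fill a
   matrix after dyn.load()ing the compiled code:
     nm <- as.integer(41 * 250000)
     z  <- .C("read_mat", "m.txt", n = nm, x = double(nm))
     m  <- matrix(z$x, nrow = 250000, ncol = 41)                    */
```

A plain fscanf loop like this does no option processing at all, which is exactly why it should run closer to mawk's speed than to scan()'s.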
Of course the big point is that however long scan() takes to read the data set, it only has to be done once. Leave R running overnight, and in the morning save the dataset out as an R binary file using save(). Then you'll be able to load it again quickly.

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help