Duncan Murdoch <[EMAIL PROTECTED]> wrote:

> For example, if you want to read lines 1000 through 1100, you'd do it
> like this:
>
>     lines <- readLines("foo.txt", 1100)[1000:1100]
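Before the measurements, one variant worth noting: readLines() on an already-open connection continues from the current position, so the skipped lines need not be kept in the result. A minimal sketch (it still reads every line, it just throws the early ones away):

    con <- file("foo.txt", "r")
    invisible(readLines(con, 999))   # lines 1..999, read and discarded
    lines <- readLines(con, 101)     # lines 1000..1100
    close(con)

This saves memory rather than time, and as the numbers below show, time is the real problem.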
I created a dataset thus:

    # file foo.awk:
    BEGIN {
        s = "01"
        for (i = 2; i <= 41; i++) s = sprintf("%s %02d", s, i)
        n = (27 * 1024 * 1024) / (length(s) + 1)
        for (i = 1; i <= n; i++) print s
        exit 0
    }

    # shell command:
    mawk -f foo.awk /dev/null >BIG

That is, each record contains 41 2-digit integers, and the number of records was chosen so that the total size was approximately 27 megabytes. The number of records turns out to be 230,175.

    > system.time(v <- readLines("BIG"))
    [1] 7.75 0.17 8.13 0.00 0.00

    # With BIG already in the file system cache...
    > system.time(v <- readLines("BIG", 200000)[199001:200000])
    [1] 11.73 0.16 12.27 0.00 0.00

What's the importance of this? First, experiments I shall not weary you with showed that the time to read N lines grows faster than N. Second, if you want to select the _last_ thousand lines, you have to read _all_ of them into memory.

For real efficiency here, what's wanted is a variant of readLines() where n is an index vector (a vector of non-negative integers, a vector of non-positive integers, or a vector of logicals) saying which lines should be kept. The function that would need changing is do_readLines() in src/main/connections.c; unfortunately, I don't understand R internals well enough to do it myself (yet).

As a matter of fact, that _still_ wouldn't yield real efficiency, because every character would still have to be read by the modified readLines(), and it reads characters using Rconn_fgetc(), which is what gives readLines() its power and utility but certainly doesn't give it wings. (One of the fundamental laws of efficient I/O library design is to base it on block- or line-at-a-time transfers, not character-at-a-time.)

The AWK program

    NR <= 199000 { next }
    { print }
    NR == 200000 { exit }

extracts lines 199001:200000 in just 0.76 seconds, about 15 times faster. A C program to the same effect, using fgets(), took 0.39 seconds, or about 30 times faster than R. There are two fairly clear sources of overhead in the R code:

(1) The overhead of reading characters one at a time through Rconn_fgetc() instead of a block or line at a time. mawk doesn't use fgets() for reading, and _does_ have the overhead of repeatedly checking a regular expression to determine where the end of the line is, which it is sensible enough to fast-path.

(2) The overhead of allocating, filling in, and keeping a whole lot of memory which is of no use whatever in computing the final result. mawk is actually fairly careful here, and only keeps one line at a time in the program shown above. Let's change it:

    NR <= 199000 { next }
    { a[NR] = $0 }
    NR == 200000 { exit }
    END { for (i in a) print a[i] }

That takes the time from 0.76 seconds to 0.80 seconds.

The simplest thing that could possibly work would be to add a function skipLines(con, n) which simply reads and discards n lines; a rough R-level sketch appears below. The selected lines would then be fed to scan() through a textConnection:

    result <- scan(textConnection(lines), list( .... ))

Let's time that step on the thousand lines selected above:

    > system.time(m <- scan(textConnection(v), integer(41)))
    Read 41000 items
    [1] 0.99 0.00 1.01 0.00 0.00

One whole second to read 41,000 numbers on a 500 MHz machine?

    > vv <- rep(v, 240)

Is there any possibility of storing the data in (platform) binary form? Binary connections (R-data.pdf, section 6.5 "Binary connections") can be used to read binary-encoded data. I wrote a little C program to save out the 230,175 records of 41 integers each in native binary form.
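Here is the rough R-level sketch of skipLines() promised above. It is only an approximation of the idea, not the C-level change to do_readLines() that would actually make it fast, and the chunk size of 10,000 lines is an arbitrary choice:

    skipLines <- function(con, n, chunk = 10000) {
        ## Read and discard n lines from an open connection,
        ## holding at most `chunk` lines in memory at a time.
        while (n > 0) {
            got <- length(readLines(con, min(n, chunk)))
            if (got == 0) break          # hit end of file early
            n <- n - got
        }
        invisible(NULL)
    }

    ## Usage: skip the first 199000 lines of BIG, keep the next 1000.
    con <- file("BIG", "r")
    skipLines(con, 199000)
    v <- readLines(con, 1000)
    close(con)

Every character still goes through Rconn_fgetc(), so this addresses the memory overhead (2) but not the speed overhead (1).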
Reading the binary file BIN back in R:

    > system.time(m <- readBin("BIN", integer(), n=230175*41, size=4))
    [1] 0.57 0.52 1.11 0.00 0.00
    > system.time(m <- matrix(data=m, ncol=41, byrow=TRUE))
    [1] 2.55 0.34 2.95 0.00 0.00

Remember, this doesn't read a *sample* of the data, it reads *all* the data. It is so much faster than the alternatives in R that it just isn't funny: trying scan() on the file took nearly 10 minutes before I killed it the other day, so readBin() is a thousand times faster than a simple scan() call on this particular data set.

There has *got* to be a way of either generating or saving the data in binary form using only "approved" Windows tools; heck, it can probably be done using VBA. (A writeBin() sketch follows below.)

By the way, I've read most of the .pdf files I could find on the CRAN site, but haven't noticed any description of the R save-file format. Where should I have looked? (Yes, I know about src/main/saveload.c; I was hoping for some documentation, with maybe some diagrams.)
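As for generating the binary file from within R itself (not an answer to the "approved Windows tools" question, just a sketch, reusing the matrix m and the file name BIN from the experiment above):

    ## R stores matrices column-major, so transpose first: that writes
    ## one 41-integer record per original row, in the layout that
    ## readBin() read back above.
    con <- file("BIN", "wb")
    writeBin(as.integer(t(m)), con, size = 4)
    close(con)

writeBin() is the mirror image of readBin(), so a one-off conversion pass in R would pay for itself as soon as the data is read more than once.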