On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote:

> On Fri, 2007-02-02 at 18:40 +0100, juli g. pausas wrote:
> > Hi all,
> > I have a large file (1.8 GB) with 900,000 lines that I would like to
> > read. Each line is a string of characters. Specifically, I would like
> > to randomly select 3000 lines. For smaller files, what I do is:
> >
> > trs <- scan("myfile", what = character(), sep = "\n")
> > trs <- trs[sample(length(trs), 3000)]
> >
> > And this works OK; however, my computer seems unable to handle the
> > 1.8 GB file.
> > I thought of an alternative way that does not require reading the
> > whole file:
> >
> > sel <- sample(1:900000, 3000)
> > for (i in 1:3000) {
> >   un <- scan("myfile", what = character(), sep = "\n",
> >              skip = sel[i], nlines = 1)
> >   write(un, "myfile_short", append = TRUE)
> > }
> >
> > This works on my computer; however, it is extremely slow; it reads one
> > line at a time. It has been running for 25 hours and I think it has
> > done less than half of the file (yes, probably I do not have a very
> > good computer and I'm working under Windows ...).
> > So my question is: do you know any other, faster way to do this?
> > Thanks in advance
> >
> > Juli
>
> Juli,
>
> I don't have a file to test this on, so caveat emptor.
>
> The problem with the approach above is that you are re-reading the
> source file, once per line, or 3000 times. In addition, each read is
> likely going through half the file on average to locate the randomly
> selected line. Thus, the reality is that you are probably reading on
> the order of:
>
> > 3000 * 450000
> [1] 1.35e+09
>
> lines in the file, which of course is going to be quite slow.
>
> In addition, you are also writing to the target file 3000 times.
>
> The basic premise of the approach below is that you are in effect
> creating a sequential file cache in an R object: read a large chunk of
> the source file into the cache, randomly select rows within the cache,
> and then write out the selected rows.
>
> Thus, if you can read 100,000 rows at once, you would have 9 reads of
> the source file and 9 writes of the target file.
>
> The key thing here is to ensure that the offsets within the cache and
> the corresponding random row values are properly aligned.
>
> Here's the code:
>
> # Generate the random values
> sel <- sample(1:900000, 3000)
>
> # Set up a sequence for the cache chunks,
> # presuming you can read 100,000 rows at once
> Cuts <- seq(0, 900000, 100000)
>
> # Loop over the length of Cuts, less 1
> for (i in seq(along = Cuts[-1]))
> {
>   # Get a 100,000-row chunk, skipping rows
>   # as appropriate for each subsequent chunk
>   Chunk <- scan("myfile", what = character(), sep = "\n",
>                 skip = Cuts[i], nlines = 100000)
>
>   # Set up a row sequence for the current chunk
>   Rows <- (Cuts[i] + 1):(Cuts[i + 1])
>
>   # Are any of the random values in the current chunk?
>   Chunk.Sel <- sel[which(sel %in% Rows)]
>
>   # If so, get them
>   if (length(Chunk.Sel) > 0)
>   {
>     Write.Rows <- Chunk[sel - Cuts[i]]
Quick typo correction: the last line above should be:

  Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]

so that only the random rows falling within the current chunk are
pulled from the cache, using chunk-relative indices.

>     # Now write them out
>     write(Write.Rows, "myfile_short", append = TRUE)
>   }
> }
>
> As noted, I have not tested this, so there may yet be additional ways
> to save time with file seeks, etc.

If that's the only error in the code... :-)

Marc

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
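A possible refinement along the lines of the "save time with file seeks"
remark above (untested, and assuming the same 900,000-line "myfile" and
100,000-line chunks): if scan() is given an already open connection
rather than a file name, it reads from the connection's current
position, so each chunk picks up where the previous one stopped and the
source file is traversed only once, with no skip = argument needed. A
minimal sketch:

sel <- sample(1:900000, 3000)
Cuts <- seq(0, 900000, 100000)

# Open the source file once; scan() will continue from the
# current position on each pass through the loop
con <- file("myfile", open = "r")

for (i in seq(along = Cuts[-1]))
{
  # Next 100,000 lines; no skip = needed because the open
  # connection remembers where the last read stopped
  Chunk <- scan(con, what = character(), sep = "\n",
                nlines = 100000, quiet = TRUE)

  # Row numbers of the source file covered by this chunk
  Rows <- (Cuts[i] + 1):(Cuts[i + 1])

  # Which of the random row numbers fall in this chunk?
  Chunk.Sel <- sel[sel %in% Rows]

  # If any, pull them out by chunk-relative index and append
  if (length(Chunk.Sel) > 0)
  {
    Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]
    write(Write.Rows, "myfile_short", append = TRUE)
  }
}

close(con)

The chunk bookkeeping is the same as in the code above; the only change
is that the repeated skipping over already-read lines is avoided.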