On Fri, 2007-02-02 at 12:42 -0600, Marc Schwartz wrote:
> On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote:
> > Juli,
> >
> > I don't have a file to test this on, so caveat emptor.
> >
> > The problem with the approach above is that you are re-reading the
> > source file once per line, or 3000 times. In addition, each read is
> > likely going through half the file on average to locate the randomly
> > selected line. Thus, the reality is that you are probably reading on
> > the order of:
> >
> > > 3000 * 450000
> > [1] 1.35e+09
> >
> > lines in the file, which of course is going to be quite slow.
> >
> > In addition, you are also writing to the target file 3000 times.
> >
> > The basic premise of the approach below is that you are in effect
> > creating a sequential file cache in an R object: you read large
> > chunks of the source file into the cache, randomly select rows
> > within the cache, and then write out the selected rows.
> >
> > Thus, if you can read 100,000 rows at once, you would have 9 reads of
> > the source file and 9 writes of the target file.
> >
> > The key thing here is to ensure that the offsets within the cache and
> > the corresponding random row values are properly set.
> >
> > Here's the code:
> >
> > # Generate the random values
> > sel <- sample(1:900000, 3000)
> >
> > # Set up a sequence for the cache chunks
> > # Presume you can read 100,000 rows at once
> > Cuts <- seq(0, 900000, 100000)
> >
> > # Loop over the length of Cuts, less 1
> > for (i in seq(along = Cuts[-1]))
> > {
> >   # Get a 100,000 row chunk, skipping rows
> >   # as appropriate for each subsequent chunk
> >   Chunk <- scan("myfile", what = character(), sep = "\n",
> >                 skip = Cuts[i], nlines = 100000)
> >
> >   # Set up a row sequence for the current
> >   # chunk
> >   Rows <- (Cuts[i] + 1):(Cuts[i + 1])
> >
> >   # Are any of the random values in the
> >   # current chunk?
> >   Chunk.Sel <- sel[which(sel %in% Rows)]
> >
> >   # If so, get them
> >   if (length(Chunk.Sel) > 0)
> >   {
> >     Write.Rows <- Chunk[sel - Cuts[i]]
>
> Quick typo correction:
>
> The last line above should be:
>
>   Write.Rows <- Chunk[sel - Cuts[i], ]
>
> > # Now write them out
> > write(Write.Rows, "myfile_short", append = TRUE)
> >   }
> > }

OK, I knew it was too good to be true...

One more correction on that same line:

  Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]

Note that Chunk is a character vector returned by scan(), not a data
frame, so it takes vector indexing without the trailing comma.

For clarity, here is the full set of code:

  # Generate the random values
  sel <- sample(900000, 3000)

  # Set up a sequence for the cache chunks
  # Presume you can read 100,000 rows at once
  Cuts <- seq(0, 900000, 100000)

  # Loop over the length of Cuts, less 1
  for (i in seq(along = Cuts[-1]))
  {
    # Get a 100,000 row chunk, skipping rows
    # as appropriate for each subsequent chunk
    Chunk <- scan("myfile", what = character(), sep = "\n",
                  skip = Cuts[i], nlines = 100000)

    # Set up a row sequence for the current
    # chunk
    Rows <- (Cuts[i] + 1):(Cuts[i + 1])

    # Are any of the random values in the
    # current chunk?
    Chunk.Sel <- sel[which(sel %in% Rows)]

    # If so, get them
    if (length(Chunk.Sel) > 0)
    {
      # Chunk is a vector, so no trailing comma
      Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]

      # Now write them out
      write(Write.Rows, "myfile_short", append = TRUE)
    }
  }

Regards,

Marc

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
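P.S. For anyone who wants to reuse the chunked-cache idea, here is a
self-contained sketch that wraps the loop above in a function. The
function name and arguments (sample_lines, chunk_size, etc.) are
illustrative, not from the thread, and it assumes n_total is an exact
multiple of chunk_size, as in the example above.

```r
## Hypothetical wrapper around the chunked-cache sampling loop.
## Reads `infile` in chunks of `chunk_size` lines, keeps the lines
## whose numbers were randomly selected, and appends them to `outfile`.
sample_lines <- function(infile, outfile, n_total, n_sample,
                         chunk_size = 100000) {
  sel  <- sample(n_total, n_sample)     # random line numbers to keep
  cuts <- seq(0, n_total, chunk_size)   # chunk boundaries

  for (i in seq_along(cuts[-1])) {
    ## Read one chunk of lines into the in-memory "cache"
    chunk <- scan(infile, what = character(), sep = "\n",
                  skip = cuts[i], nlines = chunk_size, quiet = TRUE)

    ## Absolute line numbers covered by this chunk
    rows <- (cuts[i] + 1):(cuts[i + 1])

    ## Selected lines falling in this chunk, in file order
    hits <- sort(sel[sel %in% rows])

    if (length(hits) > 0) {
      ## Convert absolute line numbers to chunk-relative offsets;
      ## note: appends, so outfile should not already exist
      write(chunk[hits - cuts[i]], outfile, append = TRUE)
    }
  }
  invisible(sort(sel))
}
```

Since the chunks are processed in order and the hits within each chunk
are sorted, the output preserves the original file order of the
sampled lines.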